Wang Zheng Yuan

What is dumbo and how to ssh into dumbo for running jobs?

Dumbo is the stand alone Hadoop cluster running on the Hortonworks Data platform. It can be used to perform various mapreduce jobs for big data analytics.

To access dumbo: I recommend using the Mac OS in the class room. The PC's provided have a Mac OS option.

Please follow the instructions on this link:

https://wikis.nyu.edu/display/NYUHPC/Clusters+-+Dumbo#Clusters-Dumbo-Dumbo-Hadoopcluster

or

MAC OS users Only

Make sure to follow the instructions for Web UI access using the above like

Icon

cd /Users/NetID
mkdir .ssh
cd .ssh
touch config
vi config

(for mac users only) Copy and paste the below into /.ssh/config/

Icon

Host hpctunnel

HostName hpc.nyu.edu

ForwardX11 yes

LocalForward 8025 dumbo.es.its.nyu.edu:22

User NetID

DynamicForward 8118

Host dumbo

HostName localhost

Port 8025

ForwardX11 yes

User NetID

What is Hadoop?

Hadoop is an open-source software framework for storing and processing big data in a distributed/parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. The core Hadoop consists of HDFS and Hadoop's implementation of MapReduce.

What is HDFS?

HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

What is Map-Reduce?

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce job splits a large data set into independent chunks and organizes them into key, value pairs for parallel processing. A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. The mapping and reducing functions receive not just values, but (key, value) pairs.This parallel processing improves the speed and reliability of the cluster, returning solutions more quickly and with greater reliability.

Every MapReduce job consists of at-least three parts:

The driver
The Mapper
The Reducer

Mapping Phase

The first phase of a MapReduce program is called mapping. A list of data elements are provided, one at a time, to a function called the Mapper, which transforms each element individually to an output data element.

The Map function divides the input into ranges by the InputFormat and creates a map task for each range in the input. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reduce.

Mapping creates a new output list by applying a function to individual elements of an input list.

Reducing Phase

Reducing let's you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.

The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys and combine them to solve the problem.

Reducing a list iterates over the input values to produce an aggregate value as output.

MapReduce Data Flow

What are the components of the dumbo Cluster @NYU and what can they be used for?

Lets see the UIs for a better understanding:

Ambari Manager: http://babar.es.its.nyu.edu:8080/

http://demo.gethue.com/#

Commands for HDFS & MapReduce:

Transferring Data from workstation to dumbo :

Icon

my_workstation$ scp my_file dumbo:

my_workstation$ ssh dumbo

HDFS COMMANDs

TO UPLOAD DATA TO HDFS

Icon

hadoop fs -put <filename_in_lfs> <hdfs_name>

or

hadoop fs -copyFromLocal <filename_in_lfs> <hdfs_name>

or

hdfs dfs -put <filename_in_lfs> <hdfs_name>

TO GET DATA FROM HDFS

Icon

hadoop fs -get <hdfs_name> <filename_in_lfs>

or

hadoop fs -copyToLocal <hdfs_name> <filename_in_lfs>

TO CHECK HDFS FOR YOUR FILE

Icon

hadoop fs -ls

MAPREDUCE COMMANDS

TO COMPILE JAVA FILES

Icon

javac -cp $(yarn classpath) my_code.java

or

javac -classpath /share/apps/examples/Tutorial1/hadoop-core-1.2.1.jar *.java

TO MAKE THE JAR FILE

Icon

jar cvf <jarfilename>.jar *.class

TO TRIGGER THE JOB

Icon

hadoop jar <jarfilename>.jar <DriverClassName> <ip_file_in_HDFS> <op_dir_name>

TO CHECK RUNNING JOB

Icon

hadoop job -list

TO KILL THE JOB

Icon

hadoop job -kill <job_id>

Example Map-Reduce job:

Word Count: The objective here is to count the number of occurrences of each word by using key-value pairs.

Step 1:

ssh into dumbo

Step 2:

Move to

Icon

cd /share/apps/examples/Tutorial1/example1

It includes 4 files

Example.txt ------ Input file

SumReducer.java ------ This is the reducer

WordMapper.java ------ This is the mapper

WordCount.java ------- This is the driver

WordCount.jar - Complied jar file used to run the mapreduce job

Step 3:

Copy example1 folder to /home/user/example1

Icon

cp -r example1 /home/netid/

Step 3:

Place the example.txt file on to hdfs

Icon

hadoop fs -put example.txt example.txt

Step 4:

Run the mapreduce job using WordCount.jar

Icon

hadoop jar example.jar wordcount example.txt wordcountoutput

Step 5:

Check output by accessing HDFS directories

Icon

hadoop fs -get wordcountoutput

2. Standard Deviation : The objective is to find the standard deviation of the length of the words.

Step 1:

Move to

Icon

cd /share/apps/examples/Tutorial1/example2

example2.txt - Input file

StandardDeviation.jar - compiled jar file

Step 2:

copy example2 folder to /home/user/example2

Icon

cp -r example2 /home/netid/

Step 3:

Place the example2.txt file on to hdfs

Icon

Hadoop fs -put example2.txt example2.txt

Step 4:

Run the mapreduce job using StandardDeviation.jar

Icon

hadoop jar StandardDeviation.jar wordstandarddeviation example2.txt standarddeviationoutput

Step 5:

Check output by accessing HDFS directories

Icon

hadoop fs -get standarddeviationoutput

3. Sudoku Solver : The objective is to solve the given sudoku puzzle by using mapreduce

Step 1:

Move to

Icon

cd /share/apps/examples/Tutorial1/example3

Sudoku.dft - Puzzle

sudoku.jar - Compiled jar file

Step 2:

copy example3 folder to /home/user/

Icon

cp -r example3 /home/netid/

Step 3:

Run the mapreduce job

Icon

hadoop jar sudoku.jar sudoku.dft

(Note: Twitter Sentiment analysis can be done using this cluster. It requires the use of java for mapreduce and pig script for sorting the twitter users based on number of tweets. The next steps would be setting up oozie workflow and observe the analysis on Hue. To learn more about sentiment analysis please contact hpc@nyu.edu)

MapReduce Streaming

Even though the Hadoop framework is written in Java, programs for Hadoop need not to be coded in Java but can also be developed in other languages like Python, shell scripts or C++. Hadoop streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

Streaming runs a MapReduce Job from the command line. You specify a map script, a reduce script, an input and an output. Streaming takes care of the Map Reduce details such as making sure that your job is split into separate tasks, that the map tasks are executed where the data is storedt Hadoop Streaming works a little differently (your program is not presented with one record at a time, you have to iterate yourself)

- - - - -input – The data in hdfs that you want to process
      - -output – The directory in hdfs where you want to store the output
      - -map script – the program script command line or process that you want to use for your mapper
      - -reduce script – the program script command or process that you want to use for your reducer.
        
        Icon
        The streaming jar is located at /share/apps/examples/Tutorial1/hadoop-streaming-2.6.0.2.2.0.0-2041.jar on dumbo.
        
        Command used to run a mapreduce job using streaming:
        Icon
        
        hadoop jar /share/apps/examples/Tutorial1/hadoop-streaming-2.6.0.2.2.0.0-2041.jar -input streamingexample.txt -output streamout1 -mapper mapper.py -reducer reducer.py -numReduceTasks 2
        
        (Note: R can also be used to run mapreduce jobs. Please contact hpc@nyu.edu to learn more)

Wang Zheng Yuan

Thursday, June 18, 2015

Big Data Tutorial 1: MapReduce

No comments:

Post a Comment