Wednesday, October 11, 2017

MapReduce through Pydoop

Hi,

Today I am going to share my experience of running a MapReduce word-count program through Pydoop (Python) on Ubuntu 16.04.

I have been struggling with Hadoop for about a year. I am now comfortable running HDFS and MapReduce, but I have yet to understand them thoroughly.

I approached Pydoop as a complete beginner. I first tried to install it from the terminal on Ubuntu (my favourite OS) at the office, but in vain. I wasted almost the whole day there and could not figure out how to set up Pydoop, since the office system sits behind a proxy :(

I came home and tried the same on my own PC (an HP laptop). I first ran the following command in the terminal, but it did not help...

sudo pip install pydoop

The installation failed with an error about "HADOOP_HOME". As far as I could tell, Pydoop tries to locate the Hadoop installation directory through that variable and fails when it cannot see it. I found a solution on a forum (I promise, I forgot where it is....) and changed the command as below.

sudo -E pip install pydoop

That "E" did the trick: it tells sudo to keep my environment variables, so HADOOP_HOME stays visible to the installer. The output message confirmed that Pydoop was now properly installed on my laptop.
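Since the whole problem came down to HADOOP_HOME, it can help to check that variable from Python before running pip. The snippet below is only a minimal sketch: the function name hadoop_home_status is my own (hypothetical), and it only checks that the variable is set and points to an existing directory, not that a working Hadoop lives there.

```python
import os


def hadoop_home_status(environ=None):
    """Return a short message describing whether HADOOP_HOME looks usable.

    `environ` is any dict-like mapping; it defaults to the real environment.
    (Hypothetical helper, not part of Pydoop.)
    """
    env = os.environ if environ is None else environ
    home = env.get("HADOOP_HOME")
    if not home:
        return "HADOOP_HOME is not set"
    if not os.path.isdir(home):
        return "HADOOP_HOME points to a missing directory: " + home
    return "HADOOP_HOME looks fine: " + home


if __name__ == "__main__":
    print(hadoop_home_status())
```

Running this once as a normal user and once under sudo should show whether the variable survives the switch to root.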

Since I already had Hadoop properly installed in pseudo-distributed mode (single-node deployment), I did the following steps in the terminal.

  1. Created a small text file containing some arbitrary text
  2. Uploaded it to HDFS
  3. Created two folders, ../input and ../output, in the same HDFS directory
  4. Wrote a Pydoop script file for word counting
  5. Executed the script

My text file is as below:

I created the folders in one of the existing directories in HDFS.

hdfs dfs -ls /dir2
hdfs dfs -mkdir /dir2/input /dir2/output

hdfs dfs -copyFromLocal text.txt /dir2/input

Now I prepared the script file (script.py) with the following code.

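# Map step: called once per input line; the key is ignored.
def mapper(_, text, writer):
    for word in text.split():
        # Emit each word with a count of "1" (as a string).
        writer.emit(word, "1")

# Reduce step: called once per word with all of its emitted counts.
def reducer(word, icounts, writer):
    writer.emit(word, sum(map(int, icounts)))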
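
To see what these two functions do without touching Hadoop at all, they can be driven by hand on a couple of lines of text. This is only a rough sketch of the map, group and reduce flow: ListWriter and run_locally are hypothetical names of my own, and the real writer object that Pydoop passes in is not this class.

```python
from collections import defaultdict


def mapper(_, text, writer):
    for word in text.split():
        writer.emit(word, "1")


def reducer(word, icounts, writer):
    writer.emit(word, sum(map(int, icounts)))


class ListWriter:
    """Hypothetical stand-in for the writer: it just collects emitted pairs."""

    def __init__(self):
        self.pairs = []

    def emit(self, key, value):
        self.pairs.append((key, value))


def run_locally(lines):
    """Simulate the job in memory (hypothetical helper, not a Pydoop API)."""
    # Map phase: feed every line to the mapper.
    map_out = ListWriter()
    for offset, line in enumerate(lines):
        mapper(offset, line, map_out)

    # Shuffle phase: group all emitted values by word.
    groups = defaultdict(list)
    for word, value in map_out.pairs:
        groups[word].append(value)

    # Reduce phase: one reducer call per distinct word.
    red_out = ListWriter()
    for word in sorted(groups):
        reducer(word, groups[word], red_out)
    return dict(red_out.pairs)


if __name__ == "__main__":
    print(run_locally(["a b a", "b c"]))
```

The counts travel as the string "1" between the two steps, which is why the reducer converts them with int before summing.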


I went back to the terminal and executed the following command.

pydoop script script.py /dir2/input/text.txt /dir2/output/output_

I then checked the results in the output folder in HDFS.
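
For a closer look at the results, the output lines can be loaded into a Python dictionary. The sketch below assumes that each output line holds a word and its count separated by a tab, which is what I would expect from a plain-text job like this one but have not verified for every setup; parse_wordcount_output is a hypothetical helper name.

```python
def parse_wordcount_output(lines):
    """Turn 'word<TAB>count' lines into a {word: count} dictionary.

    Assumes tab-separated text output; blank lines are skipped.
    (Hypothetical helper, not part of Pydoop.)
    """
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so the count is everything after it.
        word, _, count = line.partition("\t")
        counts[word] = int(count)
    return counts


if __name__ == "__main__":
    sample = ["hello\t2\n", "world\t1\n"]
    print(parse_wordcount_output(sample))
```

The lines themselves could come from the part files that Hadoop writes under /dir2/output/output_, for example by copying them to the local disk and opening them as ordinary text files.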


BINGO!!! I am now a big data expert! Hey! Hey! Hey.