Hadoop introduction and MapReduce with Python

A tutorial for people who want to get started with Hadoop but are not comfortable writing Java.

Hadoop is written in Java, so Mappers and Reducers are normally written in Java as well. However, Hadoop provides a feature called Hadoop Streaming, which exchanges data through Unix standard input and output. I used it to write the Mapper and Reducer in Python. Of course, Hadoop Streaming also lets you use languages other than Python.

This time, I built a pseudo-distributed environment on Ubuntu.

Ubuntu 12.04 + Hadoop 2.4.1

Hadoop environment construction

Install Java if you don't already have it

$ sudo apt-get update
$ sudo apt-get install openjdk-7-jdk

Download Hadoop

$ wget http://mirror.nexcess.net/apache/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
$ tar zxvf hadoop-2.4.1.tar.gz
$ mv hadoop-2.4.1 hadoop
$ rm hadoop-2.4.1.tar.gz
$ sudo mv hadoop /usr/local
$ cd /usr/local/hadoop
$ export PATH=$PATH:/usr/local/hadoop/bin # It is a good idea to add this line to .zshrc

Edit the following 4 files

$ vim etc/hadoop/core-site.xml

core-site.xml


...
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

$ vim etc/hadoop/hdfs-site.xml

hdfs-site.xml


...
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

$ mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ vim etc/hadoop/mapred-site.xml

mapred-site.xml


...
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

$ vim etc/hadoop/hadoop-env.sh

hadoop-env.sh


...
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
...
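
To double-check the XML files edited above, the properties can be read back with a few lines of Python. This is only a minimal sketch: the helper name read_properties is hypothetical, and the XML is an inline copy of the core-site.xml body from this article, so the snippet runs without a Hadoop installation.

```python
import xml.etree.ElementTree as ET

# Inline copy of the core-site.xml body shown in this article (an assumption
# made so the sketch is self-contained; normally you would read the file).
CORE_SITE = """
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>
"""


def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop config document."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext('name'): prop.findtext('value')
        for prop in root.findall('property')
    }


if __name__ == '__main__':
    print(read_properties(CORE_SITE))
```

The same function can be pointed at the contents of hdfs-site.xml or mapred-site.xml to confirm that each file holds the property you intended.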

If you don't have an SSH key yet, create one and add it to authorized_keys

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Finally, format the NameNode and start HDFS

$ hdfs namenode -format
$ sbin/start-dfs.sh

Writing Mapper / Reducer in Python

This time, we write WordCount, the standard Hadoop sample program, in Python.

First, prepare the input file

$ mkdir inputs
$ echo "a b b c c c" > inputs/input.txt

Mapper

$ vim mapper.py

mapper.py


#!/usr/bin/env python

import sys

# Emit "word<TAB>1" for every word read from standard input
for line in sys.stdin:
    for word in line.strip().split():
        print('{0}\t1'.format(word))

For the input above, the Mapper outputs the following

a    1
b    1
b    1
c    1
c    1
c    1
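
The mapping step can also be tried without Hadoop. The following is a minimal sketch that wraps the same logic in a function; the name map_words is hypothetical and is not part of mapper.py, and the input line is the sample from inputs/input.txt.

```python
def map_words(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield '{0}\t1'.format(word)


if __name__ == '__main__':
    # Same content as inputs/input.txt
    for record in map_words(['a b b c c c\n']):
        print(record)
```

Keeping the logic in a function like this makes it easy to check the records before running the script under Hadoop Streaming.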

Reducer

$ vim reducer.py

reducer.py


#!/usr/bin/env python

from collections import defaultdict
from operator import itemgetter
import sys

wordcount_dict = defaultdict(int)

# Sum the counts of each word read from standard input
for line in sys.stdin:
    word, count = line.strip().split('\t')
    wordcount_dict[word] += int(count)

# Output the totals in word order
for word, count in sorted(wordcount_dict.items(), key=itemgetter(0)):
    print('{0}\t{1}'.format(word, count))

The Reducer sums the counts emitted by the Mapper for each word and outputs the following

a    1
b    2
c    3
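
Hadoop sorts the Mapper output by key before handing it to the Reducer, so the whole flow can be imitated locally as map, sort, reduce. The sketch below is an illustration under that assumption, not the reducer.py used in this article: the function names are hypothetical, and it uses itertools.groupby instead of a dictionary to take advantage of the sorted input.

```python
from itertools import groupby


def map_words(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield '{0}\t1'.format(word)


def reduce_sorted(records):
    """Sum the counts of consecutive records sharing a key (input must be sorted)."""
    pairs = (record.split('\t') for record in records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield '{0}\t{1}'.format(word, sum(int(count) for _, count in group))


def run_locally(lines):
    # sorted() stands in for the sort Hadoop performs between map and reduce.
    return list(reduce_sorted(sorted(map_words(lines))))


if __name__ == '__main__':
    for record in run_locally(['a b b c c c\n']):
        print(record)
```

Because groupby only merges adjacent records, this version would give wrong totals on unsorted input, whereas the dictionary-based reducer.py does not depend on input order.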

Execution by Hadoop Streaming

Finally, run the Mapper and Reducer above on Hadoop

First, download the jar file for Hadoop Streaming

$ wget http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-streaming/2.4.1/hadoop-streaming-2.4.1.jar

Create a directory on HDFS and put the input file in it (be careful not to confuse local files with files on HDFS)

$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/vagrant
$ hdfs dfs -put inputs/input.txt /user/vagrant

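Run the job. Hadoop Streaming runs mapper.py and reducer.py as commands, so make sure both are executable (chmod +x mapper.py reducer.py). The result is stored in the specified output directory; a relative path such as outputs is resolved against your HDFS home directory, here /user/vagrant.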

$ hadoop jar hadoop-streaming-2.4.1.jar -mapper mapper.py -reducer reducer.py -input /user/vagrant/input.txt -output outputs
$ hdfs dfs -cat /user/vagrant/outputs/part-00000
a    1
b    2
c    3

