A wrapper for running Hadoop from Python

I want to do machine learning on Hadoop with Python

Hadoop jobs can be written in languages other than Java, and Python is strong in machine learning. So, at a Python Mokumokukai (a quiet co-working meetup) and over the New Year holidays, I made a wrapper called SkipJack that lets you run Python code on Hadoop.

The code is on GitHub (it is not available via pip): GitHub-SkipJack

The details are covered in three parts:

  1. Hadoop Streaming
  2. Scikit-learn
  3. SkipJack

Hadoop Streaming

Hadoop offers two ways to run a job:

- Run Java on the slave nodes (Hadoop MR Tutorial)
- Run an executable on the slave nodes, exchanging data via standard I/O (Hadoop Streaming Tutorial)

The second method, Hadoop Streaming, means Hadoop can be used from any language that can handle standard I/O.

So you don't have to use Mahout just because you're doing machine learning on Hadoop. You can implement it with your favorite library in Python, which is strong in machine learning.
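To make the standard I/O idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style. The function names `map_lines` and `reduce_lines` are made up for illustration and are not taken from SkipJack. In a real job, each function would be fed `sys.stdin` and its output printed line by line, in two separate scripts.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the input is sorted by key, which Hadoop Streaming
    does between the map phase and the reduce phase.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Locally, the shuffle step can be imitated by sorting the mapper output before passing it to the reducer, which is handy for testing without a cluster.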

For the general flow of setting up Hadoop, see "Introduction of Hadoop and MapReduce by Python".

Scikit-learn

The most widely used machine learning library for Python. To use it you also need NumPy and SciPy, and installing them with pip alone was not easy. So I downloaded the Python 3 series of Anaconda, which bundles these libraries from the start, and installed it on all slaves.
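Since every slave needs the same libraries, a quick check like the following can be run on each node. This is only a sketch: the helper name `missing_modules` is hypothetical, and it only checks that the modules can be found, not that their versions match.

```python
import importlib.util

# Import names of the libraries mentioned above (scikit-learn imports as "sklearn").
REQUIRED = ("numpy", "scipy", "sklearn")


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from `missing_modules()` suggests that the Python interpreter on that node can see all three libraries.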

SkipJack

With Hadoop Streaming, the hadoop command had to be typed by hand every time, which was tedious. So I made a wrapper that, just by running Python, performs the following loop:

**Decide the job to run → run Hadoop → evaluate the result → decide the next job to run → repeat until a stop condition is met**

If you implement the mapper, the reducer, and the result evaluation method, you do not need to write the routine work yourself.
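The loop described here can be sketched roughly as follows. This is not SkipJack's actual API: `run_loop`, `run_job`, and `evaluate` are hypothetical names, and `max_rounds` is an added safety limit.

```python
def run_loop(first_job, run_job, evaluate, max_rounds=10):
    """Run jobs until evaluate() returns None or max_rounds is reached.

    run_job(job) returns a result (in practice, a Hadoop run).
    evaluate(job, result) returns the next job, or None to stop.
    """
    history = []
    job = first_job
    for _ in range(max_rounds):
        result = run_job(job)
        history.append((job, result))
        job = evaluate(job, result)
        if job is None:
            break
    return history
```

For example, with `run_job=lambda n: n * n` and `evaluate=lambda n, r: n + 1 if r < 9 else None`, starting from job `1`, the loop runs three jobs and then stops. Passing the two steps in as functions keeps the loop itself independent of Hadoop, so it can be tried out locally.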

Internally it is simple: it just runs Hadoop commands (run a job, place files (put), read results (cat)).
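As a rough illustration of what such a wrapper has to assemble, the sketch below builds the three kinds of commands as argument lists. The helper names and the jar path are assumptions for illustration, not SkipJack's code. The location of the streaming jar depends on your installation.

```python
import os
import subprocess

# Assumption: placeholder path; the streaming jar location varies by installation.
STREAMING_JAR = "/path/to/hadoop-streaming.jar"


def put_cmd(local_path, hdfs_path):
    """Command that copies a local file into HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_path]


def cat_cmd(hdfs_path):
    """Command that prints the contents of an HDFS file."""
    return ["hadoop", "fs", "-cat", hdfs_path]


def streaming_cmd(mapper, reducer, input_path, output_path, jar=STREAMING_JAR):
    """Command that runs a Hadoop Streaming job with the given scripts."""
    return [
        "hadoop", "jar", jar,
        # -files is a generic option, so it goes before the streaming options.
        "-files", f"{mapper},{reducer}",
        # The shipped scripts are referred to by file name on the slaves.
        "-mapper", os.path.basename(mapper),
        "-reducer", os.path.basename(reducer),
        "-input", input_path,
        "-output", output_path,
    ]


def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
```

Building the commands as lists and running them in a separate `run` function makes the command construction easy to inspect without a cluster. The scripts passed as mapper and reducer are assumed to be executable, with a shebang line.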

Two samples are provided:

- WordCount plus a little extra
- Refinement using grid search
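To give a feel for how grid-search refinement fits the job loop, here is a generic coarse-to-fine search over one parameter. It is an illustrative stand-in, not the code of the SkipJack sample. The `score` function is a plain Python callable here, whereas in a Hadoop setting each evaluation would presumably be a job run.

```python
def refine_grid(score, low, high, points=5, rounds=4):
    """Coarse-to-fine 1-D grid search that maximizes score(x) on [low, high]."""
    lower_bound, upper_bound = low, high
    best = low
    for _ in range(rounds):
        step = (high - low) / (points - 1)
        grid = [low + i * step for i in range(points)]
        best = max(grid, key=score)
        # Narrow the interval to one grid step on either side of the best
        # point, without leaving the original bounds.
        low = max(lower_bound, best - step)
        high = min(upper_bound, best + step)
    return best
```

Each round evaluates a small grid, keeps the best point, and searches again on a narrower interval around it, which matches the "evaluate the result, then decide the next job" pattern.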
