Nombre de mots avec Apache Spark et python (Mac OS X)

Aperçu

Comme première étape de la vérification d'Apache Spark. Comme le savent bien tous ceux qui ont de l'expérience avec Hadoop, c'est celui qui compte les mêmes mots dans le fichier. L'environnement est Mac OSX, mais je me demande s'il en est presque de même pour Linux. Le code complet est ici.

Installation

$ brew install apache-spark

Confirmation d'installation

OK si Spark-Shell fonctionne et que `` scala> '' est affiché

$ /usr/local/bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:47 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:51 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:51 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/07 16:44:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/04/07 16:44:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.

scala>

Essayez le comptage de mots des fichiers locaux avec python

Ceci a été écrit en référence à la description sur le Site officiel.

Structure du répertoire

Veuillez préparer comme suit.

$ tree
.
├── input
│   └── data #Texte à lire
└── wordcount.py #Script d'exécution

1 directory, 4 files

Ecrire le code

Ici, nous utilisons python. Vous pouvez écrire en scala ou en Java. Je suis bon, alors allons-y. Comme ça.

`wordcount.py`


#!/usr/bin/env python
# coding:utf-8

from pyspark import SparkContext

def execute(sc, src, dest):
    '''
Effectuer le comptage des mots
    '''
    #Lire le fichier src
    text_file = sc.textFile(src)
    counts = text_file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    #Exporter les résultats
    counts.saveAsTextFile(dest)

if __name__ == '__main__':
    sc = SparkContext('local', 'WordCount')
    src  = './input'
    dest = './output'
    execute(sc, src, dest)

Lire la préparation du fichier

De manière appropriée. Par exemple, comme ça.

`./input/data`


aaa
bbb
ccc
aaa
bbbb
ccc
aaa

Courir

La commande suivante.

$ which pyspark
/usr/local/bin/pyspark

#Courir
$ pyspark ./wordcount.py

Lorsque vous l'exécutez, un journal coulera. (Comme le streaming Hadoop)

Vérification

`./output/part-00000`


(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)

Il a été correctement compté.

prime

Notez que si le répertoire de destination de sortie (./output) a déjà été généré, le processus suivant échouera. C'est une bonne idée d'attacher un shell comme celui ci-dessous au même répertoire.

`exec.sh`


#!/bin/bash

rm -fR ./output
/usr/local/bin/pyspark ./wordcount.py

echo ">>>>> result"
cat ./output/*

$ sh exec.sh
・ ・ ・
>>>>> result
(u'aaa', 3)
(u'bbbb', 1)
(u'bbb', 1)
(u'ccc', 2)