[PYTHON] PySpark 1.5.2 + Elasticsearch 2.1.0: Installation and Execution Procedure

Introduction

- I want to access Elasticsearch from pyspark.

Environment

Elasticsearch 2.1.0
Spark 1.5.2

Installing Spark

Omitted. Spark 1.6 was released today, but this article uses version 1.5.2.

Downloading Elasticsearch for Hadoop

As of January 6, 2016, Elasticsearch 2.1.0 requires elasticsearch-hadoop-2.2.0-beta1.

Just download it from the official page and unzip it.

$ wget http://download.elastic.co/hadoop/elasticsearch-hadoop-2.2.0-beta1.zip
$ unzip elasticsearch-hadoop-2.2.0-beta1.zip

Launching pyspark with Elasticsearch

/usr/local/share/spark/bin/pyspark --master local[4] --driver-class-path=elasticsearch-hadoop-2.2.0-beta1/dist/elasticsearch-spark_2.11-2.2.0-beta1.jar

Creating an RDD

>>> conf = {"es.nodes" : "XXX.XXX.XXX.XXX:[port]", "es.resource" : "[index name]/[type]"}
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat","org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
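The conf dictionary above can also be assembled by a small helper. This is only a minimal sketch: `build_es_conf` and its arguments are hypothetical names, not part of the original setup, while `es.nodes`, `es.resource` and `es.query` are elasticsearch-hadoop configuration keys. The optional `es.query` entry is assumed here to restrict which documents are read.

```python
def build_es_conf(host, port, index, doc_type, query=None):
    """Return a conf dict for sc.newAPIHadoopRDD (hypothetical helper)."""
    conf = {
        # Same "host:port" and "index/type" layout as the conf shown above
        "es.nodes": "{0}:{1}".format(host, port),
        "es.resource": "{0}/{1}".format(index, doc_type),
    }
    if query is not None:
        # Optional: a URI-style query such as "?q=name:aaa" is assumed here
        conf["es.query"] = query
    return conf


# Placeholder values for illustration only
conf = build_es_conf("127.0.0.1", 9200, "test", "docs", query="?q=name:aaa")
```

The resulting dictionary would be passed as `conf=conf`, exactly like the hand-written one.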

Basic operations

>>> rdd.first()
>>> rdd.count()
>>> rdd.filter(lambda s: 'aaa' in s).count()
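One thing to keep in mind with the filter: assuming each element of the RDD is an (id, document) pair, `'aaa' in s` is a tuple membership test, so it matches only when the id (or the whole document) equals 'aaa'. The sketch below uses a plain Python list as a stand-in for the RDD to show the difference from a field-based test; the sample records and the `name_contains` helper are hypothetical.

```python
# Hypothetical sample shaped like the (id, document) pairs read from Elasticsearch
records = [
    ("1", {"name": "aaa", "size": 10}),
    ("2", {"name": "bbb", "size": 20}),
    ("aaa", {"name": "ccc", "size": 30}),
]


def name_contains(record, text):
    """True when the document's "name" field contains text."""
    _doc_id, doc = record
    return text in doc.get("name", "")


# Tuple membership: true only when the id (or the whole document) equals "aaa"
by_membership = [r for r in records if "aaa" in r]
# Field test: looks inside the document instead
by_field = [r for r in records if name_contains(r, "aaa")]
```

With Spark, the same predicate could be used as `rdd.filter(lambda r: name_contains(r, 'aaa')).count()`.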

Map / Reduce

# Count the number of records per name
counts = rdd.map(lambda item: item[1]["name"])
counts = counts.map(lambda ip: (ip, 1))
counts = counts.reduceByKey(lambda a, b: a+b)

# Run
>>> counts.collect()
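To make the shape of the result easier to picture, here is a local equivalent of the three steps in plain Python, without Spark. It is only a sketch: the sample records are hypothetical and assume the same (id, document) pairs as the RDD.

```python
from collections import Counter

# Hypothetical sample shaped like the (id, document) pairs in the RDD
records = [
    ("1", {"name": "alice"}),
    ("2", {"name": "bob"}),
    ("3", {"name": "alice"}),
]

# Same three steps: extract the name, pair it with 1, sum the ones per key
names = [item[1]["name"] for item in records]
pairs = [(name, 1) for name in names]
local_counts = Counter()
for name, one in pairs:
    local_counts[name] += one

# Sorted list of (name, count) tuples, the same shape collect() would return
result = sorted(local_counts.items())
```

Note that `collect()` on the real RDD returns the tuples in no particular order, which is why the sketch sorts them.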

Saving to ES

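# saveToEs is part of the Scala API; from PySpark, write through EsOutputFormat
es_write_conf = {"es.nodes" : "XXX.XXX.XXX.XXX:[port]", "es.resource" : "test/docs"}
rdd.saveAsNewAPIHadoopFile(path="-", outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat", keyClass="org.apache.hadoop.io.NullWritable", valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=es_write_conf)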
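Before saving derived results such as the per-name counts, each record has to be turned into a document. The following is a minimal sketch, assuming (key, document) pairs like the ones read from Elasticsearch; `to_doc_pair` and the field names `name` and `count` are hypothetical.

```python
def to_doc_pair(count_item):
    """Turn a (name, count) tuple into a (key, document) pair (hypothetical shape)."""
    name, count = count_item
    return (name, {"name": name, "count": count})


# Hypothetical output of the per-name count
sample_counts = [("alice", 2), ("bob", 1)]
doc_pairs = [to_doc_pair(item) for item in sample_counts]
```

With Spark, `counts.map(to_doc_pair)` would give an RDD of such pairs.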

Where I got stuck

Pay attention to the network settings on the Elasticsearch side. If `network.publish_host` is wrong, the connection is refused and an error like the one below occurs.
The article "Remote access about Spark and Elasticsearch" was helpful.

<snip>
File "/usr/local/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [_nodes/http] failed; server[hostname/XXX.XXX.XXX.XXX:Ports] returned [400|Bad Request:]
<snip>
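The failing request in the log is a GET on `_nodes/http`, so one way to investigate is to look at what that endpoint reports as each node's published address. The sketch below only parses such a response. It assumes the JSON has a `nodes` object whose entries carry `http.publish_address`; the `publish_addresses` helper and the sample body are hypothetical.

```python
import json


def publish_addresses(nodes_info):
    """Map node name -> advertised HTTP address from a _nodes/http response."""
    result = {}
    for node_id, node in nodes_info.get("nodes", {}).items():
        http = node.get("http", {})
        # Fall back to the node id when the node has no name
        result[node.get("name", node_id)] = http.get("publish_address")
    return result


# Hypothetical response body, trimmed to the fields used here
sample = json.loads("""
{"cluster_name": "elasticsearch",
 "nodes": {"abc123": {"name": "node-1",
                      "http": {"publish_address": "10.0.0.5:9200"}}}}
""")
addresses = publish_addresses(sample)
```

If the address reported here is not reachable from the machine running Spark, that would be consistent with the `network.publish_host` problem described above.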

References

Spark

Elasticsearch

Apache Spark support
Remote access about Spark and Elasticsearch
AWS discovery problem. The network.publish_host setting is important.
elasticsearch.yml settings
Elasticsearch Unplugged - Networking changes in 2.0 (Japanese translation)