[PYTHON] PySpark 1.5.2 + Elasticsearch 2.1.0 Installation procedure and execution

Introduction

- I want to access Elasticsearch from PySpark.

Environment

Spark installation

Omitted. Spark 1.6 was released today, but this article uses 1.5.2.

Elasticsearch + hadoop download

- As of January 6, 2016, Elasticsearch 2.1.0 requires elasticsearch-hadoop-2.2.0-beta1.

Just download it from the official page and unpack it.

$ wget http://download.elastic.co/hadoop/elasticsearch-hadoop-2.2.0-beta1.zip
$ unzip elasticsearch-hadoop-2.2.0-beta1.zip

Start pyspark + elasticsearch

$ /usr/local/share/spark/bin/pyspark --master local[4] --driver-class-path=elasticsearch-hadoop-2.2.0-beta1/dist/elasticsearch-spark_2.11-2.2.0-beta1.jar
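
The launch command above can also be assembled programmatically, which helps when the connector version changes. This is a minimal sketch, assuming the archive was unpacked into the current directory and keeps its jars under dist/ as the path above suggests. `find_es_jar` and `pyspark_command` are hypothetical helper names, not part of Spark or elasticsearch-hadoop.

```python
import glob
import os


def find_es_jar(unpacked_dir, pattern="elasticsearch-spark_*.jar"):
    """Return the first jar (alphabetically) matching pattern under <unpacked_dir>/dist.

    Assumption: the unpacked archive keeps its jars in a dist/ subdirectory.
    """
    matches = sorted(glob.glob(os.path.join(unpacked_dir, "dist", pattern)))
    if not matches:
        raise FileNotFoundError(
            "no jar matching %s under %s" % (pattern, os.path.join(unpacked_dir, "dist"))
        )
    return matches[0]


def pyspark_command(spark_home, jar_path, cores=4):
    """Build the argument list that launches the pyspark shell with the jar."""
    return [
        os.path.join(spark_home, "bin", "pyspark"),
        "--master", "local[%d]" % cores,
        "--driver-class-path=" + jar_path,
    ]
```

If dist/ contains builds for several Scala versions, pass a narrower pattern (for example `elasticsearch-spark_2.11-*.jar`) so the helper picks the jar you intend.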

RDD generation

>>> conf = {"es.nodes" : "XXX.XXX.XXX.XXX:[port]", "es.resource" : "[index name]/[type]"}
>>> rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat","org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
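
The `conf` dictionary is handed to elasticsearch-hadoop as plain string settings. The sketch below builds it from its parts and optionally adds `es.query`, so that only matching documents are read. `build_es_conf` is a hypothetical helper, and the host, index, type, and query values are placeholders.

```python
import json


def build_es_conf(host, port, index, doc_type, query=None):
    """Build the settings dict passed as conf= to sc.newAPIHadoopRDD.

    Values are kept as strings, the same form used in the conf above.
    """
    conf = {
        "es.nodes": "%s:%d" % (host, port),
        "es.resource": "%s/%s" % (index, doc_type),
    }
    if query is not None:
        # es.query takes a query DSL document serialized as JSON
        conf["es.query"] = json.dumps(query)
    return conf


# Placeholder values for illustration only
conf = build_es_conf("localhost", 9200, "test", "docs",
                     query={"query": {"match": {"name": "aaa"}}})
```

The resulting `conf` can be passed to `sc.newAPIHadoopRDD` exactly as in the call above.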

Basic operations

>>> rdd.first()
>>> rdd.count()
>>> rdd.filter(lambda s: 'aaa' in s).count()
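
Each element read through `EsInputFormat` is a `(document id, fields)` pair, so `'aaa' in s` tests the pair itself rather than the values inside the document. Below is a sketch of a predicate on a field instead, with a plain list standing in for the RDD so it runs without a cluster. The sample records are made up, and the `name` field is borrowed from the Map / Reduce example.

```python
# Made-up records in the (document id, fields) shape of the RDD elements
records = [
    ("1", {"name": "aaa", "size": 10}),
    ("2", {"name": "bbb", "size": 20}),
    ("3", {"name": "aaa", "size": 30}),
]


def name_is(expected):
    """Return a predicate that checks the 'name' field of an (id, fields) pair."""
    return lambda item: item[1].get("name") == expected


# With a real RDD this would be: rdd.filter(name_is("aaa")).count()
matching = [item for item in records if name_is("aaa")(item)]
print(len(matching))
```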

Map / Reduce

# Count the number of records per name
counts = rdd.map(lambda item: item[1]["name"])
counts = counts.map(lambda name: (name, 1))
counts = counts.reduceByKey(lambda a, b: a + b)

# Run
>>> counts.collect()
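
To see what this pipeline computes, here is the same map and reduce-by-key logic on a plain Python list. It only illustrates the semantics, not how Spark executes the job. `reduce_by_key` is a hypothetical local stand-in, and the records are made up.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: merge values that share a key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Made-up records in the (document id, fields) shape of the RDD elements
records = [
    ("1", {"name": "alice"}),
    ("2", {"name": "bob"}),
    ("3", {"name": "alice"}),
]

names = [item[1]["name"] for item in records]       # first map: extract the name field
pairs = [(name, 1) for name in names]               # second map: pair each name with 1
counts = reduce_by_key(pairs, lambda a, b: a + b)   # reduce: sum the ones per name
print(sorted(counts.items()))
```

On the real RDD, `collect()` returns the same information as a list of `(name, count)` tuples, in no guaranteed order.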

Save to ES

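# saveToEs() belongs to the Scala API; from PySpark, write through EsOutputFormat instead
rdd.saveAsNewAPIHadoopFile(path="-", outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat", keyClass="org.apache.hadoop.io.NullWritable", valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable", conf={"es.nodes": "XXX.XXX.XXX.XXX:[port]", "es.resource": "test/docs"})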

Pitfalls

- Be careful with the network settings on the Elasticsearch side. If `network.publish_host` is set incorrectly, the connection is rejected and an error like the one below occurs.
- The article "Remote access about Spark and Elasticsearch" was helpful.

<snip>
File "/usr/local/share/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [GET] on [_nodes/http] failed; server[hostname/XXX.XXX.XXX.XXX:Ports] returned [400|Bad Request:]
<snip>
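
The failing request in the trace is `_nodes/http`, which the connector calls to look up the nodes' HTTP addresses. One way to check what Elasticsearch advertises is to fetch that endpoint yourself and read each node's `publish_address`. This is a sketch for Python 3, assuming the response has a `nodes` object whose entries carry an `http.publish_address` field. The sample response is made up, and `publish_addresses` and `fetch_publish_addresses` are hypothetical helpers.

```python
import json
from urllib.request import urlopen


def publish_addresses(nodes_http):
    """Map node name -> advertised HTTP address from a parsed _nodes/http response."""
    result = {}
    for node_id, node in nodes_http.get("nodes", {}).items():
        http = node.get("http", {})
        result[node.get("name", node_id)] = http.get("publish_address")
    return result


def fetch_publish_addresses(base_url):
    """Fetch <base_url>/_nodes/http and extract the publish addresses."""
    with urlopen(base_url.rstrip("/") + "/_nodes/http") as response:
        return publish_addresses(json.loads(response.read().decode("utf-8")))


# Made-up response, trimmed to the fields used here
sample = {
    "cluster_name": "elasticsearch",
    "nodes": {
        "abc123": {"name": "node-1", "http": {"publish_address": "192.168.0.10:9200"}},
    },
}
print(publish_addresses(sample))
```

If an advertised address is not reachable from the machine running Spark, `network.publish_host` is a likely culprit.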

References

Spark - Fun visualization: Encounter between elasticsearch and Spark Streaming

Elasticsearch
