[Python] Easily install pyspark with conda

1. Background / Target

A memo on getting pyspark running with conda in a local environment, installing and running pyspark just like any other popular Python library.

The main situation I have in mind:

--I want an environment that works in a few steps, leaving detailed settings for later.
--It should be enough to run the sample code from online articles and reference books, and to write and develop functions against small-scale test data for the time being.
--Downloading Spark from the official site or a mirror, installing Java, and wiring up PATH and PYTHONPATH by hand is a hassle.
--**I don't want to write to or manage things like .bashrc**
--I want to manage the Spark and Java versions separately for each virtual environment
--I want to keep the Java used by the PC in general separate from the Java used by Spark
--I want to use Spark 2.4 and Spark 3.0 side by side (or install Spark separately for each project)
--But I don't want to use Docker or virtual machines

That is the kind of situation I am assuming.

2. Install Spark and Java with conda

Enter the target conda virtual environment and run:

--When using Apache Spark 3.0

conda install -c conda-forge pyspark=3.0 openjdk=8

--When using Apache Spark 2.4

# Note: Python 3.8 is not supported, so use a Python 3.7.x (or earlier) environment
conda install -c conda-forge pyspark=2.4 openjdk=8

This installs not only the pyspark library but also Apache Spark itself under the virtual environment. (Incidentally, pandas and pyarrow, which handle data exchange between pandas and Spark, are also pulled in.)

**At this point, you should be able to use pyspark for the time being.**

By the way, if you install openjdk with conda as in the example above, then when you enter the virtual environment with conda activate, JAVA_HOME is automatically set to point at the conda-installed JDK. (If you install from the conda-forge channel, the version is 1.8.0_192 (Azul Systems, Inc.) as of 2020-08-14.)
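
As a quick sanity check, you can confirm from Python that JAVA_HOME points inside the virtual environment and see which pyspark version is active. A small sketch:

python

import os
import pyspark

# JAVA_HOME is set automatically by `conda activate` when openjdk came from conda
print(os.environ.get("JAVA_HOME"))   # should point inside the virtual environment

# pyspark version installed in this environment (e.g. 3.0.0 or 2.4.6)
print(pyspark.__version__)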

Run conda activate <virtual environment name> and then launch pyspark on the CLI:

shell (conda environment: Spark 3.0)


$ pyspark                                           
Python 3.8.5 (default, Aug  5 2020, 08:36:46) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
20/08/14 22:00:15 WARN Utils: Your hostname, <***> resolves to a loopback address: 127.0.1.1; using 192.168.3.17 instead (on interface wlp3s0)
20/08/14 22:00:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/14 22:00:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Python version 3.8.5 (default, Aug  5 2020 08:36:46)
SparkSession available as 'spark'.
>>> 

shell (conda environment: Spark 2.4)


$ pyspark                      
Python 3.7.7 (default, May  7 2020, 21:25:33) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
20/08/14 22:16:09 WARN Utils: Your hostname, <***> resolves to a loopback address: 127.0.1.1; using 192.168.3.17 instead (on interface wlp3s0)
20/08/14 22:16:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/14 22:16:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Python version 3.7.7 (default, May  7 2020 21:25:33)
SparkSession available as 'spark'.
>>> 

As shown above, you can confirm that pyspark works with each version.

--Since all you did was prepare a conda virtual environment and run conda install, you can install and run pyspark in exactly the same way as any other ordinary Python library.
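
For example, the same thing should work from an ordinary Python script run with python inside the activated environment, not just from the pyspark REPL. A minimal sketch (the file name and data are just illustrative):

python

# sample_job.py -- run with `python sample_job.py` inside the activated conda environment
from pyspark.sql import SparkSession

# build a local SparkSession, just like in any other Python project
spark = SparkSession.builder.master("local[*]").appName("conda-pyspark-check").getOrCreate()

df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])
df.show()

# pandas and pyarrow are installed alongside, so conversion to pandas also works
print(df.toPandas())

spark.stop()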

Supplement (Java)

Incidentally, Java 11 is also supported from Spark 3 onward, but when I gave it a quick try I ran into a memory-related error and could not get it running properly. Looking at other write-ups, it seems that additional settings are required when using Java 11 (and that issue appears to be separate from the error above). So, in keeping with the "easy for the time being" spirit of the title, I think Java 8 is the safe choice even with Spark 3. (With the Spark 2 series, it does not work unless you use Java 8 anyway.)

Supplement (Windows)

Basic functionality works as described above, but by default a permissions-related error occurs when operating on database tables via spark.sql. As described [here](https://qiita.com/tomotagwork/items/1431f692387242f4a636#apache-spark%E3%81%AE%E3%82%A4%E3%83%B3%E3%82%B9%E3%83%88%E3%83%BC%E3%83%AB) and elsewhere, you additionally need to:

--Download winutils.exe for Hadoop 2.7 (available, for example, from the repository at https://github.com/cdarlint/winutils)
--Add its location to PATH
--Set the environment variable HADOOP_HOME to the download location
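
Alternatively, if you would rather not set the variables system-wide, a sketch like the following sets them from Python before Spark's JVM is launched (the C:\hadoop location is only an assumed example of where winutils.exe was unpacked, and I have not verified that this covers every case):

python

import os

# assumed layout: winutils.exe placed at C:\hadoop\bin\winutils.exe
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = r"C:\hadoop\bin;" + os.environ["PATH"]

# import and start Spark only after the environment variables are in place
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()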

3. If additional settings are required

At this point you should be able to run pyspark easily (**with the default settings**), but sometimes you need to adjust the configuration.

Customizing things in earnest goes beyond the "easy" scope of the title, but I will add the bare minimum here. (I will omit general topics that are not specific to conda, such as setting ordinary environment variables.)

Setting SPARK_HOME

I wrote above that the environment variable JAVA_HOME (required to run Spark) gets set automatically on the conda side, but the environment variable SPARK_HOME, which is often set when using Apache Spark, is not actually set. (Things work reasonably well without it, but it occasionally causes problems.)

You can point it at the installation location inside the virtual environment, but that location is a little hard to find. There are probably several ways to look it up, but the method I personally use is:

  1. Installing pyspark with conda also gives you spark-shell, the Scala Spark shell (it should be on your PATH as well), so run spark-shell on the CLI.
  2. Type sc.getConf.get("spark.home") and press Enter, then take the string that is printed and set it as the environment variable SPARK_HOME.

For example, it looks like this:

shell


$ spark-shell                
20/08/16 12:32:18 WARN Utils: Your hostname, <***> resolves to a loopback address: 127.0.1.1; using 192.168.3.17 instead (on interface wlp3s0)
20/08/16 12:32:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/08/16 12:32:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.3.17:4040
Spark context available as 'sc' (master = local[*], app id = local-1597548749526).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_192)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.getConf.get("spark.home")
res0: String = <Virtual environment PATH>/lib/python3.8/site-packages/pyspark

# ↑ The absolute path of the Spark installation location appears after `String = `
# Exit with `Ctrl-C` and set the environment variable as follows

$ export SPARK_HOME=<Virtual environment PATH>/lib/python3.8/site-packages/pyspark

Or

  1. Run spark-shell in the same way
  2. Since it is running locally, open http://localhost:4040 to bring up the Spark UI
  3. Note the path shown for spark.home on the Environment tab and set it as the environment variable SPARK_HOME

That is another way to do it. (This method is based on the one described here.) The resulting value is, for example, SPARK_HOME=/path/to/miniconda3-latest/envs/<virtual environment name>/lib/python3.7/site-packages/pyspark.

In short, the Scala spark-shell automatically sets an appropriate spark.home on its SparkSession while pyspark for some reason does not, so the idea is to look it up via spark-shell.
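
As an aside, since conda installs Spark inside the pyspark package directory, the same path should also be discoverable from Python itself, without launching spark-shell. A small sketch:

python

import os
import pyspark

# conda (and pip) place Spark itself inside the pyspark package directory
spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)
# e.g. <Virtual environment PATH>/lib/python3.8/site-packages/pyspark
# then, in the shell:  export SPARK_HOME=<that path>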

Location of configuration files

Spark downloaded from the official site comes with a conf directory, but the version installed automatically by conda does not seem to have one. However, if you create a conf directory in the right place and put configuration files in it, they do get read. (Verified with spark-defaults.conf.)

The configuration files go under $SPARK_HOME/conf/, using the SPARK_HOME path you found earlier. So, for example, by creating

$SPARK_HOME/conf/spark-defaults.conf

and filling it in, you can set the configuration.
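
For illustration, a spark-defaults.conf might look like the following (the property values are arbitrary examples, not recommendations):

spark-defaults.conf

# example settings; adjust values to your own workload
spark.driver.memory              4g
spark.sql.shuffle.partitions     64
spark.serializer                 org.apache.spark.serializer.KryoSerializer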

I haven't tried the other configuration files (e.g. conf/spark-env.sh), but I expect they work the same way if you create and fill them in. (Apologies if that turns out not to be the case.)

Personally I am not too fond of this, since modifying individual packages installed by conda makes the environment less portable and messier (the "easy" aspect of the title fades). (It is simply an option if you need it.)

Even so, I think the advantage of keeping the settings independent for each virtual environment remains.

Summary

I confirmed that pyspark can be installed and managed easily with conda, and that the configuration files can be customized if you want to go further.
