Investigate Java and Python data exchange with Apache Arrow

In this post, I use Apache Arrow to investigate exchanging data between Java and Python.

Introduction

Apache Arrow is expected to make it possible to exchange data between systems and languages without copying it in memory. To state the conclusion first: in this investigation, the data was exchanged as a byte array in Apache Arrow format. Nothing has to be written to local disk, but serializing in Java and deserializing in Python does copy the data in memory. I think the advantage of the method introduced here is that a fast, common format, Apache Arrow, makes it easy to link the two languages.

About exchanging Java and Python data with Apache Arrow

pyarrow, the Python library for Apache Arrow, has a module called jvm (pyarrow.jvm) for exchanging data between Java and Python. It implements a function that takes a Java-side Apache Arrow VectorSchemaRoot object and converts it to a Python RecordBatch.

def record_batch(jvm_vector_schema_root):
    """
    Construct a (Python) RecordBatch from a JVM VectorSchemaRoot
    Parameters
    ----------
    jvm_vector_schema_root : org.apache.arrow.vector.VectorSchemaRoot
    Returns
    -------
    record_batch: pyarrow.RecordBatch
    """

I therefore assumed that, combined with py4j, data could easily be exchanged between Java and Python.

Advance preparation

Get the jars needed to use py4j. The following pom fetches the Arrow and py4j jars locally.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>gtest</groupId>
  <artifactId>gtest</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>gtest</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.arrow</groupId>
      <artifactId>arrow-vector</artifactId>
      <version>0.12.0</version>
    </dependency>
    <dependency>
      <groupId>net.sf.py4j</groupId>
      <artifactId>py4j</artifactId>
      <version>0.10.8.1</version>
    </dependency>
  </dependencies>

</project>

The maven command is as follows.

$ mvn dependency:copy-dependencies -DoutputDirectory=./lib -f ./pom.xml

Try the jvm library

Create a simple Java class by referring to the py4j sample. It also implements a function that returns a VectorSchemaRoot so that the conversion can be checked through py4j.

import java.util.ArrayList;
import java.util.List;

import py4j.GatewayServer;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BitVector;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;

public class Py4jExample {

  public static void main( final String[] args ) {
    // Expose this object to Python through a py4j gateway.
    Py4jExample app = new Py4jExample();
    GatewayServer server = new GatewayServer( app );
    server.start();
  }

  /**
   * Creates a VectorSchemaRoot to experiment with:
   * one boolean column named "test" holding a single row.
   */
  public static VectorSchemaRoot create() {
    RootAllocator allocator = new RootAllocator( Integer.MAX_VALUE );
    BitVector vector = new BitVector( "test" , allocator );
    vector.allocateNew( 1 );
    vector.setSafe( 0 , 1 );  // row 0 = true
    vector.setValueCount( 1 );

    List<FieldVector> list = new ArrayList<>();
    list.add( vector );

    VectorSchemaRoot root = new VectorSchemaRoot( list );
    root.setRowCount( 1 );

    return root;
  }

}

Compile and start.

$ javac -cp "lib/*:." Py4jExample.java
$ java -cp "lib/*:." Py4jExample &

Now that the Java side can be called from Python, let's write the Python side.

from py4j.java_gateway import JavaGateway
import pyarrow.jvm as j

gateway = JavaGateway()

# Call the static Java method and convert the result to a Python RecordBatch.
root = gateway.jvm.Py4jExample.create()
rb = j.record_batch( root )

df = rb.to_pandas()

I expected that converting from Java to a Python RecordBatch with the jvm module would be this easy, but ...

$ python test.py
Segmentation fault

Execution gets as far as obtaining the RecordBatch, but it appears to crash when to_pandas() is called. I can't tell whether my environment was at fault, and the cause was not something I could work out quickly, so I gave up on this approach for now.
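One way to narrow down where such a crash happens is Python's standard faulthandler module, which dumps the Python traceback when the interpreter receives a fatal signal such as SIGSEGV. The following is a minimal sketch, assuming Python 3; the helper name enable_crash_traceback and the log file name are hypothetical and not part of the original investigation.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write a Python traceback to log_path if the process hits a fatal signal.

    The returned file object must stay open for as long as faulthandler is
    enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage at the top of test.py, before the conversion:
#   crash_log = enable_crash_traceback("crash.log")
#   rb = j.record_batch(root)
#   df = rb.to_pandas()   # if this segfaults, crash.log shows the Python stack
```

This only reports the Python-level frames that were active at the time of the crash; it does not explain what went wrong inside the native code.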

Data exchange using an Apache Arrow byte array

As an alternative, Apache Arrow also defines a binary file format, so the data can be written and read as a byte array in both Java and Python. Looking at the Python implementation of the jvm module as of 03/03/2019, it appears to copy data in memory when it takes a Java object and converts it to a Python object. Exchanging data as a byte array requires serialization and deserialization, but if the jvm module copies anyway, the only additional cost should be the serialization to a byte array on the Java side.

Java and Python data exchange in byte array

Add a function to your Java class that creates a byte array.

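  /**
   * Creates an Arrow file-format byte array to experiment with.
   * Additional imports needed at the top of the class file:
   *   java.io.ByteArrayOutputStream, java.io.IOException,
   *   java.nio.channels.Channels,
   *   org.apache.arrow.vector.FieldVector,
   *   org.apache.arrow.vector.ipc.ArrowFileWriter
   */
  public static byte[] createArrowFile() throws IOException {
    RootAllocator allocator = new RootAllocator( Integer.MAX_VALUE );
    BitVector vector = new BitVector( "test" , allocator );
    vector.allocateNew( 1 );
    vector.setSafe( 0 , 1 );  // row 0 = true
    vector.setValueCount( 1 );

    List<FieldVector> list = new ArrayList<>();
    list.add( vector );

    VectorSchemaRoot root = new VectorSchemaRoot( list );
    root.setRowCount( 1 );

    // Write the batch to an in-memory stream instead of a file on disk.
    // The second argument is the dictionary provider; none is used here.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ArrowFileWriter writer = new ArrowFileWriter( root, null, Channels.newChannel( out ) );
    writer.start();
    writer.writeBatch();
    writer.end();
    writer.close();
    return out.toByteArray();
  }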

Recompile this class and start it again. The Python side is modified to obtain the RecordBatch from the byte array.

from py4j.java_gateway import JavaGateway
import pyarrow as pa

gateway = JavaGateway()

# The Java byte[] arrives in Python as a bytes-like object;
# wrap it in a BufferReader so pyarrow can read it as an Arrow file.
data = gateway.jvm.Py4jExample.createArrowFile()
reader = pa.RecordBatchFileReader( pa.BufferReader( data ) )
rb = reader.get_record_batch(0)

df = rb.to_pandas()

print(df)

As expected, running this showed that the VectorSchemaRoot created in Java could be handled as a RecordBatch in Python.

$ python test2.py
   test
0  True
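Before handing bytes received from another process to pyarrow, it can be useful to check that they look like an Arrow file at all. The Arrow file format starts and ends with the magic string ARROW1, so a cheap sanity check needs only the standard library. This is a minimal sketch; the function name looks_like_arrow_file is hypothetical, and passing the check does not guarantee the content is valid.

```python
ARROW_MAGIC = b"ARROW1"


def looks_like_arrow_file(data):
    """Return True if data starts and ends with the Arrow file magic string."""
    raw = bytes(data)  # accepts bytes, bytearray, or memoryview
    return (
        len(raw) >= 2 * len(ARROW_MAGIC)
        and raw.startswith(ARROW_MAGIC)
        and raw.endswith(ARROW_MAGIC)
    )


# Hypothetical usage with the byte array returned through py4j:
#   data = gateway.jvm.Py4jExample.createArrowFile()
#   if not looks_like_arrow_file(data):
#       raise ValueError("payload is not in Arrow file format")
```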

Summary

The data exchange between Java and Python ended up as a brute-force implementation. Even so, Apache Arrow has the advantages that it is compatible with Python data-processing libraries such as pandas, and that you do not have to implement your own serialization and deserialization.

The purpose of this investigation was to make a file format we are currently developing, which is implemented in Java, also available from Python, which is commonly used for data processing. Next time, building on this investigation, I would like to write up an implementation and usage example of reading and writing that file format from Python via byte arrays.
