Recent HotSpot JVMs can vectorize iterative scalar operations and compile them into SIMD instructions (if you're wondering what SIMD is, see the second half of this article). When I tried it with code that the optimization actually applies to, I saw a **1.5 to 2.8x speed improvement**, so this may be an effective optimization for Java libraries that do a lot of arithmetic (without relying on GPGPU).
HotSpot's SIMD optimization is based on Superword-Level Parallelism (SLP); below I'll refer to the optimization itself as SLP. The original paper, reflecting its era, aimed to convert image and audio processing code written without SIMD in mind into SIMD instructions at the compiler or runtime layer, and the same idea works well for Java bytecode, which has no SIMD instructions of its own.
Consider the following simple loop:
```java
for(int i=0; i<1024; i++){
  a[i] = b[i] + c[i];
}
```
The JIT first applies [loop unrolling](https://ja.wikipedia.org/wiki/%E3%83%AB%E3%83%BC%E3%83%97%E5%B1%95%E9%96%8B) to this loop.
```java
for(int i=0; i<1024; i+=4){
  a[i] = b[i] + c[i];
  a[i+1] = b[i+1] + c[i+1];
  a[i+2] = b[i+2] + c[i+2];
  a[i+3] = b[i+3] + c[i+3];
}
```
This alone reduces the number of iterations, condition checks, and branch-prediction penalties, so a speedup can be expected (plenty of compilers and virtual machines perform this kind of loop unrolling on their own).
Now, the processing inside the loop can be regarded as the following vector operation.
$$
\begin{pmatrix} a_{i+0} \\ a_{i+1} \\ a_{i+2} \\ a_{i+3} \end{pmatrix} = \begin{pmatrix} b_{i+0} \\ b_{i+1} \\ b_{i+2} \\ b_{i+3} \end{pmatrix} + \begin{pmatrix} c_{i+0} \\ c_{i+1} \\ c_{i+2} \\ c_{i+3} \end{pmatrix}
$$
SLP actually replaces this vector operation with vector instructions, conceptually like the following pseudocode:
```
for(int i=0; i<1024; i+=4){
  load b[i:i+3] to xmm0
  load c[i:i+3] to xmm1
  add xmm0, xmm1 and store result to xmm0
  store xmm0 to a[i:i+3]
}
```
This conversion turns the repetition of a simple scalar operation into a single vector operation. Since practically all modern CPUs support SIMD instructions, it is an effective way to use SIMD without changing the source or bytecode.
The following source actually compares execution times with SLP enabled and disabled. The measurement is rough, so if you need solid evidence, please measure it yourself.
```java
import java.util.Arrays;

public class J8SIMD {
  private static final int SIZE = 1024 * 1024;
  private static final float[] a = new float[SIZE];
  private static final float[] b = new float[SIZE];
  static {
    Arrays.fill(a, 1);
    Arrays.fill(b, 2);
  }
  public static void vectorAdd(){
    for(int i=0; i<a.length; i++){
      a[i] += b[i];
    }
  }
  public static void main(String[] args){
    // warming up
    for(int i=0; i<100; i++) vectorAdd();
    // measure
    long t0 = System.currentTimeMillis();
    for(int i=0; i<10000; i++){
      vectorAdd();
    }
    long t1 = System.currentTimeMillis();
    System.out.printf("vectorAdd: %,d[msec]", t1 - t0);
  }
}
```
Running the above source on a Core i7-6600U (which supports MMX, SSE, SSE2, SSSE3, SSE4, and AVX) shows a significant difference (though I haven't confirmed that SIMD instructions were actually used, so I can't rule out that the difference comes from loop unrolling alone :rolling_eyes:).
I used `float` here, but both `int` and `double` also show a significant difference, so give them a try. Due to the data width, however, the difference for `double` is not as large as for `int` and `float`.
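For reference, a minimal sketch of a `double` variant to drop into the `J8SIMD` class above (the field and method names are mine; only the element type changes, the loop shape stays the same):

```java
// hypothetical double variant of vectorAdd — same loop shape, wider elements
private static final double[] da = new double[SIZE];
private static final double[] db = new double[SIZE];
static {
    Arrays.fill(da, 1);
    Arrays.fill(db, 2);
}

public static void vectorAddDouble() {
    for (int i = 0; i < da.length; i++) {
        da[i] += db[i];
    }
}
```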
SLP has been enabled by default since around Java 7u40; to disable it, use the `-XX:-UseSuperWord` option.
```
java8simd>java -XX:+UseSuperWord J8SIMD
vectorAdd: 4,624[msec]

java8simd>java -XX:-UseSuperWord J8SIMD
vectorAdd: 6,928[msec]

java8simd>java -version
java version "1.8.0_141"
Java(TM) SE Runtime Environment (build 1.8.0_141-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.141-b15, mixed mode)
```
The execution time (milliseconds) of each operation is as follows.
| Calculation | SLP enabled | SLP disabled | Speed ratio |
|---|---|---|---|
| `a[i] += b[i]` | 4,645 | 6,928 | 1.50 |
| `a[i] -= b[i]` | 4,647 | 7,029 | 1.52 |
| `a[i] *= b[i]` | 4,629 | 6,978 | 1.51 |
| `a[i] /= b[i]` | 4,357 | 12,151 | 2.79 |
| `a[i] %= b[i]` | - | - | - |
I gave up on `%=` because it didn't finish even after waiting several minutes (why is it so slow?). Addition, subtraction, and multiplication each showed about a 1.5x speedup. The speedup for division stands out at nearly 3x; does something change when it runs as a SIMD instruction, such as the work being handled by a dedicated FPU?
The following is the result of switching to an `int` version and performing bit operations. Bit-shift operations do not seem to be optimized by SLP.
| Calculation | SLP enabled | SLP disabled | Speed ratio |
|---|---|---|---|
| `a[i] <<= b[i]` | 18,293 | 19,249 | 1.05 |
| `a[i] >>= b[i]` | 18,550 | 19,431 | 1.05 |
| `a[i] >>>= b[i]` | 19,862 | 20,105 | 1.01 |
| `a[i] \|= b[i]` | 4,735 | 6,857 | 1.45 |
| `a[i] &= b[i]` | 4,756 | 6,799 | 1.43 |
| `a[i] ^= b[i]` | 4,697 | 7,532 | 1.60 |
Whether SLP kicked in shows up clearly in the numbers, so it seems you can judge it by comparing runs with and without the `-XX:-UseSuperWord` option.
With a debug build of OpenJDK, the `-XX:+PrintCompilation -XX:+TraceLoopOpts` options show how loops are recognized, and `-XX:+PrintAssembly` shows the assembly generated by HotSpot.
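The invocations would look roughly like this (a sketch: `-XX:+TraceLoopOpts` needs a debug build, and `-XX:+PrintAssembly` is a diagnostic option that additionally requires the hsdis disassembler plugin):

```
java8simd>java -XX:+PrintCompilation -XX:+TraceLoopOpts J8SIMD
java8simd>java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly J8SIMD
```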
Below are the SLP conditions for Java 8 / 9ea as listed in *Vectorization in HotSpot JVM* :rolling_eyes:. The triggering conditions and restrictions will likely be relaxed in future versions, so keep up with the latest information.
The loop must be a counted loop whose stride is constant, and the bounds must not change inside the loop:

```java
for(int i=0; i<100; i++){ }   // OK
for(int i=5; i<100; i++){ }   // OK
for(int i=t0; i<t1; i++){ }   // OK
for(int i=0; i<100; i+=4){ }  // OK
for(int i=0; i<100; i+=d){ }  // NG: stride is not constant
for(int i=0; i<100; i*=2){ }  // NG: stride is not additive
for(int i=0; i<t1; i++){      // NG: limit changes inside the loop
  ...
  if(...) t1 ++;
}
```
The loop counter must be an `int`, `short`, or `char` (`byte` and `long` are NG):

```java
for(byte i=0; i<100; i++){ }  // NG
for(char i=0; i<100; i++){ }  // OK
for(short i=0; i<100; i++){ } // OK
for(int i=0; i<100; i++){ }   // OK
for(long i=0; i<100; i++){ }  // NG
```
The loop body must not contain method calls. Direct array element access is fine, but collection access goes through getter/setter method calls, so it is NG:

```java
for(int i=0; i<100; i++){ // NG: calls a method
  ...
  f();
}
for(int i=0; i<100; i++){ // OK
  a[i] += b[i];
}
for(int i=0; i<100; i++){ // NG: get/set are method calls
  a.set(i, a.get(i) + b.get(i));
}
```
The body must consist of simple operations on array elements, and array access must be sequential; skipping elements, as in `b[2 * i]`, is NG:

```java
a[i] ++;            // OK
a[i] += 2;          // OK
a[i] *= b[i];       // OK
a[i] = b[i] + c[i]; // OK

for(int i=0; i<100; i++){ // NG: non-sequential access
  a[i] += b[2 * i];
}
```
Reduction loops such as the following inner product are also covered in the slides (how well they vectorize depends on the version):

```java
int r = 0;
for(int i=0; i<100; i++){
  r += a[i] * b[i];
}
```
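Putting the conditions together, here is a minimal sketch of a loop shape that should qualify (names are mine):

```java
// int counter, constant stride, no method calls in the body,
// same-index sequential array access — matches the OK patterns above
static void add(float[] dst, float[] src) {
    for (int i = 0; i < dst.length; i++) {
        dst[i] += src[i];
    }
}
```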
Not surprisingly, collection API and Stream operations in Java (and in JVM languages such as Scala and Kotlin) are not operations that can be translated into SIMD instructions. If your bottleneck is heavy arithmetic written with them, it may be worth replacing them with arrays and counted loops.
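For example, a minimal sketch of such a rewrite (names are mine): the stream version goes through lambdas and method calls, while the loop version matches the SLP patterns above.

```java
import java.util.stream.IntStream;

public class StreamVsLoop {
    // stream version: each element goes through method calls — no SLP
    static float[] addStream(float[] a, float[] b) {
        float[] r = new float[a.length];
        IntStream.range(0, a.length).forEach(i -> r[i] = a[i] + b[i]);
        return r;
    }

    // array & counted-loop version: a candidate for SLP
    static float[] addLoop(float[] a, float[] b) {
        float[] r = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = a[i] + b[i];
        }
        return r;
    }
}
```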
Since `Unsafe` access goes through method calls, you cannot expect this optimization in code that uses `Unsafe` for speed-first risk-taking.
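For illustration, a sketch of the kind of `Unsafe`-based loop meant here (my own example using the `sun.misc.Unsafe` API; every element access is a method call):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeAdd {
    private static final Unsafe U;
    static {
        try {
            // sun.misc.Unsafe is not constructible; grab the singleton
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // element-wise add through getFloat/putFloat method calls
    static void add(float[] a, float[] b) {
        long base = U.arrayBaseOffset(float[].class);
        long scale = U.arrayIndexScale(float[].class);
        for (int i = 0; i < a.length; i++) {
            long off = base + (long) i * scale;
            U.putFloat(a, off, U.getFloat(a, off) + U.getFloat(b, off));
        }
    }
}
```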
Vectorization has a 64-byte alignment constraint: the odd elements at the head of the array are handled by a non-vectorized pre-loop, and the leftover elements at the tail that don't fill a multiple of 4 are likewise handled by a non-vectorized post-loop. However, the post-loop is said to be vectorized in Java 9.
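Conceptually, the JIT splits the loop roughly like this (an illustrative sketch only; `preCount` is a hypothetical stand-in for the alignment arithmetic):

```java
// illustrative decomposition of "a[i] += b[i]" over n elements
static void addDecomposed(float[] a, float[] b, int n, int preCount) {
    int i = 0;
    // pre-loop: scalar, until the accesses are vector-aligned
    for (; i < preCount; i++) a[i] += b[i];
    // main loop: each iteration becomes one SIMD add of 4 floats
    for (; i + 4 <= n; i += 4) {
        a[i] += b[i]; a[i+1] += b[i+1]; a[i+2] += b[i+2]; a[i+3] += b[i+3];
    }
    // post-loop: scalar leftovers (vectorized from Java 9, per the text)
    for (; i < n; i++) a[i] += b[i];
}
```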
Java 8 uses 128-bit (xmm) registers, so vectorization handles 2 or 4 elements at a time, but Java 9 can use AVX2 and reportedly widens this to 32 bytes or so.
**SIMD**
**SIMD** (Single Instruction Multiple Data) is a mechanism that applies the same instruction at once to a large amount of numeric data lined up in memory.
Scientific and technical computations such as fluid simulation, image processing, audio processing, cryptography, and machine learning frequently involve huge numbers of linear-algebra operations like "multiply each value of array $a$ by the corresponding value of array $b$".
The vector units of GPUs and supercomputers are large collections of low-cost, power-efficient, space-saving arithmetic cores with unnecessary circuitry stripped away so they can perform huge amounts of computation; they can apply a single instruction to a large amount of data in registers at once.
Recent CPUs provide instructions that perform vector operations on several to dozens of values at a time, while GPUs, which handle enormous amounts of arithmetic for image processing, have thousands of arithmetic-only cores to which instructions are issued all at once. Both are SIMD, but HotSpot's SLP targets the CPU instructions.