Roughly speaking

Random access to list-structured data is not allowed. It's a promise with my brother!

Origin

A few years ago, I was contacted to ask for help because it would be unbelievably slow if I put a little big data into a Java system created with the support of other departments.

Well, if you look at the query or index for a moment, it will be fixed ... I picked my nose and headed for support.

Processing content

The processing of the slow part was as follows.

Receive the data to be processed as a List.
Preprocess it one by one with a for loop.
Store the processing result in the object and insert it into the DB with the OR mapper.

only this? Yes, that's it. Of course, parallel processing is not a high-class thing.

Infrastructure survey

Investigate what the server is doing. The infrastructure this time is a typical 3-layer 3-server configuration.

Everything on the WEB server can afford. The AP server has used up one CPU. Since it is a 14-core server (EC2 c4.4xlarge), it is sad that the usage rate is 7% no matter how hard one CPU works. There is some disk load on the DB server, but there is a margin in terms of performance.

I also investigated the network, memory, etc., but all the resources are surplus. Various setting values are also normal.

Investigate around DB

As you can see, it's a process that only INSERT is done. Therefore, although I was worried about the CPU load of the AP, everyone on the spot was suspicious of the DB and messing around with it. In the end, did you step on the bug of OR Mapper? I also had a nostalgic experience of using JDBC directly.

However, there was no result, and Rambo's angry holiday work was decided ...

Discovery

Go back to the basics and redo the analysis of logic processing time. Under the testimony that "I feel that the processing becomes heavier as the number of loops progresses" I decided to find out the relationship between the number of loops and the processing time of one loop.

The following is an image of the analysis results. We changed the amount of input data and compared N10,000 and 20,000.

** "Half after ... !!" **

Since the smaller data is the first half of the larger data, it does not seem to fluctuate depending on the content of the data. Upon closer analysis, the line that took the most time at the top of the graph was: The type of the variable data isList <MyClass>.

MyClass record = data.get(i);

I was amazed at this. Just getting the elements from the List is insanely slow. Someone opened his mouth.

** "This is a Linked List, isn't it?" **

It took a tremendous amount of time to retrieve the N10,000th of the Linked List, which has 2N million records. Perhaps the data in the latter half is accessed from the bottom, based on half the size of the list.

The trouble that invited us to take a break ended with a change in the refactoring level of changing the for statement to an extended for statement.

for (int i = 0; i < data.size(); i++) {
    MyClass record = data.get(i);
    ...
}

↓

int i = 0;
for (MyClass record: data) {
    ...
    i++;
}

What's the difference at first glance? Although it is a change, for List, random access before the change, After the change, it became sequential access (by Iterator), and when the number of data was large, there was a big difference in performance.

I learned a lot.

The reason I didn't use the extended for statement at first was that I wanted to use the value of i in the process. I wish Java had a Python enumerate ... (grudge)

A story that ended up taking a break when using the Linked List with a light feeling