Insert Eijiro's dictionary data into Oracle DB part2

At the beginning

The procedure when inserting Eijiro's dictionary data into an Oracle table is described. This time, I will extract the word information from the HTML file created last time and insert it in the DB.

Information analysis of HTML files

Thing you want to do

If you simply write what you want to do

  1. Use Jsoup to get the text information belonging to the tr tag element
  2. Convert the text information obtained in 1. to a type (Object []) that can be executed by the batch method of CherryRunner.
  3. Insert 10,000 lists of List \ <Objects [] > created in 1. and 2. is

Implementation

WordInserter.java


package word;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import org.jsoup.Jsoup;
import word.sql.WordDictionarySql;

public class WordInserter {

	private static String WORD_DICTIONARY =
			"C:/Users/*****/Desktop/puisan_102.txt";
	private static int LIMMIT = 10000;

	public static void main(String args[])throws Exception{
    	long start = System.currentTimeMillis();
    	System.out.println("List creation process starts");
		List <Object[]>list =getDictionarySource()
				.stream().collect(Collectors.toList());
		long end = System.currentTimeMillis();
    	System.out.println("List creation process is completed");
		System.out.println(String.format("processing time%d seconds",(end-start)/1000));
		
    	start = System.currentTimeMillis();
    	System.out.println("Start DB processing");
		OperateDB.createConnection();
		operateList(list);
		end = System.currentTimeMillis();
    	System.out.println("DB creation process is complete");
		System.out.println(String.format("processing time%d seconds",(end-start)/1000));
		OperateDB.closeConnection();
	}
	
	//A function that uses Jsoup to parse HTML tags and list the results
	public static List<Object[]> getDictionarySource(){
		try{
			return Jsoup.parse(new File(WORD_DICTIONARY),"UTF-8")
					//Get all child elements belonging to the tag tr
					.getElementsByTag("tr") 
					.stream()
					//Gets the text of the child element belonging to the tag tr and
					//Data structures that can be inserted into the DB using the map function
					//(In this case Object[])Convert to
					.map(s->new DictionaryWord(
							s.getAllElements().eachText()).params)
					//List so that it can be inserted by batch of QueryLinner<Object[]>
					//Aggregate to
					.collect(Collectors.toList());
		}catch(Exception e){
			e.printStackTrace();
			return new ArrayList<Object[]>();
		}
    }
	
	//Execute Insert processing in units of 10,000 for the List passed as an argument
	public static void operateList(List<Object[]> word){
    	Stream.iterate(1, s->s+1)
    	.limit((int)Math.ceil(word.size()/LIMMIT))
    	.forEach(i->executeSQL(word,i));
	}
	
    public static void executeSQL(List<Object[]>list,int from){
    	if(list.size()<((from+1)*LIMMIT))
    		WordDictionarySql.insertToWordDictionary(
    				list.subList(from*LIMMIT, list.size()));
    	else 
    		WordDictionarySql.insertToWordDictionary(
    				list.subList((from-1)*LIMMIT, from*LIMMIT));
    }
	
	public static class DictionaryWord{
		public long wordIndex;
		public int wordLevel;
		public String word;
		public String searchWord;
		public String meaning;
		public String shortVer;
		public Object[] params=new Object[6];
		
		public DictionaryWord(List<String>list){
			this.wordIndex = Long.parseLong(list.get(1));
			params[0] = this.wordIndex;
			this.wordLevel = Integer.parseInt(list.get(3).replaceAll("■", ""));
			params[1] = this.wordLevel;
			this.word = list.get(2);
			params[2] = this.word;
			this.searchWord = list.get(2).toLowerCase();
			params[3] = this.word;
			this.meaning = list.get(4);
			params[4] = this.meaning;
			this.shortVer = list.get(4).length()>1000?
					list.get(4).substring(0,1000):list.get(4);
			params[5] = this.shortVer;
		}
		public String toString(){
			String format = "wordIndex=%d  wordLevel=%d "
					+ "word=%s meaning=%s shortVer=%s ";
			return String.format(format,this.wordIndex,this.wordLevel, 
					this.word,this.meaning,this.shortVer);
		}
	}
}

Miscellaneous feelings

I am surprised that the DOM created as a result of Jsoup's analysis and the Stream function of Java 8 are too compatible. It wouldn't be great without the Stream function, but maybe I didn't want to implement it in Java (In that case, I think I was using ruby)

Insertion process into DB

Thing you want to do

Just pass the list of List \ <Ojbect [] > given as an argument to OperateDB # insert created in Previous article.

Implementation

WordDictionarySql.java


package word.sql;

import java.util.List;
import word.OperateDB;

public class WordDictionarySql {
	
    public static String INSERT_SQL = "insert into WORD_DICTIONAY "
    		+ "(WORD_INDEX,WORD_LEVEL,WORD,SEARCH_WORD,MEANING,SHORT_VER) values(?,?,?,?,?,?)";
    
    public static void insertToWordDictionary(List<Object[]> word){
    	try{
    		OperateDB.insert(word,INSERT_SQL);
    	}catch(Exception e){
    		e.printStackTrace();
    		throw new RuntimeException("An error occurred during DB insert processing");
    	}
    }
}

Impressions

I put about 330,000 data into the Oracle table, but the execution time was about 102 seconds. By the way, when autocommit was enabled, the execution time was about twice as long as 230 seconds. From the top of the numbers, I was able to realize that enabling autocommit will reduce performance.

Recommended Posts

Insert Eijiro's dictionary data into Oracle DB part2
Insert Eijiro's dictionary data into an Oracle table
Insert data into DB using Yaml file
Until you insert an S3 object into an EC2 DB with Lambda @ java: Java [Part 2]
Until you insert an S3 object into an EC2 DB with Lambda @ java: Java [Part 1]
Try putting a lot of data into Cosmos DB
Set up a Db2 DB container and insert a little data