Creating a database of all the books that have circulated in Japan over the last century

0. Introduction (target audience, etc.)

It's a clickbait title, sorry. What we will actually create is a database of all books classified under **"699: Television / Radio"** that have circulated in Japan over the last century. Also, **since this performs a large amount of automated access to the National Diet Library's servers, proceed at your own risk**.

Target audience

- People who want to search the National Diet Library automatically (without using the API)
- People who can write Java to some extent
- People who don't get angry about private methods
- Mentally mature adults who understand "at your own risk" (very rarely, there are unfortunate antisocial types who do not qualify)

1. Experiment

If you do an advanced search on National Diet Library Search and filter only by the classification symbol "699", you can find every book on television and radio from 1925 to 2020. Unfortunately, only the top 500 results of a single search can be viewed (the 501st and subsequent items are simply not displayed). However, you can also refine the search by year of publication, and in no single year from 1925 to 2020 did the results exceed 500. So, if we run one search per year and collect the results, it will take some time, but we can see every result.
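Since each result page shows 15 items and a single query returns at most 500 hits, sweeping one query needs at most ceil(500/15) = 34 pages, which is the page limit that appears later in the crawler's loop. A quick check:

```java
public class PageCount
{
	public static void main(String[] args)
	{
		int maxResults = 500; // NDL Search shows at most 500 hits per query
		int perPage = 15;     // results per page in thumbnail display
		int maxPages = (maxResults + perPage - 1) / perPage; // ceiling division
		System.out.println(maxPages); // prints 34
	}
}
```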

1-1. Experiment 1: ndl crawler design

A crawler is a program that collects information from the web. Let's design a program that collects information from National Diet Library Search; call it the "ndl crawler". It can be designed by the following procedure.

  1. Use National Diet Library Search from your browser.
  2. Enter alphanumeric values in every field of the advanced search and execute the search.
  3. Work out which field corresponds to which parameter by comparing the fields with the GET parameters.
  4. Design the program based on the hints obtained in steps 1-3.

1-1-1. Steps 1-3

First, when I searched under the conditions shown in Figure 1.1.1.1, the URL of the search results (thumbnail display) was as follows: https://iss.ndl.go.jp/books?datefrom=1234&place=place&rft.isbn=isbn&rft.title=title&dateto=5678&rft.au=author&subject=subject&do_remote_search=true&rft.pub=publisher&display=thumbnail&ndc=genre&search_mode=advanced

Figure 1.1.1.1 Search conditions for URL parameter observation

By comparing the two, we can see the correspondence shown in Table 1.1.1.1. (Only the "result page number" was confirmed in a separate trial.)

Table 1.1.1.1 National Diet Library url parameters

| Item | Parameter | Value |
|---|---|---|
| Title | rft.title | |
| Author / editor | rft.au | |
| Publisher | rft.pub | |
| Publication year range start | datefrom | |
| Publication year range end | dateto | |
| Subject | subject | |
| Classification symbol | ndc | |
| ISBN/ISSN | rft.isbn | |
| Place of publication | place | |
| Result page number | page | |
| Thumbnail display of results | display | thumbnail |
| Other (kept as-is for now) 1/2 | do_remote_search | true |
| Other (kept as-is for now) 2/2 | search_mode | advanced |
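The parameters in Table 1.1.1.1 can be assembled into a search URL like the one observed above. As a sketch (the `buildUrl` helper and the example values are mine, not part of the original program; this version URL-encodes the values, which the crawler below skips since it only sends alphanumerics):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlSketch
{
	// assembles an NDL Search URL from a few of the Table 1.1.1.1 parameters
	static String buildUrl(String ndc, String datefrom, String dateto, int page)
	{
		return "https://iss.ndl.go.jp/books?"
			+ "ndc=" + URLEncoder.encode(ndc, StandardCharsets.UTF_8)
			+ "&datefrom=" + URLEncoder.encode(datefrom, StandardCharsets.UTF_8)
			+ "&dateto=" + URLEncoder.encode(dateto, StandardCharsets.UTF_8)
			+ "&do_remote_search=true&display=thumbnail&search_mode=advanced"
			+ "&page=" + page;
	}

	public static void main(String[] args)
	{
		System.out.println(buildUrl("699", "1925", "1925", 1));
	}
}
```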

1-1-2. Step 4 (Crawler design)

Create the Java package ndl and create the following NDLCrawler.java in it. (The Parser class that appears in the code is described in Section 1-1-3; briefly, the content of the result page passed to the constructor is formatted into CSV by the parse() method, and the has15() method returns a boolean indicating whether the page contains 15 results.) (The WebGetter class that appears in the code simply fetches HTML source from the internet with its get method.)

NDLCrawler.java


package ndl;

import java.io.*;
import java.net.*;

public class NDLCrawler
{
	private String url = "https://iss.ndl.go.jp/books?",
		title="", author="", publisher="", yearfrom="", yearto="",
		subject="", bunrui="", isbn_issn="", place="";

	public void setTitle(String str){title=str;}
	public void setAuthor(String str){author=str;}
	public void setPublisher(String str){publisher=str;}
	public void setYearfrom(String str){yearfrom=str;}
	public void setYearto(String str){yearto=str;}
	public void setSubject(String str){subject=str;}
	public void setBunrui(String str){bunrui=str;}
	public void setIsbn_issn(String str){isbn_issn=str;}
	public void setPlace(String str){place=str;}

	public String crawle()
	{
		System.out.println("Crawler started");
		String csv="";
		String urlWithGet = url + "rft.title=" + title + "&rft.au=" + author
			+ "&rft.pub=" + publisher + "&datefrom=" + yearfrom + "&dateto=" + yearto
			+ "&subject=" + subject + "&ndc=" + bunrui + "&rft.isbn=" + isbn_issn
			+ "&place=" + place;
		urlWithGet = urlWithGet + "&do_remote_search=true&display=thumbnail&search_mode=advanced";
		System.out.println("  url:" + urlWithGet + "&page=(page number)");
		WebGetter wg = new WebGetter();
		try {
			// at most 500 results, 15 per page -> at most 34 pages
			for(int page=1; page<=34; page++)
			{
				System.out.println("   page " + page);
				String source = wg.get(urlWithGet + "&page=" + page);
				Parser p = new Parser(source, false);
				csv = csv + p.parse().replaceFirst("^(\r\n|\r|\n)", "");
				if(!p.has15()) break; // fewer than 15 results: this was the last page
			}
			System.out.println("Crawler finished");
			return csv;
		} catch (IOException e) { e.printStackTrace(); return null; }
	}
}

The WebGetter class looks like this: (Add after NDLCrawler.java)

WebGetter class


/**
 *
 *Reference site:https://www.javalife.jp/2018/04/25/java-%E3%82%A4%E3%83%B3%E3%82%BF%E3%83%BC%E3%83%8D%E3%83%83%E3%83%88%E3%81%AE%E3%82%B5%E3%82%A4%E3%83%88%E3%81%8B%E3%82%89html%E3%82%92%E5%8F%96%E5%BE%97%E3%81%99%E3%82%8B/
 *
 */
class WebGetter
{
	String get(String url) throws MalformedURLException, IOException
	{
		URLConnection conn = new URL(url).openConnection();
		StringBuilder source = new StringBuilder();
		// try-with-resources closes the streams even on failure
		// (the original finally block could throw NullPointerException
		// if openConnection() failed before the readers were created)
		try (InputStream is = conn.getInputStream();
		     InputStreamReader isr = new InputStreamReader(is);
		     BufferedReader br = new BufferedReader(isr))
		{
			String line;
			while((line = br.readLine()) != null)
				source.append(line).append("\r\n");
		}
		return source.toString();
	}
}
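Given the introduction's warning about heavy automated access, it may be worth adding a delay between requests and connection timeouts. A possible variant (the one-second delay and ten-second timeouts are arbitrary choices of mine, not from the original program):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

class PoliteWebGetter
{
	String get(String url) throws IOException, InterruptedException
	{
		Thread.sleep(1000); // wait one second between requests (arbitrary)
		URLConnection conn = new URL(url).openConnection();
		conn.setConnectTimeout(10_000); // fail instead of hanging forever
		conn.setReadTimeout(10_000);
		StringBuilder source = new StringBuilder();
		try (BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream())))
		{
			String line;
			while ((line = br.readLine()) != null)
				source.append(line).append("\r\n");
		}
		return source.toString();
	}
}
```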

1-1-3. Step 4 (Create csv from the source (html) of the result page)

Regarding the html source of the search results in the "National Diet Library Search", we found the following rules.

- `<a href="https://iss.ndl.go.jp/books/(book ID consisting of alphanumeric characters and hyphens)">` (call this string 1) always comes immediately before the book title, and conversely appears at least once for each book.
- At least once for each book, immediately after the title come `</a>`(line feed)(white space)`</h3>`(line feed)(white space)`<p>`(line feed)(white space). Call this string 2.
- `(author name)/.*(,(author name)/.*)*` (call this string 3) always comes immediately after "string 1, book title, string 2".
- The publisher name comes 2 lines after the last line of "string 1, book title, strings 2-3", the publication year 3 lines after, and the series name 4 lines after. If a piece of information is missing, the field is empty but the line is not skipped.

If you write a Java program that builds a CSV file of book data according to these rules (and a few exceptions), it looks like the following. (The process of converting the HTML source into CSV is called parsing.)

Parser.java


package ndl;

public class Parser
{
	private boolean has15;
	private String csv;

	Parser(String source, boolean needHeader)
	{
		this.csv = needHeader ? "National Diet Library Link,title,Author,Publisher,Year,series\n" : "\n";
		String[] books = divide(source); // split on '<a href="https://iss.ndl.go.jp/books/'
		books = remove0(books);          // only the first chunk is meaningless data, so cut it off
		has15 = books.length==15;        // by default there are 15 search results per page
		String link, title, publisher, year, series;
		String[] authors;
		for(String book : books) // for each book
		{
			// if the series is numbered, cutting that information makes the book fit the "rules"
			book = book.replaceAll("((\r\n)|\r|\n)( |\t)*<span style=\"font-weight:normal;\">[^<]+</span>","");
			// extract the detailed information
			link = getLink(book).replaceAll(",", "、");
			title = getTitle(book).replaceAll(",", "、");
			authors = getAuthors(book);
			publisher = getPublisher(book).replaceAll(",", "、");
			year = getYear(book).replaceAll(",", "、");
			series = getSeries(book).replaceAll(",", "、");
			for(String author : authors) // convert to CSV, one line per author
				csv = csv + link+","+title+","+author.replaceAll(",", "、")+","+publisher+","+year+","+series+"\n";
		}
	}

	public boolean has15(){return has15;}
	public String parse() {return csv;}

	// private methods that aren't really good style
	private String[] divide(String source){return source.split("<a href=\"https://iss\\.ndl\\.go\\.jp/books/", -1);}
	private String[] remove0(String[] before)
	{
		String[] after = new String[before.length-1];
		for(int i=1; i<before.length; i++)after[i-1]=before[i];
		return after;
	}
	private String getLink(String book){return "https://iss.ndl.go.jp/books/"+book.split("\"")[0];} // return chunk 0 after splitting on '"'
	private String getTitle(String book){return book.split("<|>")[1];} // return chunk 1 after splitting on '<' or '>'
	private String[] getAuthors(String book){return book.split("(\r\n)|\r|\n")[3].replaceFirst("( |\t)*", "").split("/([^,])+,?");}
	private String getPublisher(String book){return book.split("(\r\n)|\r|\n")[5].replaceFirst("( |\t)*", "");}
	private String getYear(String book){return book.split("(\r\n)|\r|\n")[6].replaceFirst("( |\t)*", "");}
	private String getSeries(String book){return book.split("(\r\n)|\r|\n")[7].replaceFirst("( |\t)*", "");}
}
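To see the extraction rules in action, here is a self-contained sketch that applies the same split-based logic as Parser to a made-up fragment shaped like the rules above (the fragment and the class name are hypothetical; real result pages are larger and messier):

```java
public class ParseSketch
{
	// same extraction logic as Parser.getLink / Parser.getTitle
	static String getLink(String book){ return "https://iss.ndl.go.jp/books/" + book.split("\"")[0]; }
	static String getTitle(String book){ return book.split("<|>")[1]; }

	public static void main(String[] args)
	{
		// made-up fragment following the rules of Section 1-1-3
		String source = "(leading page markup)\n"
			+ "<a href=\"https://iss.ndl.go.jp/books/R100000002-I000000001-00\">Radio Yearbook</a>\n"
			+ "        </h3>\n"
			+ "        <p>\n"
			+ "        Tanaka Taro/ed\n";
		String[] books = source.split("<a href=\"https://iss\\.ndl\\.go\\.jp/books/", -1);
		String book = books[1]; // books[0] is the markup before the first match
		System.out.println(getLink(book));  // https://iss.ndl.go.jp/books/R100000002-I000000001-00
		System.out.println(getTitle(book)); // Radio Yearbook
	}
}
```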

1-1-4. Step 4 (crawler control)

In Section 1-1-2 we designed a crawler class that accesses National Diet Library Search, and in Section 1-1-3 we implemented a class that generates CSV from the information the crawler obtains. In this section, let's design a class that actually controls the crawler using these classes.

As for the control: specify each year from 1925 to 2020 in a for statement, specify 699 as the classification number, run the crawler, and write the completed CSV to a file. Also, books whose publication year is unknown appear to be treated as "1900", so take this into account as well.

Main.java


package ndl;

import java.io.*;

public class Main
{
	public static void main(String...args)
	{
		String header = "National Diet Library Link,title,Author,Publisher,Year,series,library\n"; // note: unused here, since Parser is constructed with needHeader=false
		NDLCrawler c = new NDLCrawler();
		c.setBunrui("699"); // classification symbol 699: television / radio
		generateCsv(c);
	}

	private static void generateCsv(NDLCrawler c)
	{
		// books with unknown publication year are treated as 1900
		System.out.println(1900);
		c.setYearfrom("1900");
		c.setYearto("1900");
		output(c.crawle()); // append to the file
		for(int year=1925; year<=2020; year++)
		{
			System.out.println(" "+year);
			c.setYearfrom(""+year);
			c.setYearto(""+year);
			output(c.crawle()); // append to the file
		}
	}

	private static void output(String csv)
	{
		String path = "D:\\all699.csv"; // the path can be changed arbitrarily
		System.out.println("output"+csv);
		try {
			FileWriter fw = new FileWriter(path, true); // second argument true: append mode
			fw.write(csv);
			fw.close();
		} catch (IOException e) { e.printStackTrace(); }
	}
}
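To check the output afterwards (Section 2 reports the line count of the finished file), a small sketch that counts the lines of the CSV (the path is the one from Main.java; countLines is my helper, not part of the original program):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CsvLineCount
{
	static long countLines(Path p) throws IOException
	{
		try (Stream<String> lines = Files.lines(p)) // closes the file when done
		{
			return lines.count();
		}
	}

	public static void main(String[] args) throws IOException
	{
		System.out.println(countLines(Paths.get("D:\\all699.csv")));
	}
}
```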

2. Result

The experimental environment and conditions are shown in Table 2-1.

Table 2-1. Experimental environment and conditions

| Item | Value |
|---|---|
| OS | Windows 10 |
| Software | Eclipse IDE for Enterprise Java Developers (4.11.0) |
| Provider and source | Jupiter Telecommunication Co., Ltd. (210.194.32.203) |
| Program start date and time (Japan time) | January 17, 2020 20:44:00 |
| Program stop time, running period, and reason for stopping | January 17, 2020 21:39:44 (about 56 minutes, normal termination) |

The output file all699.csv (uploaded to GitHub) has 12,145 lines.

In addition, after the experiment, I accessed National Diet Library Search with a browser and checked the result count under the same conditions: 12,633. Investigating this discrepancy, I found that the total number of books whose publication year is given as 1900 or 1925-2020 is 12,030. That is slightly fewer than the 12,145 output lines, but the surplus is thought to come from the Parser.getAuthors behavior of writing each author of a multi-author book on a separate line.

3. Future outlook

With National Diet Library Search you can search the collections of several regional libraries in addition to the National Diet Library itself. I'm interested in using this for research that classifies books into "books held everywhere" and "books that are not", so I would like to try that next.
