The title is clickbait. Sorry. In fact, we will build a database of **all "699 TV / Radio" books** that have circulated in Japan over the last century. Also note: **this program sends a large volume of automated requests to the National Diet Library's servers, so run it at your own risk**.
Target audience
- People who want to search the National Diet Library automatically (without using the API)
- People who can write Java to some extent
- People who don't get angry about private methods
- Mentally mature adults who understand "at your own risk" (very rarely, there are antisocial types who unfortunately do not qualify)
If you run an advanced search in National Diet Library Search filtering only on the classification symbol "699", you can find every book on television and radio from 1925 to 2020. Unfortunately, only the top 500 results of a single search can be viewed (the 501st and subsequent results are simply never displayed). You can, however, also filter by year of publication, and for every single year from 1925 to 2020 the per-year results never exceeded 500. So if we run one search per year and collect the results, it will take some time, but we can see every result.
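Concretely (assuming, as observed, that no single year exceeds the 500-result cap), that means one query per year for the 96 years from 1925 to 2020, plus one extra query for 1900, where books with unknown publication years end up (see Section 1-1-4): 97 crawler runs of at most 500 results each.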
A crawler is a program that collects information from the web. Let's design one, call it "NDLCrawler", that collects information from National Diet Library Search. It can be designed by the following procedure.
First, when I searched under the conditions shown in Figure 1.1.1.1, the URL of the search results (thumbnail display) was as follows:
https://iss.ndl.go.jp/books?datefrom=1234&place=place&rft.isbn=isbn&rft.title=title&dateto=5678&rft.au=author&subject=subject&do_remote_search=true&rft.pub=publisher&display=thumbnail&ndc=genre&search_mode=advanced
Figure 1.1.1.1 Search criteria used to observe the URL parameters

Comparing the probe values in the URL with the form fields gives Table 1.1.1.1. (Only the "result page number" parameter was confirmed in a separate trial.)
Table 1.1.1.1 National Diet Library Search URL parameters

Item | Parameter | Value |
---|---|---|
Title | rft.title | |
Author/editor | rft.au | |
Publisher | rft.pub | |
Publication year range, start | datefrom | |
Publication year range, end | dateto | |
Subject | subject | |
Classification symbol | ndc | |
ISBN/ISSN | rft.isbn | |
Place of publication | place | |
Result page number | page | |
Thumbnail display of results | display | thumbnail |
Other, kept as-is 1/2 | do_remote_search | true |
Other, kept as-is 2/2 | search_mode | advanced |
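For example, a page-1 search for classification symbol 699 limited to books published in 1950 (illustrative values, with the unused text parameters omitted) would be:

https://iss.ndl.go.jp/books?ndc=699&datefrom=1950&dateto=1950&do_remote_search=true&display=thumbnail&search_mode=advanced&page=1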
Create the Java package ndl, and create the following NDLCrawler.java in it. (The Parser class that appears in the code is described in Section 1-1-3. Briefly, the result-page source passed to its constructor is shaped into CSV by the parse() method, and the has15() method returns a boolean indicating whether the page holds a full 15 results.) (The WebGetter class that appears in the code simply fetches the HTML source from the internet with its get method.)
NDLCrawler.java
package ndl;

import java.io.*;
import java.net.*;

public class NDLCrawler
{
    private String url = "https://iss.ndl.go.jp/books?",
        title="", author="", publisher="", yearfrom="", yearto="",
        subject="", bunrui="", isbn_issn="", place="";

    public void setTitle(String str){title=str;}
    public void setAuthor(String str){author=str;}
    public void setPublisher(String str){publisher=str;}
    public void setYearfrom(String str){yearfrom=str;}
    public void setYearto(String str){yearto=str;}
    public void setSubject(String str){subject=str;}
    public void setBunrui(String str){bunrui=str;}
    public void setIsbn_issn(String str){isbn_issn=str;}
    public void setPlace(String str){place=str;}

    public String crawle()
    {
        System.out.println("Crawler started");
        String csv = "";
        // Build the query string from the parameters of Table 1.1.1.1
        String urlWithGet = url + "rft.title=" + title + "&rft.au=" + author
            + "&rft.pub=" + publisher + "&datefrom=" + yearfrom + "&dateto=" + yearto
            + "&subject=" + subject + "&ndc=" + bunrui + "&rft.isbn=" + isbn_issn + "&place=" + place;
        urlWithGet = urlWithGet + "&do_remote_search=true&display=thumbnail&search_mode=advanced";
        System.out.println(" url:" + urlWithGet + "&page=(page number)");
        WebGetter wg = new WebGetter();
        try {
            // 34 pages x 15 results = 510 >= 500, the maximum the site will show
            for(int page=1; page<=34; page++)
            {
                System.out.println(" page " + page);
                String source = wg.get(urlWithGet + "&page=" + page);
                Parser p = new Parser(source, false);
                csv = csv + p.parse().replaceFirst("^(\r\n|\r|\n)", "");
                if(!p.has15()) break; // a page with fewer than 15 results is the last one
            }
            System.out.println("Crawler finished");
            return csv;
        } catch (IOException e) { e.printStackTrace(); return null; }
    }
}
The WebGetter class looks like this (append it to the end of NDLCrawler.java):
WebGetter class
/**
 * Reference site: https://www.javalife.jp/2018/04/25/java-%E3%82%A4%E3%83%B3%E3%82%BF%E3%83%BC%E3%83%8D%E3%83%83%E3%83%88%E3%81%AE%E3%82%B5%E3%82%A4%E3%83%88%E3%81%8B%E3%82%89html%E3%82%92%E5%8F%96%E5%BE%97%E3%81%99%E3%82%8B/
 */
class WebGetter
{
    String get(String url) throws MalformedURLException, IOException
    {
        URLConnection conn = new URL(url).openConnection();
        // try-with-resources closes all three streams even if reading throws midway
        try (InputStream is = conn.getInputStream();
             InputStreamReader isr = new InputStreamReader(is);
             BufferedReader br = new BufferedReader(isr))
        {
            StringBuilder source = new StringBuilder();
            String line;
            while((line = br.readLine()) != null)
                source.append(line).append("\r\n");
            return source.toString();
        }
    }
}
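To check the class in isolation, a minimal smoke test might look like this (the class name and the query URL are mine for illustration, not part of the article's code):

WebGetterDemo.java (hypothetical)
package ndl;

public class WebGetterDemo
{
    public static void main(String... args) throws Exception
    {
        // fetch page 1 of an example query from Section 1-1-1
        String html = new WebGetter().get(
            "https://iss.ndl.go.jp/books?ndc=699&datefrom=1950&dateto=1950"
            + "&do_remote_search=true&display=thumbnail&search_mode=advanced&page=1");
        System.out.println(html.length() + " characters fetched");
    }
}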
Regarding the HTML source of the search results in National Diet Library Search, we found the following rules (a concrete sketch of this shape follows the list):

- <a href="https://iss.ndl.go.jp/books/(book ID consisting of alphanumeric characters and hyphens)"> (string 1) appears at least once for each book, and the book title always comes immediately after it.
- Immediately after the title, at least once for each book, comes </a>(line feed)(whitespace)</h3>(line feed)(whitespace)<p>(line feed)(whitespace). Let's call this string 2.
- (author name)/.*(,(author name)/.*)* (string 3) always comes immediately after "string 1 + book title + string 2".
- The publisher name comes 2 lines after the last line of "string 1 + book title + strings 2-3", the publication year 3 lines after, and the series name 4 lines after. If a piece of information is missing, its line holds an empty value; the line itself is never skipped.
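Concretely, a single book's chunk, after splitting on string 1, is shaped roughly like this (a synthetic fragment: the ID, names, and year are made up, and the real markup differs in detail):

// one "book" chunk as the parser in Section 1-1-3 sees it
String book =
      "R000000001-I000001234-00\">Sample Title</a>\r\n" // line 0: book ID, title, start of string 2
    + "  </h3>\r\n"                                     // line 1: string 2 continues
    + "  <p>\r\n"                                       // line 2: string 2 ends
    + "  Author A/writing,Author B/translation\r\n"     // line 3: string 3 (authors)
    + "  \r\n"                                          // line 4
    + "  Sample Publisher\r\n"                          // line 5: publisher, 2 lines after the authors
    + "  1950\r\n"                                      // line 6: publication year
    + "  Sample Series\r\n";                            // line 7: series
String[] lines = book.split("(\r\n)|\r|\n");
System.out.println(lines[3].trim()); // Author A/writing,Author B/translation
System.out.println(lines[5].trim()); // Sample Publisher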
A Java program that builds a CSV file of book records according to these rules (and a few exceptions) looks as follows. (The process of converting the HTML source to CSV is called parsing.)
Parser.java
package ndl;

public class Parser
{
    private boolean has15;
    private String csv;

    Parser(String source, boolean needHeader)
    {
        this.csv = needHeader ? "National Diet Library Link,title,Author,Publisher,Year,series\n" : "\n";
        String[] books = divide(source); // split on <a href="https://iss.ndl.go.jp/books/
        books = remove0(books); // only the first element is meaningless data, so cut it off
        has15 = books.length == 15; // by default, search results show 15 books per page
        String link, title, publisher, year, series;
        String[] authors;
        for(String book : books) // for each book
        {
            // if the series is numbered, cut that span so the "rules" above still apply
            book = book.replaceAll("((\r\n)|\r|\n)( |\t)*<span style=\"font-weight:normal;\">[^<]+</span>", "");
            link = getLink(book).replaceAll(",", "、");
            title = getTitle(book).replaceAll(",", "、");
            authors = getAuthors(book);
            publisher = getPublisher(book).replaceAll(",", "、");
            year = getYear(book).replaceAll(",", "、");
            series = getSeries(book).replaceAll(",", "、"); // extract the detailed fields
            for(String author : authors) // convert to CSV: one line per author
                csv = csv + link + "," + title + "," + author.replaceAll(",", "、") + "," + publisher + "," + year + "," + series + "\n";
        }
    }

    public boolean has15(){ return has15; }
    public String parse(){ return csv; }

    // Private methods that aren't exactly good practice
    private String[] divide(String source){ return source.split("<a href=\"https://iss\\.ndl\\.go\\.jp/books/", -1); }
    private String[] remove0(String[] before)
    {
        String[] after = new String[before.length - 1];
        for(int i = 1; i < before.length; i++) after[i-1] = before[i];
        return after;
    }
    private String getLink(String book){ return "https://iss.ndl.go.jp/books/" + book.split("\"")[0]; } // element 0 after splitting on " is the book ID
    private String getTitle(String book){ return book.split("<|>")[1]; } // element 1 after splitting on < or > is the title
    private String[] getAuthors(String book){ return book.split("(\r\n)|\r|\n")[3].replaceFirst("( |\t)*", "").split("/([^,])+,?"); } // line 3 holds string 3
    private String getPublisher(String book){ return book.split("(\r\n)|\r|\n")[5].replaceFirst("( |\t)*", ""); } // line 5
    private String getYear(String book){ return book.split("(\r\n)|\r|\n")[6].replaceFirst("( |\t)*", ""); } // line 6
    private String getSeries(String book){ return book.split("(\r\n)|\r|\n")[7].replaceFirst("( |\t)*", ""); } // line 7
}
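The two classes can be combined for a single page like this (a minimal sketch: the class name and query are mine, and in the real program NDLCrawler drives this fetch across pages):

ParserDemo.java (hypothetical)
package ndl;

public class ParserDemo
{
    public static void main(String... args) throws Exception
    {
        // fetch one result page, then shape it into CSV
        String source = new WebGetter().get(
            "https://iss.ndl.go.jp/books?ndc=699&datefrom=1950&dateto=1950"
            + "&do_remote_search=true&display=thumbnail&search_mode=advanced&page=1");
        Parser p = new Parser(source, true); // true: prepend the CSV header line
        System.out.print(p.parse());         // CSV rows for up to 15 books
        System.out.println("full page? " + p.has15());
    }
}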
In Section 1-1-2 we designed a crawler class that accesses National Diet Library Search, and in Section 1-1-3 we implemented a class that generates CSV from the information the crawler obtains. In Section 1-1-4, let's design a class that actually controls the crawler using these classes.

The control logic: loop over each year from 1925 to 2020 in a for statement, set 699 as the classification symbol, run the crawler, and append the finished CSV to a file. Also, books whose publication year is unknown appear to be treated as "1900", so account for that as well.
Main.java
package ndl;

import java.io.*;

public class Main
{
    public static void main(String... args)
    {
        String header = "National Diet Library Link,title,Author,Publisher,Year,series,library\n"; // header row (not written to the file in this version)
        NDLCrawler c = new NDLCrawler();
        c.setBunrui("699");
        generateCsv(c);
    }

    private static void generateCsv(NDLCrawler c)
    {
        System.out.println(1900); // books with unknown publication years are filed under 1900
        c.setYearfrom("1900");
        c.setYearto("1900");
        output(c.crawle()); // append to the file
        for(int year = 1925; year <= 2020; year++)
        {
            System.out.println(" " + year);
            c.setYearfrom("" + year);
            c.setYearto("" + year);
            output(c.crawle()); // append to the file
        }
    }

    private static void output(String csv)
    {
        if(csv == null) return; // crawle() returns null on I/O errors
        String path = "D:\\all699.csv"; // change the path as needed
        System.out.println("output" + csv);
        try {
            FileWriter fw = new FileWriter(path, true); // second argument true: append mode, so re-runs add to an existing file
            fw.write(csv);
            fw.close();
        } catch (IOException e) { e.printStackTrace(); }
    }
}
The experimental environment and conditions are shown in Table 2-1.
Table 2-1. Experimental environment and conditions

Item | Value |
---|---|
OS | Windows 10 |
Software | Eclipse IDE for Enterprise Java Developers (4.11.0) |
Provider and source address | Jupiter Telecommunication Co., Ltd. (210.194.32.203) |
Program start date and time (Japan time) | January 17, 2020, 20:44:00 |
Program stop time, run length, and reason for stopping | January 17, 2020, 21:39:44 (about 56 minutes, normal termination) |
The output file all699.csv (uploaded to GitHub) has 12145 lines. After the experiment, I ran the same search in a browser and checked the hit count under the same conditions: 12633 books. Investigating this discrepancy, I found that the number of books whose publication year is given as 1900 or 1925-2020, the only years the crawler queried, totals 12030. That is slightly fewer than the 12145 output lines, but the difference (12145 - 12030 = 115 lines) is explained by the Parser.getAuthors specification: when a book has multiple authors, each author is written on a separate line.
In National Diet Library Search you can also search the collections of a number of local libraries, in addition to the National Diet Library itself. I'm interested in applying this to research that separates "books held everywhere" from "books that aren't", so I'd like to try that next.