Website scraping with jsoup

About this article

I recently had the opportunity to scrape a website with Java's jsoup library. There are already plenty of articles that explain it clearly and in detail, but this is my own memo. Official site: https://jsoup.org/

Installation

Just download the jar from the official site's download page and add it to your project's libraries. It is also available from the Maven repository; with Gradle, just add the following definition to build.gradle (the version was the latest at the time of writing).

build.gradle


dependencies {
	compile('org.jsoup:jsoup:1.12.1')
}

Example of use

As an example, take the case of extracting the date, title, and URL of each "Notice" from the following page.

<body> 
 <div class="section"> 
  <div class="block"> 
   <dl>
    <dt>2019.08.04</dt> 
    <dd>
     <a href="http://www.example.com/notice/0003.html">Notice 3</a>
    </dd> 
    <dt>2019.08.03</dt> 
    <dd>
     <a href="http://www.example.com/notice/0002.html">Notice 2</a>
    </dd> 
    <dt>2019.08.02</dt> 
    <dd>
     <a href="http://www.example.com/notice/0001.html">Notice 1</a>
    </dd> 
   </dl>
  </div>
 </div>
</body>

Extract with the following code.

Example.java


import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example {

	public static void main(String[] args) throws IOException {

		//Send a GET request to the specified URL and parse the response (a POST request is also possible with the post() method).
		Document document = Jsoup.connect("http://www.example.com").get();

		//Extract elements from the retrieved document with a CSS selector.
		//elements is a list of size 1 whose only entry is the <div class="block"> element in the HTML above.
		Elements elements = document.select(".section .block");

		//Get the <dl> inside the block, then its child elements.
		// elements.get(0).selectFirst("dl") = the <dl> element in the HTML above
		// .children() = the list of its <dt> and <dd> child elements
		List<Element> nodeList = elements.get(0).selectFirst("dl").children();

		//Extract the date, title, and URL of each notice from nodeList in a for loop.
		//The children alternate <dt> (date) and <dd> (link), so each notice occupies two entries.
		for (int i = 0; i < nodeList.size() / 2; i++) {
			String newsDate = nodeList.get(i * 2).text();
			String newsTitle = nodeList.get(i * 2 + 1).child(0).text();
			String newsUrl = nodeList.get(i * 2 + 1).child(0).attr("href");

			System.out.println(newsDate);
			System.out.println(newsTitle);
			System.out.println(newsUrl);
		}
	}
}
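
If you want to try the extraction logic without requesting a live page, jsoup can also parse an HTML string directly with Jsoup.parse. Below is a minimal, self-contained sketch of the same extraction against the HTML shown above, this time selecting the <dt> elements and walking to the neighbouring <dd>. The class name OfflineExample is mine, not part of the original example.

OfflineExample.java

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OfflineExample {

	public static void main(String[] args) {
		String html = "<div class=\"section\"><div class=\"block\"><dl>"
				+ "<dt>2019.08.04</dt><dd><a href=\"http://www.example.com/notice/0003.html\">Notice 3</a></dd>"
				+ "<dt>2019.08.03</dt><dd><a href=\"http://www.example.com/notice/0002.html\">Notice 2</a></dd>"
				+ "</dl></div></div>";

		//Jsoup.parse builds a Document from a String instead of fetching a URL.
		Document document = Jsoup.parse(html);

		//Select every <dt> under the block and pair it with the <dd> that follows it.
		for (Element dt : document.select(".section .block dl dt")) {
			Element dd = dt.nextElementSibling();
			Element link = dd.selectFirst("a");

			System.out.println(dt.text());         //date, e.g. 2019.08.04
			System.out.println(link.text());       //title, e.g. Notice 3
			System.out.println(link.attr("href")); //URL
		}
	}
}

Selecting the <dt> elements directly and stepping to the adjacent <dd> avoids the index arithmetic above, which makes the code a little more tolerant of small changes in the page layout.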

The documentation for the main jsoup classes used here is linked below; it is worth reading before use.

Document (jsoup Java HTML Parser 1.12.1 API)
Elements (jsoup Java HTML Parser 1.12.1 API)
Element (jsoup Java HTML Parser 1.12.1 API)
Node (jsoup Java HTML Parser 1.12.1 API)
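
When scraping a real site, it can also be worth setting a user agent and a timeout on the connection before calling get(), since some sites reject the default Java user agent. The values below are illustrative assumptions, not something the original article specifies.

ConnectOptionsExample.java

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectOptionsExample {

	public static void main(String[] args) throws IOException {
		//userAgent() and timeout() configure the request before get() sends it.
		Document document = Jsoup.connect("http://www.example.com")
				.userAgent("Mozilla/5.0") //assumed value; adjust as needed
				.timeout(10 * 1000)       //give up after 10 seconds
				.get();

		System.out.println(document.title());
	}
}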

How I actually use it

I use it to scrape the information page on the website of the graduate school where my relatives are enrolled.

The "Notices from the University" page on the university website, basically only notices that are not related to you are posted. Important notices are posted about once every few months, so you have to watch them often, which is a hassle. Therefore, I decided to create a batch that performs the following processing and run it with cron once an hour.

I also had some difficulty sending emails in Java. I will write about this later.

Reference article

jsoup usage memo: https://qiita.com/opengl-8080/items/d4864bbc335d1e99a2d7
