Website scraping with jsoup

About this article

I recently had the opportunity to scrape a website with Java's jsoup library. There are already plenty of articles that explain it clearly and in detail, but this is my own memo. Official site: https://jsoup.org/

Installation

Just download the jar from the official site's download page and add it to your project's libraries. It is also available from the Maven repository; with Gradle, just add the following definition to build.gradle (the version was the latest at the time of writing).

build.gradle


dependencies {
	compile('org.jsoup:jsoup:1.12.1')
}

Example of use

As an example, take the case of extracting the date, title, and URL of each "Notice" from the following page.

<body> 
 <div class="section"> 
  <div class="block"> 
   <dl>
    <dt>2019.08.04</dt> 
    <dd>
     <a href="http://www.example.com/notice/0003.html">Notice 3</a>
    </dd> 
    <dt>2019.08.03</dt> 
    <dd>
     <a href="http://www.example.com/notice/0002.html">Notice 2</a>
    </dd> 
    <dt>2019.08.02</dt> 
    <dd>
     <a href="http://www.example.com/notice/0001.html">Notice 1</a>
    </dd> 
   </dl>
  </div>
 </div>
</body>

Extract with the following code.

Example.java


import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example {

	public static void main(String[] args) throws IOException {

		//Send a GET request to the specified URL and parse the response (a POST request is also possible with the post() method).
		Document document = Jsoup.connect("http://www.example.com").get();

		//Extract elements from the retrieved document with a CSS selector.
		//elements is a list of size 1 whose only entry is the <div class="block"> element in the HTML above.
		Elements elements = document.select(".section .block");

		//Get the <dl> inside the block, then its child elements.
		// elements.get(0).selectFirst("dl") = the <dl> element in the HTML above
		// .children() = the list of its <dt> and <dd> child elements
		List<Element> nodeList = elements.get(0).selectFirst("dl").children();

		//Extract the date, title, and URL of each notice from nodeList in a for loop.
		//The children alternate <dt> (date) and <dd> (link), so each notice occupies two entries.
		for (int i = 0; i < nodeList.size() / 2; i++) {
			String newsDate = nodeList.get(i * 2).text();
			String newsTitle = nodeList.get(i * 2 + 1).child(0).text();
			String newsUrl = nodeList.get(i * 2 + 1).child(0).attr("href");

			System.out.println(newsDate);
			System.out.println(newsTitle);
			System.out.println(newsUrl);
		}
	}
}
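
If you want to try the extraction logic without requesting a live page, jsoup can also parse an HTML string directly with Jsoup.parse. Below is a minimal, self-contained sketch of the same extraction against the HTML shown above, this time selecting the <dt> elements and walking to the neighbouring <dd>. The class name OfflineExample is mine, not part of the original example.

OfflineExample.java

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OfflineExample {

	public static void main(String[] args) {
		String html = "<div class=\"section\"><div class=\"block\"><dl>"
				+ "<dt>2019.08.04</dt><dd><a href=\"http://www.example.com/notice/0003.html\">Notice 3</a></dd>"
				+ "<dt>2019.08.03</dt><dd><a href=\"http://www.example.com/notice/0002.html\">Notice 2</a></dd>"
				+ "</dl></div></div>";

		//Jsoup.parse builds a Document from a String instead of fetching a URL.
		Document document = Jsoup.parse(html);

		//Select every <dt> under the block and pair it with the <dd> that follows it.
		for (Element dt : document.select(".section .block dl dt")) {
			Element dd = dt.nextElementSibling();
			Element link = dd.selectFirst("a");

			System.out.println(dt.text());         //date, e.g. 2019.08.04
			System.out.println(link.text());       //title, e.g. Notice 3
			System.out.println(link.attr("href")); //URL
		}
	}
}

Selecting the <dt> elements directly and stepping to the adjacent <dd> avoids the index arithmetic above, which makes the code a little more tolerant of small changes in the page layout.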

The documentation for the main jsoup classes used here is linked below; it is worth reading before use.

Document (jsoup Java HTML Parser 1.12.1 API)
Elements (jsoup Java HTML Parser 1.12.1 API)
Element (jsoup Java HTML Parser 1.12.1 API)
Node (jsoup Java HTML Parser 1.12.1 API)
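
When scraping a real site, it can also be worth setting a user agent and a timeout on the connection before calling get(), since some sites reject the default Java user agent. The values below are illustrative assumptions, not something the original article specifies.

ConnectOptionsExample.java

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectOptionsExample {

	public static void main(String[] args) throws IOException {
		//userAgent() and timeout() configure the request before get() sends it.
		Document document = Jsoup.connect("http://www.example.com")
				.userAgent("Mozilla/5.0") //assumed value; adjust as needed
				.timeout(10 * 1000)       //give up after 10 seconds
				.get();

		System.out.println(document.title());
	}
}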

How I actually use it

I use it to scrape the information page on the website of the graduate school where my relatives are enrolled.

The "Notices from the University" page on the university website, basically only notices that are not related to you are posted. Important notices are posted about once every few months, so you have to watch them often, which is a hassle. Therefore, I decided to create a batch that performs the following processing and run it with cron once an hour.

I also had some difficulty sending emails in Java. I will write about this later.

Reference article

jsoup usage memo: https://qiita.com/opengl-8080/items/d4864bbc335d1e99a2d7
