Scraping is the process of extracting data, such as specific images and titles, from the HTML of a website.
To scrape in Java, use a library called **jsoup**.
jsoup is a library for parsing HTML, and it provides various classes for working with the parsed result.
First, add the following dependency to pom.xml.
```xml
<dependencies>
  <!-- other dependencies omitted -->
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
  </dependency>
</dependencies>
```
Scraping with jsoup takes three steps:
① Get the HTML of the website.
② Search that HTML for the elements with the specified tag.
③ Extract text and attribute values from those elements.
Use the **Document class** to work with HTML. Declare a variable of type Document and assign the acquired HTML to it, as follows.
```java
// Replace "url" with the address of the page to scrape.
Document document = Jsoup.connect("url").get();
```
Passing a URL string to the connect method and then calling get fetches the HTML of the website at that URL and parses it. The result is assigned to a variable of type Document.
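Since get performs a network request, it can fail with a checked IOException. The following is a minimal sketch of one way to handle that, not the only one. The class name `FetchSketch`, the method name `fetch`, the user agent string, the 5 second timeout, and the `example.com` address are all assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.util.Optional;

class FetchSketch {
    // Fetches and parses the page, or returns empty if the request fails.
    static Optional<Document> fetch(String url) {
        try {
            Document document = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (compatible; example-scraper)") // hypothetical value
                    .timeout(5000) // milliseconds
                    .get();
            return Optional.of(document);
        } catch (IOException e) {
            // Covers connection failures, timeouts, and HTTP error statuses.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // example.com is a placeholder; substitute the page you want to scrape.
        fetch("https://example.com/")
                .ifPresent(document -> System.out.println(document.title()));
    }
}
```

Returning an Optional keeps the failure case explicit for the caller; declaring `throws IOException` on the calling method would work just as well for a small script.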
To pick out the elements for a specific tag from the acquired HTML, use the **select method**.
```java
Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3");
```
The select method is called on the right-hand side of the second line. Its argument is a CSS selector string; because "h3" is passed here, all h3 elements in the page at the specified URL are collected and assigned to a variable of type Elements. The Elements class holds Element objects in list form, and the Element class represents a single HTML element.
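As a minimal sketch of how select and Elements behave, the example below parses a small inline HTML string with `Jsoup.parse` instead of fetching a page, so it runs without network access. The class name `SelectSketch`, the method name `selectHeadings`, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectSketch {
    // Hypothetical sample markup; a real page would come from Jsoup.connect(...).get().
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<h3>First heading</h3>"
            + "<p>Some text</p>"
            + "<h3>Second heading</h3>"
            + "</body></html>";

    // Returns every h3 element found in the given HTML.
    static Elements selectHeadings(String html) {
        Document document = Jsoup.parse(html);
        return document.select("h3");
    }

    public static void main(String[] args) {
        Elements elements = selectHeadings(SAMPLE_HTML);
        // Elements can be used like a list of Element objects.
        System.out.println(elements.size());
        Element first = elements.get(0);
        System.out.println(first.tagName());
    }
}
```

If nothing matches the selector, select returns an empty Elements rather than null, so the result can be looped over without a null check.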
Use the **text method** to get an element's text, and the **attr method** to get the value of an attribute.
```java
Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3");
for (Element element : elements) {
    System.out.println(element.text());
}
```
This extracts the text of each h3 element obtained by the select method and prints it to the console.
```java
Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3 a");
for (Element element : elements) {
    System.out.println(element.attr("href"));
}
```
This extracts the href attribute of each element matched by "h3 a" (that is, each a element inside an h3 element) and prints it to the console.
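Links in real pages are often relative (for example `/posts/1`), so the raw href value may not be directly usable. As a hedged sketch, jsoup's `absUrl` method can resolve an attribute against the document's base URI. The example parses an inline HTML string with a base URI of `https://example.com/`; the class name `LinkSketch`, the method name `headingLinks`, the sample HTML, and the base URI are all made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class LinkSketch {
    // Collects "label -> absolute URL" lines for every link inside an h3.
    static List<String> headingLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> lines = new ArrayList<>();
        for (Element link : document.select("h3 a")) {
            // text() is the link label; absUrl resolves href against baseUri.
            lines.add(link.text() + " -> " + link.absUrl("href"));
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<h3><a href=\"/posts/1\">First post</a></h3>"
                + "<h3><a href=\"https://example.org/2\">Second post</a></h3>";
        for (String line : headingLinks(html, "https://example.com/")) {
            System.out.println(line);
        }
    }
}
```

Combining text and absUrl in one loop gives both the link label and a usable address for each heading link.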