Development environment

Spring Tool Suite

What is scraping?

Scraping is the process of extracting specific data, such as images and titles, from the HTML of a website!

Library required for scraping

To scrape, we use a library called **jsoup**!

jsoup is a library for parsing HTML, and it provides various classes for working with the parsed result!

Now, let's add the following dependency to pom.xml.

```xml
<dependencies>

    <!-- other dependencies omitted -->

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
</dependencies>
```
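
Once the dependency is in place, a quick way to confirm that jsoup is on the classpath is to parse a small HTML string. The following is only a minimal sketch: the class name `JsoupCheck` and the sample HTML are made up for illustration, and no network access is involved.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class JsoupCheck {
    // Parses an in-memory HTML string and returns the content of its <title> tag.
    static String parseTitle(String html) {
        Document document = Jsoup.parse(html);
        return document.title();
    }

    public static void main(String[] args) {
        // Hypothetical sample HTML, used only to confirm the library loads.
        String html = "<html><head><title>Hello jsoup</title></head><body></body></html>";
        System.out.println(parseTitle(html)); // prints "Hello jsoup"
    }
}
```

If this compiles and runs, the jsoup dependency was resolved correctly.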

Steps to scrape

① Get HTML information from the website

② Search the information of the specified tag element from the HTML information

③ Let's extract text and attribute values from HTML information

① Get HTML information from the website

Use the **Document class** to work with HTML information. Declare a variable of type Document and assign the fetched HTML to it, as shown below!

```java
Document document = Jsoup.connect("url").get();
```

Passing a URL string to the connect method and then calling get fetches the HTML of the website at that URL. The result is assigned to a variable of type Document.
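
In practice the request can fail (bad URL, timeout, HTTP error), so it may help to set a few options and catch the exception. Here is a rough sketch under some assumptions: the class name `FetchExample`, the user agent string, the 10-second timeout, and the default URL `https://example.com/` are all illustrative choices, not values from the original article.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchExample {
    // Builds the request without sending it yet.
    // The user agent and timeout below are assumed, illustrative values.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; FetchExample)")
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "https://example.com/";
        try {
            // get() sends the request and parses the response body as HTML.
            Document document = prepare(url).get();
            System.out.println(document.title());
        } catch (IOException e) {
            // Network failures and HTTP error statuses are reported as IOException.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
        }
    }
}
```

Before scraping a real site, it is also worth checking its terms of use and keeping the request rate low.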

② Search the information of the specified tag element from the HTML information

To find specific tags in the HTML you obtained, use the **select method**.

```java
Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3");
```

The select method is called on the right-hand side of the second line. Because the string "h3" is passed as the argument, every h3 element in the HTML fetched from the specified URL is collected and assigned to a variable of type Elements. The Elements class holds Element objects as a list, and each Element represents a single HTML element.
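
To see how Elements and Element relate without touching the network, the same select call can be tried on an HTML string. This is just a sketch: the class name `SelectExample` and the sample markup are hypothetical, and `Jsoup.parse` is used here in place of `Jsoup.connect`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectExample {
    // Hypothetical sample page with two h3 headings.
    static final String HTML =
            "<html><body>"
            + "<h3>First heading</h3>"
            + "<p>Some text</p>"
            + "<h3>Second heading</h3>"
            + "</body></html>";

    // Returns every h3 element found in the given HTML.
    static Elements findHeadings(String html) {
        Document document = Jsoup.parse(html);
        return document.select("h3");
    }

    public static void main(String[] args) {
        Elements headings = findHeadings(HTML);
        System.out.println(headings.size()); // prints 2

        // Elements behaves like a list; first() returns null when nothing matched.
        Element first = headings.first();
        if (first != null) {
            System.out.println(first.tagName()); // prints "h3"
        }
    }
}
```

When the selector matches nothing, select returns an empty Elements rather than null, so looping over the result is safe.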

③ Let's extract text and attribute values from HTML information

Use the **text method** to get the text of an element, and the **attr method** to get the value of an attribute.

```java
Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3");

for (Element element : elements) {
    System.out.println(element.text());
}
```

This extracts the text of each "h3" element found by the select method and prints it to the console!
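
One detail worth knowing is that text() also includes the text of child elements. The sketch below tries this on an in-memory string; the class name `TextExample` and the sample markup are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextExample {
    // Hypothetical markup: one plain heading, one heading with nested tags.
    static final String HTML =
            "<h3>Plain title</h3>"
            + "<h3><a href=\"/news\">Linked <em>title</em></a></h3>";

    // Collects the text of every h3 element, including text inside child tags.
    static List<String> headingTexts(String html) {
        List<String> texts = new ArrayList<>();
        for (Element element : Jsoup.parse(html).select("h3")) {
            texts.add(element.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(headingTexts(HTML)); // prints [Plain title, Linked title]
    }
}
```

The tags themselves are dropped, so only the readable text remains.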

```java
Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3 a");

for (Element element : elements) {
    System.out.println(element.attr("href"));
}
```

This extracts the href attribute of each a element inside an h3 (the "h3 a" selector) and prints it to the console!
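
Links on real pages are often relative (for example `/articles/1`), and attr returns them exactly as written. If absolute URLs are needed, jsoup's absUrl method can resolve them against the document's base URI. The following is a sketch under assumptions: the class name `LinkExample`, the sample markup, and the base URI `https://example.com/` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkExample {
    // Hypothetical sample markup and base URI, used only for illustration.
    static final String HTML =
            "<h3><a href=\"/articles/1\">One</a></h3>"
            + "<h3><a href=\"https://other.example/2\">Two</a></h3>"
            + "<h3>No link here</h3>";
    static final String BASE_URI = "https://example.com/";

    // Returns href values exactly as they appear in the HTML.
    static List<String> rawLinks(String html, String baseUri) {
        List<String> links = new ArrayList<>();
        Document document = Jsoup.parse(html, baseUri);
        for (Element link : document.select("h3 a")) {
            links.add(link.attr("href"));
        }
        return links;
    }

    // Returns href values resolved against the base URI.
    static List<String> absoluteLinks(String html, String baseUri) {
        List<String> links = new ArrayList<>();
        Document document = Jsoup.parse(html, baseUri);
        for (Element link : document.select("h3 a")) {
            links.add(link.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(rawLinks(HTML, BASE_URI));
        // prints [/articles/1, https://other.example/2]
        System.out.println(absoluteLinks(HTML, BASE_URI));
        // prints [https://example.com/articles/1, https://other.example/2]
    }
}
```

When the page is fetched with Jsoup.connect, the base URI is set from the requested URL, so absUrl works without passing it by hand. Note that attr returns an empty string when the attribute does not exist.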

Let's scrape with Java!
