Easy web scraping with Jsoup

Introduction

Web scraping is easy with a library called Jsoup. Once you get the hang of it, you can also use it for web automation. I will also introduce some simple Jsoup-based processing that I have published as an API.

Notes

Whether scraping is allowed depends on the rules of the target site. For example, you must not scrape sites that prohibit it, such as Amazon. In some cases the site may take legal action, so be sure to follow the rules.

Try it out

This is the published API. It offers four operations:

- Output the HTML of a URL
- Extract the text from a URL and output it
- Extract all href links in the page at a URL and output them
- Extract the src links of all img tags in the page at a URL and output them

As a trial, let's run URL2TEXT on the [Wikipedia article about Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2). The middle is omitted, but the body text comes back like this.

{
  "log": "",
  "startTime": "1481723361899",
  "endTime": "1481723362888",
  "processTime": "989",
  "text": "Wikipedia - Wikipedia Wikipedia Source: the free encyclopedia \"Wikipedia\" Jump to: navigation, search This article describes Wikipedia as an encyclopedic article.
...
Last updated October 2, 2016 (Sun) 09:22 (dates and times are UTC unless set in personal preferences). Text is available under the Creative Commons Attribution-ShareAlike License. Additional conditions may apply. Please refer to the Terms of Use for details. Privacy Policy About Wikipedia Disclaimers Developers Cookie statement Mobile view"
}
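The article does not show the API's source, so the following is only a minimal sketch of how URL2HTML- and URL2TEXT-style output might be produced with Jsoup. The class name, the method names, and the user-agent string are assumptions made for illustration, not part of the published API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class Url2TextSketch {

    // Fetches the page over HTTP and parses it into a DOM (needs network access).
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; jsoup-sketch)") // hypothetical UA string
                .timeout(10_000) // milliseconds
                .get();
    }

    // URL2HTML-style output: the parsed document serialized back to HTML.
    static String toHtml(Document doc) {
        return doc.outerHtml();
    }

    // URL2TEXT-style output: the text of all elements, with tags removed
    // and runs of whitespace collapsed.
    static String toText(Document doc) {
        return doc.text();
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument the page is fetched; otherwise a small inline
        // document is parsed so the sketch can run offline.
        Document doc = args.length > 0
                ? fetch(args[0])
                : Jsoup.parse("<html><body><h1>Hello</h1><p>Jsoup sketch</p></body></html>");
        System.out.println(toHtml(doc));
        System.out.println(toText(doc));
    }
}
```

Because `text()` is called on the whole document, the page title is included in the result, which fits the sample response above starting with the title text.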

Next, let's run URL2SRC on the [Wikipedia article about Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%A6%E3%82%A3%E3%82%AD%E3%83%9A%E3%83%87%E3%82%A3%E3%82%A2). The middle is omitted, but you get the URLs of the img tags in the page like this.

{
  "log": "",
  "startTime": "1481724733607",
  "endTime": "1481724734550",
  "processTime": "943",
  "links": [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/Wikipedia%27s_W.svg/20px-Wikipedia%27s_W.svg.png",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/5f/Disambig_gray.svg/25px-Disambig_gray.svg.png",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/8/80/Wikipedia-logo-v2.svg/100px-Wikipedia-logo-v2.svg.png",
...
  ]
}
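The link-extraction operations (URL2HREF and URL2SRC) could be approximated with Jsoup selectors along the following lines. This is a sketch under assumptions: the class and method names and the example.org URLs are hypothetical, and the real API may work differently. It resolves each attribute against the page's base URI with `absUrl`, since pages often use relative or protocol-relative links while the response above lists full `https://` URLs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class Url2LinksSketch {

    // Collects one attribute from every element matching the selector,
    // resolved to an absolute URL against the document's base URI.
    static List<String> absoluteUrls(Document doc, String cssQuery, String attribute) {
        List<String> urls = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            String url = element.absUrl(attribute); // empty string if it cannot be resolved
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    // URL2HREF-style output: targets of all anchors that carry an href.
    static List<String> hrefs(Document doc) {
        return absoluteUrls(doc, "a[href]", "href");
    }

    // URL2SRC-style output: sources of all img tags that carry a src.
    static List<String> imageSources(Document doc) {
        return absoluteUrls(doc, "img[src]", "src");
    }

    public static void main(String[] args) {
        // Inline HTML with a base URI, so the sketch runs without network access.
        String html = "<a href=\"/wiki/Main_Page\">Main</a>"
                + "<img src=\"//upload.example.org/logo.png\">";
        Document doc = Jsoup.parse(html, "https://example.org/wiki/Demo");
        System.out.println(hrefs(doc));
        System.out.println(imageSources(doc));
    }
}
```

When a page is fetched with `Jsoup.connect(url).get()`, the base URI is set from the fetched location, so the same helpers apply unchanged to a live page.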

API call sample

- Java sample

Original articles

- [API] Extract HTML from a URL (URL2HTML, URL2TEXT)
- [API] Extract the links in a page from its URL (URL2HREF, URL2SRC)

In conclusion

Being able to scrape the web opens up a lot of possibilities at a weekend hackathon. Of course, follow the rules and use it in a way that does not cause trouble for others.
