Nokogiri is a Ruby library for web scraping. Web scraping is a technique for automatically extracting the data and text you want from web pages. It works by examining the HTML structure of a page and specifying the HTML tags and attributes that identify the desired data.
The basic scraping code looks like this.
scraping.rb
require 'open-uri' # Library for opening the URL of the page you want to scrape
require 'nokogiri' # The Nokogiri library
# Open the URL
url = URI.open('https://Target page url')
# Read the HTML of the page and convert it to a Nokogiri::HTML::Document object
doc = Nokogiri::HTML(url)
By calling various methods on this doc, you can extract just the part of the HTML you need.
Suppose you want to scrape a page like this:
<html>
<head><title>Favorite movie</title></head>
<body>
<h3>movies</h3>
<div class="movies">
<div>
<h3 class="title">In the sky of the show shank</h3>
<p>Human drama</p>
<p id="year">1994</p>
</div>
</div>
<div class="movies">
<div>
<h3 class="title">Star Wars</h3>
<p>SF</p>
<p id="year">1977</p>
</div>
</div>
</body>
</html>
The css method is available on Nokogiri::HTML::Document objects. Given a CSS selector as an argument, it extracts all the elements that match it. The result is an array-like collection containing every matching element (strictly speaking, a Nokogiri::XML::NodeSet, not an Array). If no element matches, an empty NodeSet is returned.
doc.css("h3") # Select by tag
# >> [<h3>movies</h3>, <h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]
doc.css(".title") # Select by class
# >> [<h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]
doc.css("#year") # Select by id
# >> [<p id="year">1994</p>, <p id="year">1977</p>]
doc.css("h1") # No matching element
# >> []
A similar method, at_css, also takes a CSS selector. Unlike css, it returns only the first matching element, even if several elements match. Also unlike css, it returns nil if no element matches.
doc.at_css("h3") # Select by tag
# >> <h3>movies</h3>
doc.at_css(".title") # Select by class
# >> <h3 class="title">In the sky of the show shank</h3>
doc.at_css("#year") # Select by id
# >> <p id="year">1994</p>
doc.at_css("h1") # No matching element
# >> nil
Depending on the page's HTML structure, a single tag, class, or id may not be enough to pin down the desired data. In that case, you can combine selectors as follows.
doc.css(".movies h3") # Separated by a space: extracts every h3 element nested anywhere inside an element with class movies
# >> [<h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]
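doc.css(".movies > h3") # Separated by >: extracts only h3 elements that are direct children of an element with class movies
# >> [] (on this page each h3 sits inside an intermediate div, so none is a direct child)
doc.css(".movies > div > h3") # Spell out the intermediate div to reach the h3 elements
# >> [<h3 class="title">In the sky of the show shank</h3>, <h3 class="title">Star Wars</h3>]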
doc.css("h3 + p") # Separated by +: extracts a p element that immediately follows an h3 at the same level
# >> [<p>Human drama</p>, <p>SF</p>]
doc.css("h3 ~ p") # Separated by ~: extracts every p element that follows an h3 at the same level
# >> [<p>Human drama</p>, <p id="year">1994</p>, <p>SF</p>, <p id="year">1977</p>]
You can also select by text content.
doc.css("h3:contains('Star Wars')") # :contains('string') matches elements whose text contains the string
# >> [<h3 class="title">Star Wars</h3>]
Although not covered in detail here, each element obtained is a Nokogiri::XML::Element object, from which you can extract the text or get the URL specified in an a tag. Because the right CSS selector depends on the target HTML structure and on the data you want, it is difficult to cover every pattern in one article. Try extracting the data you need by combining the selection methods introduced here.