Now that I've learned how to implement scraping, I'm writing it up as a learning output.
After reading this article, you should understand what scraping is and what you need to do to implement it.
What is scraping? It means retrieving information from a website and processing it to produce new information. For example, visiting various restaurant sites and compiling a price list.
If you want to know more details, please search on Google.
Now, here is how to implement scraping.
1 Add the "mechanize" gem to your Gemfile
gem 'mechanize'
Then run bundle install in the terminal.
2 Create an instance of the Mechanize class
require 'mechanize'
agent = Mechanize.new # Create an instance of the Mechanize class and assign it to the variable agent
3 Get the website's HTML
Use the instance method "get" of the Mechanize class to fetch the HTML of the website you want to scrape.
page = agent.get("https://www.google.com/?hl=ja")
4 Use the search method to search for HTML elements
Call the search method on the page object returned by the get method. It finds the elements matching the specified selector in the HTML that was fetched. Even if only one element matches, the result is returned as an array-like collection (a Nokogiri::XML::NodeSet), not as a single element.
require 'mechanize'
agent = Mechanize.new
page = agent.get("https://www.google.com/?hl=ja")
elements = page.search('h1')
↑ This retrieves the h1 elements from https://www.google.com/?hl=ja.
5 inner_text method
To get the text of the elements returned by the search method, use the inner_text method.
require 'mechanize'
agent = Mechanize.new
page = agent.get("URL of the website you want to scrape")
elements = page.search('h2 a') # Search for a elements under h2 elements
elements.each do |ele|
  puts ele.inner_text # Print the text of each matching element
end
6 get_attribute method
To get the value of an HTML attribute, use the get_attribute method. For example, an a element has an "href" attribute whose value is the URL of the link destination. Writing get_attribute('attribute name') returns the value of the attribute named in the argument.
require 'mechanize'
agent = Mechanize.new
page = agent.get("URL of the website you want to scrape")
elements = page.search('h2 a') # Search for a elements under h2 elements
elements.each do |ele|
  puts ele.get_attribute('href') # puts ele[:href] also works
end
Summary
● Create an instance of the Mechanize class.
● Get the HTML of the website with the instance method get(URL of the website you want information from) of the Mechanize class.
● Specify the tag elements that hold the data you want with the search method.
● Extract the information you want by calling the inner_text and get_attribute methods on the elements you found.