I used to scrape with Python's **BeautifulSoup**, but the same thing can be done in Ruby with a library called **Nokogiri**, so I gave it a try.
First, here is the completed code and the finished result.
scraping.rb
require 'nokogiri'
require 'open-uri'
require "csv"
require "byebug" # only needed while debugging

url_base = "https://news.yahoo.co.jp/"

# Return the href of every category link in the header
def get_categories(url)
  # Kernel#open no longer opens URLs in Ruby 3.0+, so call URI.open explicitly
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  categories = doc.css(".yjnHeader_sub_cat li a")
  categories.map do |category|
    category[:href]
  end
end

@cat_list = get_categories(url_base)
@infos = []

@cat_list.each do |cat|
  url = url_base + cat
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  titles = doc.css(".topicsListItem a")
  i = 1
  titles.each do |title|
    @infos << [i, title.text]
    i += 1
  end
end

CSV.open("result.csv", "w") do |csv|
  @infos.each do |info|
    csv << info
    puts "-------------------------------"
    puts info
  end
end
I will explain each part in turn.
require 'nokogiri'
require 'open-uri'
require "csv"
require "byebug"
This time I use **Nokogiri** and **open-uri** to fetch and parse the pages, and **csv** to save the result as CSV. (byebug is a debugger and is not needed to run the script.)
Nokogiri is a Ruby library that parses HTML and XML and lets you extract elements with selectors. In addition to **CSS** selectors, you can also use **XPath**, so scraping goes smoothly even on pages with a complicated structure.
This time, we will get the titles listed under each topic and finally put them together in a CSV file.
The topic pages appear to be reachable from the links (a tags) inside the li elements under the class **yjnHeader_sub_cat**.
url_base = "https://news.yahoo.co.jp/"

def get_categories(url)
  # Kernel#open no longer opens URLs in Ruby 3.0+, so call URI.open explicitly
  html = URI.open(url)
  # Parse the fetched HTML into a document
  doc = Nokogiri::HTML.parse(html)
  # Use a CSS selector to get all the a tags that link to the categories mentioned above
  categories = doc.css(".yjnHeader_sub_cat li a")
  categories.map do |category|
    # Take the href (the link URL) out of each a tag and return it
    category[:href]
  end
end

# Collect the links obtained into @cat_list
@cat_list = get_categories(url_base)
Next, we get the titles on each topic page using the links we just collected.
@infos = []

@cat_list.each do |cat|
  # The URL of a topic page is the base URL plus the href obtained above
  url = url_base + cat
  html = URI.open(url)
  doc = Nokogiri::HTML.parse(html)
  titles = doc.css(".topicsListItem a")
  i = 1
  titles.each do |title|
    # Store the topic number and the title as a pair so they can be written to CSV
    @infos << [i, title.text]
    i += 1
  end
end
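Plain string concatenation works only when the hrefs are relative paths without a leading slash. I have not confirmed what form the real hrefs take, so as a hedge, here is a small sketch using `URI.join` from the standard library, which handles relative, root-relative, and absolute hrefs alike. The helper name `topic_url` and the sample paths are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: resolve an href against the base URL.
def topic_url(base, href)
  URI.join(base, href).to_s
end

base = "https://news.yahoo.co.jp/"

# The paths below are illustrative, not taken from the real page.
puts topic_url(base, "categories/domestic")
puts topic_url(base, "/categories/domestic")
puts topic_url(base, "https://news.yahoo.co.jp/categories/domestic")
```

All three calls resolve to the same URL, whereas concatenation would produce a double slash in the second case and a broken URL in the third.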
Finally, save the collected titles to CSV.
# Create a new "result.csv" using the CSV library
CSV.open("result.csv", "w") do |csv|
  @infos.each do |info|
    # Print each row as a log while adding it to the CSV
    csv << info
    puts "-------------------------------"
    puts info
  end
end
However, as it stands, the characters will probably be garbled when you open the file, so save it again with a BOM. (The proper fix would be to do this while writing the CSV, but I couldn't get that to work, so I handled it here.)
Open "result.csv" in **Notepad** and save it over the original file.
When saving, select **UTF-8 (with BOM)** as the encoding.
When you open the CSV again, the garbled characters are gone.
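For reference, here is a minimal sketch of one way to add the BOM at write time instead of re-saving by hand. It writes the BOM character explicitly and then hands the open file to `CSV.new`. The helper name `write_csv_with_bom`, the output path, and the sample rows are made up for illustration, and I have not tried it against the real scraped data.

```ruby
require 'csv'
require 'tmpdir'

BOM = "\uFEFF" # encoded as the bytes EF BB BF in UTF-8

# Hypothetical helper: write rows as UTF-8 CSV with a leading BOM.
def write_csv_with_bom(path, rows)
  File.open(path, "w:UTF-8") do |file|
    file.write(BOM)      # the BOM must come before any CSV data
    csv = CSV.new(file)  # wrap the already-open file
    rows.each { |row| csv << row }
  end
end

# Written to the temp directory so the sketch does not overwrite result.csv.
path = File.join(Dir.tmpdir, "result_bom.csv")
write_csv_with_bom(path, [[1, "サンプルの見出し"], [2, "Another title"]])
puts "wrote #{path}"
```

In the script above, the same idea would mean replacing the `CSV.open("result.csv", "w")` block with a call like `write_csv_with_bom("result.csv", @infos)`.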
I'm sure there is still plenty of room for improvement, so I would appreciate any suggestions in the comments.