This article introduces several web scraping techniques: fetching the main news from Yahoo! Japan with GAS, VBA, PHP, and Python, respectively. Conclusion first: GAS + Spreadsheet is the easiest to use and is the approach I recommend.
GAS + Spreadsheet
If you have a Google account, no environment setup is needed, and the script can be run on a schedule with a GAS trigger.
There are performance concerns for large-scale execution, and for security reasons the handling of sensitive data is somewhat restricted.
gas.js
function yahoo() {
  // Specify the sheet name to write to
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('yahoo');
  // Specify the target page
  var getUrl = 'https://www.yahoo.co.jp/';
  var content = UrlFetchApp.fetch(getUrl).getContentText('UTF-8');
  content = content.replace(/\r?\n/g, '');
  // Extract the main news block
  var data = content.match(/<div class="_2jjSS8r_I9Zd6O9NFJtDN-" aria-label="Major news">(.*?)<\/div><div>/);
  data = data[0];
  // Collect the URLs into an array
  var URLs = data.match(/<a class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" href="(.*?)"/g);
  // Collect the article titles into an array
  var Titles = data.match(/<span class="fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR">(.*?)<\/span>/g);
  for (var i = 0; i < 8; i++) {
    var URL = URLs[i];
    URL = URL.replace('<a class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" href="', '');
    URL = URL.replace('"', '');
    sheet.getRange(i + 2, 1).setValue(URL);
    var Title = Titles[i];
    Title = Title.replace('<span class="fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR">', '');
    Title = Title.replace('</span>', '');
    sheet.getRange(i + 2, 2).setValue(Title);
  }
  Browser.msgBox('Done!');
}
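To run this on a schedule, as mentioned above, you can register a time-based trigger. A minimal sketch of doing so programmatically (the createDailyTrigger name and the once-a-day interval are just example choices; you can also set the trigger up from the Apps Script editor's Triggers menu):
function createDailyTrigger() {
  // Example only: runs yahoo() once a day; execute this function once to register the trigger
  ScriptApp.newTrigger('yahoo')
      .timeBased()
      .everyDays(1)
      .create();
}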
VBA + Excel
Even users with little programming experience can work in the familiar Excel environment.
A Windows machine is required because the code depends on the IE object, and IE's quirks often fail on modern websites.
In Excel's VBA editor, open Menu bar > Tools > References and check the following two libraries:
- Microsoft HTML Object Library
- Microsoft Internet Controls
Sub GetData_Click()
    Application.ScreenUpdating = False
    Dim objIE As InternetExplorer
    Dim htmlDoc As HTMLDocument
    Dim NewsItem As IHTMLElement
    Dim NewsList As IHTMLElementCollection
    Dim NewsTitle As IHTMLElementCollection, NewsURL As IHTMLElementCollection
    Dim PageURL As String
    Dim LastRow As Long
    PageURL = "https://www.yahoo.co.jp/"

    'Create and set a new IE object
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False

    'Open the URL in IE
    objIE.navigate PageURL

    'Wait for the page to finish loading
    Do While objIE.Busy = True Or objIE.readyState < READYSTATE_COMPLETE
        DoEvents
    Loop

    'Set the HTML document loaded by objIE
    Set htmlDoc = objIE.document

    'Get the content of each item
    Set NewsList = htmlDoc.getElementsByClassName("_2jjSS8r_I9Zd6O9NFJtDN-")
    For Each NewsItem In NewsList
        'Specify the search criteria for each tag
        Set NewsURL = NewsItem.getElementsByTagName("a")
        Set NewsTitle = NewsItem.getElementsByClassName("fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR")

        'Get the last filled row
        LastRow = Worksheets("result").Cells(Rows.Count, 1).End(xlUp).Row

        'Fill in the corresponding value for each cell
        Worksheets("result").Cells(LastRow + 1, 1).Value = NewsURL(0).href
        Worksheets("result").Cells(LastRow + 1, 2).Value = NewsTitle(0).innerText
    Next NewsItem
    Application.ScreenUpdating = True
    MsgBox "Done!"
End Sub
PHP + csv
Performance is good, and there are few dependencies.
You need an environment that can run PHP.
Download phpQuery-onefile.php and place it in the same directory as the main PHP file.
scraping.php
<?php
require_once("phpQuery-onefile.php");
$path = "xxx/xxx.csv"; // Specify the csv file to output
$header = "URL" . ", " . "Title" . "\n";
$file = fopen($path, "w");
$target = "https://www.yahoo.co.jp/";
$html = file_get_contents($target);
$list = phpQuery::newDocument($html)->find("div._2jjSS8r_I9Zd6O9NFJtDN-");
$data = "";
for ($i = 0; $i < 8; $i++) {
    $url[$i] = phpQuery::newDocument($list)->find("li:eq($i)")->find("a")->attr("href");
    // Strip commas from the title so they do not break the csv format
    $title[$i] = str_replace(",", "", phpQuery::newDocument($list)->find("span.fQMqQTGJTbIMxjQwZA2zk._3tGRl6x9iIWRiFTkKl3kcR:eq($i)")->text());
    $data .= $url[$i] . ", " . $title[$i] . "\n";
}
$content = $header . $data;
fputs($file, $content);
fclose($file);
?>
Python + csv
Performance is good, and if you already use Python it is easy to tie into your existing work.
Setting up a Python environment is a bit of a hassle. If you already use Python you can hook the scraper into all sorts of other workflows, but if all you need is scraping, a PHP environment is far quicker to set up.
pip3 install requests
pip3 install beautifulsoup4
scraping.py
# coding: utf-8
import csv

import requests
from bs4 import BeautifulSoup

r = requests.get("https://yahoo.co.jp/")
data = BeautifulSoup(r.content, "html.parser")
# Extract the main news block
news = data.find("div", "_2jjSS8r_I9Zd6O9NFJtDN-")
links = news.find_all("a")
titles = news.find_all("span", "fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR")
# Decide where to place the csv file
with open('xxx/xxx.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])
    for i in range(8):
        writer.writerow([links[i].attrs['href'], titles[i].text])
If the content cannot be viewed without logging in, ChromeDriver with Selenium is useful. For more on how to use ChromeDriver, see the article "Complete automatic operation of Chrome with Python + Selenium". For how to set the PATH and notes on versions, see "[For selenium] How to install Chrome Driver with pip (no need to pass the PATH, version can be specified)".
pip3 install selenium
pip3 install chromedriver-binary
login_scraping.py
# coding: utf-8
import csv
import os
import time

from selenium import webdriver
import chromedriver_binary  # adds chromedriver to the PATH

# Specify the user name and password
user_id = "xxxxxxxxxx"
password = "xxxxxxxxxx"
download_dir = os.path.dirname(__file__)

# Specify the login page URL
login = 'https://www.example.com/login/'

# Start Chrome, specifying options such as the download destination
opt = webdriver.ChromeOptions()
opt.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
    "plugins.always_open_pdf_externally": True
})
driver = webdriver.Chrome(options=opt)

# Open the login page
driver.get(login)
time.sleep(3)  # Wait until the page opens

# Fill in the user name on the login screen
u = driver.find_element_by_name('xxx')
u.send_keys(user_id)

# Fill in the password on the login screen and submit
p = driver.find_element_by_name('yyy')
p.send_keys(password)
p.submit()
time.sleep(3)  # Wait until the page opens

driver.get("https://www.example.com/listdata/")
news = driver.find_element_by_class_name("_2jjSS8r_I9Zd6O9NFJtDN-")
links = news.find_elements_by_tag_name("a")
titles = news.find_elements_by_css_selector("span.fQMqQTGJTbIMxjQwZA2zk._3tGRl6x9iIWRiFTkKl3kcR")
# Decide where to place the csv file
with open('xxx/xxx.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])
    for i in range(8):
        writer.writerow([links[i].get_attribute('href'), titles[i].text])
time.sleep(3)  # Wait before closing
driver.quit()
So far we have introduced five methods. Each has its advantages and disadvantages depending on how it will be used. I hope one of them proves useful to you.