This article introduces several web scraping techniques: fetching the main news from Yahoo! Japan with GAS, VBA, PHP, and Python, respectively. Conclusion first: GAS + Spreadsheet is the easiest to use and is the approach I recommend.
GAS + Spreadsheet
If you have a Google account, no environment setup is needed, and the script can be run on a schedule with a GAS trigger.
There are performance concerns for large-scale execution, and for security reasons the handling of sensitive data is somewhat restricted.
gas.js
function yahoo() {
  // Specify the sheet name to write to
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('yahoo');
  // Specify the target page
  var getUrl = 'https://www.yahoo.co.jp/';
  var content = UrlFetchApp.fetch(getUrl).getContentText('UTF-8');
  content = content.replace(/\r?\n/g, '');
  // Extract the main news block
  var data = content.match(/<div class="_2jjSS8r_I9Zd6O9NFJtDN-" aria-label="Major news">(.*?)<\/div><div>/);
  data = data[0];
  // Collect the URLs into an array
  var URLs = data.match(/<a class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" href="(.*?)"/g);
  // Collect the article titles into an array
  var Titles = data.match(/<span class="fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR">(.*?)<\/span>/g);
  for (var i = 0; i < 8; i++) {
    var URL = URLs[i];
    URL = URL.replace('<a class="yMWCYupQNdgppL-NV6sMi _3sAlKGsIBCxTUbNi86oSjt" href="', '');
    URL = URL.replace('"', '');
    sheet.getRange(i + 2, 1).setValue(URL);
    var Title = Titles[i];
    Title = Title.replace('<span class="fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR">', '');
    Title = Title.replace('</span>', '');
    sheet.getRange(i + 2, 2).setValue(Title);
  }
  Browser.msgBox('Done!');
}
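To run this on a schedule, as mentioned above, you can register a time-based trigger. A minimal sketch of doing so programmatically (the createDailyTrigger name and the once-a-day interval are just example choices; you can also set the trigger up from the Apps Script editor's Triggers menu):
function createDailyTrigger() {
  // Example only: runs yahoo() once a day; execute this function once to register the trigger
  ScriptApp.newTrigger('yahoo')
      .timeBased()
      .everyDays(1)
      .create();
}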
VBA + Excel
Even users with little programming experience can work in the familiar Excel environment.
A Windows machine is required because the code depends on the IE object, and IE's quirks often fail on modern websites.
In Excel's VBA editor, open Menu bar > Tools > References and check the following two libraries:
- Microsoft HTML Object Library
- Microsoft Internet Controls
Sub GetData_Click()
    Application.ScreenUpdating = False
    Dim objIE As InternetExplorer
    Dim htmlDoc As HTMLDocument
    Dim NewsItem As IHTMLElement
    Dim NewsList As IHTMLElementCollection
    Dim NewsTitle As IHTMLElementCollection, NewsURL As IHTMLElementCollection
    Dim PageURL As String
    Dim LastRow As Long
    PageURL = "https://www.yahoo.co.jp/"

    'Create and set a new IE object
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False

    'Open the URL in IE
    objIE.navigate PageURL

    'Wait for the page to finish loading
    Do While objIE.Busy = True Or objIE.readyState < READYSTATE_COMPLETE
        DoEvents
    Loop

    'Set the HTML document loaded by objIE
    Set htmlDoc = objIE.document

    'Get the content of each item
    Set NewsList = htmlDoc.getElementsByClassName("_2jjSS8r_I9Zd6O9NFJtDN-")
    For Each NewsItem In NewsList
        'Specify the search criteria for each tag
        Set NewsURL = NewsItem.getElementsByTagName("a")
        Set NewsTitle = NewsItem.getElementsByClassName("fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR")

        'Get the last filled row
        LastRow = Worksheets("result").Cells(Rows.Count, 1).End(xlUp).Row

        'Fill in the corresponding value for each cell
        Worksheets("result").Cells(LastRow + 1, 1).Value = NewsURL(0).href
        Worksheets("result").Cells(LastRow + 1, 2).Value = NewsTitle(0).innerText
    Next NewsItem
    Application.ScreenUpdating = True
    MsgBox "Done!"
End Sub
PHP + csv
Performance is good, and there are few dependencies.
You need an environment that can run PHP.
Download phpQuery-onefile.php and place it in the same directory as the main PHP file.
scraping.php
<?php
require_once("phpQuery-onefile.php");
$path = "xxx/xxx.csv"; // Specify the csv file to output
$header = "URL" . ", " . "Title" . "\n";
$file = fopen($path, "w");
$target = "https://www.yahoo.co.jp/";
$html = file_get_contents($target);
$list = phpQuery::newDocument($html)->find("div._2jjSS8r_I9Zd6O9NFJtDN-");
$data = "";
for ($i = 0; $i < 8; $i++) {
    $url[$i] = phpQuery::newDocument($list)->find("li:eq($i)")->find("a")->attr("href");
    // Strip commas from the title so they do not break the csv format
    $title[$i] = str_replace(",", "", phpQuery::newDocument($list)->find("span.fQMqQTGJTbIMxjQwZA2zk._3tGRl6x9iIWRiFTkKl3kcR:eq($i)")->text());
    $data .= $url[$i] . ", " . $title[$i] . "\n";
}
$content = $header . $data;
fputs($file, $content);
fclose($file);
?>
Python + csv
Performance is good, and if you already use Python it is easy to tie into your existing work.
Setting up a Python environment is a bit of a hassle. If you already use Python you can hook the scraper into all sorts of other workflows, but if all you need is scraping, a PHP environment is far quicker to set up.
pip3 install requests
pip3 install beautifulsoup4
scraping.py
# coding: utf-8
import csv

import requests
from bs4 import BeautifulSoup

r = requests.get("https://yahoo.co.jp/")
data = BeautifulSoup(r.content, "html.parser")
# Extract the main news block
news = data.find("div", "_2jjSS8r_I9Zd6O9NFJtDN-")
links = news.find_all("a")
titles = news.find_all("span", "fQMqQTGJTbIMxjQwZA2zk _3tGRl6x9iIWRiFTkKl3kcR")
# Decide where to place the csv file
with open('xxx/xxx.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])
    for i in range(8):
        writer.writerow([links[i].attrs['href'], titles[i].text])
If the content cannot be viewed without logging in, ChromeDriver with Selenium is useful. For more on how to use ChromeDriver, see the article "Complete automatic operation of Chrome with Python + Selenium". For how to set the PATH and notes on versions, see "[For selenium] How to install Chrome Driver with pip (no need to pass the PATH, version can be specified)".
pip3 install selenium
pip3 install chromedriver-binary
login_scraping.py
# coding: utf-8
import csv
import os
import time

from selenium import webdriver
import chromedriver_binary  # adds chromedriver to the PATH

# Specify the user name and password
user_id = "xxxxxxxxxx"
password = "xxxxxxxxxx"
download_dir = os.path.dirname(__file__)

# Specify the login page URL
login = 'https://www.example.com/login/'

# Start Chrome, specifying options such as the download destination
opt = webdriver.ChromeOptions()
opt.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
    "plugins.always_open_pdf_externally": True
})
driver = webdriver.Chrome(options=opt)

# Open the login page
driver.get(login)
time.sleep(3)  # Wait until the page opens

# Fill in the user name on the login screen
u = driver.find_element_by_name('xxx')
u.send_keys(user_id)

# Fill in the password on the login screen and submit
p = driver.find_element_by_name('yyy')
p.send_keys(password)
p.submit()
time.sleep(3)  # Wait until the page opens

driver.get("https://www.example.com/listdata/")
news = driver.find_element_by_class_name("_2jjSS8r_I9Zd6O9NFJtDN-")
links = news.find_elements_by_tag_name("a")
titles = news.find_elements_by_css_selector("span.fQMqQTGJTbIMxjQwZA2zk._3tGRl6x9iIWRiFTkKl3kcR")
# Decide where to place the csv file
with open('xxx/xxx.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])
    for i in range(8):
        writer.writerow([links[i].get_attribute('href'), titles[i].text])
time.sleep(3)  # Wait before closing
driver.quit()
So far we have introduced five methods. Each has its advantages and disadvantages depending on how it will be used. I hope one of them proves useful to you.