Web scraping for beginners with Python

What is this?

Notes shared for a specific audience.

The story of a beginner struggling to reproduce the work of experienced predecessors while web scraping with Python.

What I want to do

I would like to reproduce "Automatically read and write Google spreadsheets using Python" (id: temcee).

Preparation

1: Install PhantomJS

"How to install PhantomJS on Windows 7, maechabin".

2: Enable the Google API and issue a key

Following "Automatically read and write Google spreadsheets using Python" (id: temcee), obtain the JSON file with the key saved. At this point:

- Grant access rights on the destination Spread Sheet to the project that uses the API.
- Make a note of the Spread Sheet ID of the destination.

"[GAS] How to read Google Spreadsheet ID" (Pepper, http://somen.site/2018/07/06/ [gas] How to read Google Spreadsheet id /) is helpful.
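
The script below reads several fields out of this JSON file, so it is worth checking up front that the downloaded file contains the expected service-account keys. A minimal sketch, assuming the file is saved as spread_sheet_credential.json in the working directory:

import json

# Keys the script reads from the service account credential file
EXPECTED_KEYS = ['project_id', 'private_key_id', 'private_key',
                 'client_email', 'client_id', 'client_x509_cert_url']

with open('spread_sheet_credential.json', 'r') as f:
    cred_info = json.load(f)

missing = [key for key in EXPECTED_KEYS if key not in cred_info]
print('Missing keys: {}'.format(missing) if missing else 'Credential file looks OK')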

3: Execute

I rewrote the original script a little.

# coding=utf-8
import os
import json
import gspread
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from oauth2client.service_account import ServiceAccountCredentials


# Load the service account credential JSON (adjust the path to your environment)
cred_info = json.load(open("{Rewrite according to the environment}/spread_sheet_credential.json", "r"))


SCOPE_URL = 'https://spreadsheets.google.com/feeds'
CREDENTIAL_FILE_NAME = 'spread_sheet_credential.json'
TEMPLATE_FILE_NAME = 'spread_sheet_credential_template.txt'
SHEET_PROJECT_ID = cred_info['project_id']
SHEET_PRIVATE_KEY_ID = cred_info['private_key_id']
SHEET_PRIVATE_KEY = cred_info['private_key']
SHEET_CLIENT_EMAIL = cred_info['client_email']
SHEET_CLIENT_ID = cred_info['client_id']
SHEET_CLIENT_X509_CERT_URL = cred_info['client_x509_cert_url']


def write_news(sheet, link, max_loop_count):
    driver = webdriver.PhantomJS()
    driver.get(link)
    loop_count = 0
    while loop_count < max_loop_count:
        loop_count += 1
        print('-------------- Page access #{} --------------'.format(loop_count))
        # Write to the Spread Sheet
        write_techcrunch_news_elements(driver, sheet)
        # Access the next page
        driver = access_to_next(driver)


def access_to_next(driver):
    next_link = driver.find_element_by_link_text('next')
    # Build the next page's URL in advance so we can fall back to it on a timeout
    page_content = '/page/'
    url = driver.current_url
    splited_url_contents = url.split(page_content)
    next_url = splited_url_contents[0] + page_content + str(int(splited_url_contents[1].split('/')[0]) + 1)
    try:
        next_link.click()
    except Exception as e:
        print('A timeout occurred, so accessing "{}" directly.'.format(next_url))
        driver = webdriver.PhantomJS()
        driver.get(next_url)
    return driver


def write_techcrunch_news_elements(driver, sheet):
    # Wait up to 10 seconds to allow the page to fully load
    driver.set_page_load_timeout(10)
    title_dict = {}
    blocks = driver.find_elements_by_class_name('river-block')
    count = 0
    for block in blocks:
        count += 1
        ad_contain = None
        print('----- river-block #{} -----'.format(count))
        try:
            ad_contain = block.find_element_by_class_name('ad-contain')
        except Exception as e:
            try:
                news_title = block.find_element_by_class_name('post-title').find_element_by_tag_name('a').text
                news_time = block.find_element_by_tag_name('time').get_attribute('datetime')
                title_dict[news_title] = news_time
                print('News No.{} title:{} date:{}'.format(count, news_title, news_time))
            except Exception as e:
                print('River-block #{} was a sponsored article.'.format(count))
                continue
        if ad_contain is not None:
            print('River-block #{} was an advertisement.'.format(count))
    write_to_sheet(sheet, title_dict)


def write_to_sheet(sheet, title_dict):
    keys = list(title_dict.keys())
    values = list(title_dict.values())
    titles = sheet.col_values(1)
    start_row_num = len(titles) + 1
    start_row = str(start_row_num)
    end_row = str(len(keys) + start_row_num)
    # Write titles to column A and timestamps to column B of the Spread Sheet
    update_cells_with_list(sheet, 'A'+start_row, 'A'+end_row, keys, value_input_option='USER_ENTERED')
    update_cells_with_list(sheet, 'B'+start_row, 'B'+end_row, values, value_input_option='USER_ENTERED')


def access_to_sheet(gid):
    # Authenticate with the service account credential file and open the spreadsheet
    credentials = ServiceAccountCredentials.from_json_keyfile_name(CREDENTIAL_FILE_NAME, SCOPE_URL)
    client = gspread.authorize(credentials)
    return client.open_by_key(gid)


def update_cells_with_list(sheet, from_cell, to_cell, id_list, value_input_option):
    cell_list = sheet.range('{}:{}'.format(from_cell, to_cell))
    for count_num, cell in enumerate(cell_list):
        try:
            val = id_list[count_num]
        except IndexError:
            continue
        if val is None:
            continue
        cell.value = val
    print('Writing cells from {} to {}.'.format(from_cell, to_cell))
    sheet.update_cells(cell_list, value_input_option=value_input_option)


# Spread Sheet ID of the spreadsheet you want to write to
sheet_gid = '{Spreadsheet ID}'
sheet_name = '{name of the destination worksheet}'
target_link = 'https://jp.techcrunch.com/page/149/'
max_loop_count = 50
# Open the destination worksheet (write_to_sheet finds the first writable row) and start scraping
sheet = access_to_sheet(sheet_gid).worksheet(sheet_name)
write_news(sheet, target_link, max_loop_count)
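
After the run finishes, reading the first column back is a quick way to confirm the titles actually landed on the sheet. A minimal sketch reusing the helpers above (assuming the same sheet_gid and sheet_name):

# Read back column A to verify the write
sheet = access_to_sheet(sheet_gid).worksheet(sheet_name)
titles = sheet.col_values(1)
print('{} rows written; last title: {}'.format(len(titles), titles[-1] if titles else '(none)'))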
