[Python / Ruby] Understanding through code: how to get data online and write it to CSV

Every so often you need to fetch data online, via an API or by scraping, and export it to CSV. When that comes up I refer back to articles I posted before, but the material was scattered across several of them, so here I consolidate it into one. Personally I usually reach for Python or Ruby in these cases, so this is my personal approach in those two languages.

Past articles

- Get the upcoming weather from python weather api
- Topic model by LDA with gensim ~ Thinking about user's taste from Qiita tag ~
- How to use Rails scraping method Mechanize
- Notes for handling Ruby CSV

Overview

This article proceeds mainly through code. It covers:

- Getting data with requests and BeautifulSoup in Python and converting it to CSV.
- Getting data with Mechanize in Ruby and converting it to CSV.

Python

Use of API

urllib2

In the following article I wrote earlier, I used `urllib2` to get the data, as in the code below. Get the upcoming weather from python weather api

I was using Python 2 at the time, so this is Python 2 code. In Python 3 the `urllib2` library was reorganized:

The urllib2 module has been split into urllib.request and urllib.error in Python 3. The 2to3 tool will automatically fix the source code import. (http://docs.python.jp/2/library/urllib2.html)

import urllib2, sys
import json

try: citycode = sys.argv[1]
except: citycode = '460010' #Default region
resp = urllib2.urlopen('http://weather.livedoor.com/forecast/webservice/json/v1?city=%s'%citycode).read()

#Convert the read JSON data to dictionary type
resp = json.loads(resp)
print '**************************'
print resp['title']
print '**************************'
print resp['description']['text']

for forecast in resp['forecasts']:
    print '**************************'
    print forecast['dateLabel']+'('+forecast['date']+')'
    print forecast['telop']
print '**************************'
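For reference, here is my sketch of the same script ported to Python 3 with `urllib.request` (this is an addition, not code from the original article; the livedoor endpoint may no longer be available, so the network call is illustrative and the demo below runs on an offline sample):

```python
import json
import urllib.request

def fetch_weather(citycode='460010'):
    """Python 3 port of the urllib2 call above."""
    url = 'http://weather.livedoor.com/forecast/webservice/json/v1?city=%s' % citycode
    # urllib2.urlopen -> urllib.request.urlopen; it now returns bytes,
    # so decode before handing the payload to json.loads
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode('utf-8'))

def summarize(data):
    """Build the same lines the Python 2 script printed."""
    lines = [data['title'], data['description']['text']]
    for forecast in data['forecasts']:
        lines.append('%s(%s): %s' % (forecast['dateLabel'],
                                     forecast['date'], forecast['telop']))
    return lines

# Offline demonstration with a payload shaped like the API response:
sample = {'title': 'Kagoshima', 'description': {'text': 'Clear skies'},
          'forecasts': [{'dateLabel': 'Today', 'date': '2016-06-12', 'telop': 'Sunny'}]}
print(summarize(sample))
```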

requests

In Python 3 I now use `requests`. Rewriting the above gives the following.

import requests, sys

try: citycode = sys.argv[1]
except: citycode = '460010' #Default region
resp = requests.get('http://weather.livedoor.com/forecast/webservice/json/v1?city=%s'%citycode)

resp = resp.json()
print('**************************')
print(resp['title'])
print('**************************')
print(resp['description']['text'])

for forecast in resp['forecasts']:
    print('**************************')
    print(forecast['dateLabel']+'('+forecast['date']+')')
    print(forecast['telop'])
print('**************************')

You can check the details in the documentation. The Requests documentation is written quite carefully, which is a pleasure. Requests: HTTP for Humans

If you want to check how it is used, the following article is also helpful: How to use Requests (Python Library)
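One nicety of `requests` not used above (my addition, not the article's code): you can hand it the query parameters as a dict via `params` instead of formatting them into the URL by hand. A minimal sketch, using a prepared request so it runs without touching the network:

```python
import requests

# requests assembles the query string from a dict, so no manual '%s' formatting.
req = requests.Request(
    'GET',
    'http://weather.livedoor.com/forecast/webservice/json/v1',
    params={'city': '460010'},
).prepare()
print(req.url)  # the prepared URL with ?city=460010 appended
```

In a real call you would simply write `requests.get(url, params={'city': citycode})` and then call `.json()` as in the code above.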

Scraping

Here again I would like to use `requests` to fetch the data. The code below scrapes the names of Japanese actors and actresses from Wikipedia. I use BeautifulSoup as the parser for the fetched HTML; it is convenient because it can handle XML as well.

In short, scraping in Python means `requests` plus BeautifulSoup.

With BeautifulSoup, I find it easiest to select elements with a CSS selector via the `select` method.

import requests
from bs4 import BeautifulSoup
import csv
import time

base_url = 'https://en.wikipedia.org/wiki/'

url_list = ['List_of_Japanese_actors', 'List_of_Japanese_actresses']

for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    target_html = requests.get(target_url).text
    soup = BeautifulSoup(target_html, 'html.parser')
    names = soup.select('#mw-content-text > h2 + ul > li > a')


    for name in names:
        print(name.get_text())

    time.sleep(1) 
    print('scraping page: ' + str(i + 1))

For more information, see the Beautiful Soup Documentation. For a rough overview, see Scraping with Python and Beautiful Soup.
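To see what `select` does without depending on the live Wikipedia markup, here is a self-contained sketch of the same selector on inline HTML (the HTML snippet is made up for illustration, shaped like the page structure the selector targets):

```python
from bs4 import BeautifulSoup

html = '''
<div id="mw-content-text">
  <h2>A</h2>
  <ul>
    <li><a href="/wiki/Hiroshi_Abe">Hiroshi Abe</a></li>
    <li><a href="/wiki/Jin_Akanishi">Jin Akanishi</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# '+' is the adjacent-sibling combinator: a <ul> directly following an <h2>
names = [a.get_text() for a in soup.select('#mw-content-text > h2 + ul > li > a')]
print(names)
```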

CSV output

Now, let's write the Japanese actor and actress names gathered above to CSV.

This is easy with the `csv` library.

import requests
from bs4 import BeautifulSoup
import csv
import time

base_url = 'https://en.wikipedia.org/wiki/'

url_list = ['List_of_Japanese_actors', 'List_of_Japanese_actresses']

all_names = []

for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    target_html = requests.get(target_url).text
    soup = BeautifulSoup(target_html, 'html.parser')
    names = soup.select('#mw-content-text > h2 + ul > li > a')


    for name in names:
        all_names.append(name.get_text())

    time.sleep(1) 
    print('scraping page: ' + str(i + 1))

f = open('all_names.csv', 'w') 
writer = csv.writer(f, lineterminator='\n')
writer.writerow(['name'])
for name in all_names:
    writer.writerow([name])

f.close()

all_names.csv


name
Hiroshi Abe
Abe Tsuyoshi
Osamu Adachi
Jin Akanishi
...
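A variant of the same write that I find slightly safer (my addition): a `with` block closes the file even on errors, and `newline=''` is what the `csv` module's documentation recommends on Python 3 to avoid blank rows on Windows. The two sample names stand in for the scraped list:

```python
import csv

all_names = ['Hiroshi Abe', 'Jin Akanishi']  # stand-in for the scraped names

# with closes the file automatically; newline='' lets the csv module
# manage line endings itself, per the csv module docs.
with open('all_names.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    for name in all_names:
        writer.writerow([name])
```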

The following article summarizes how to use the `csv` library nicely: Reading and writing CSV with Python

That article reads CSV with `open`, which is perfectly fine, but with subsequent analysis in mind it is quite common, and what I recommend, to use `pandas` instead.

import csv

with open('all_names.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # skip the header row

    for row in reader:
        print(row)

Reading with pandas is just as short:

import pandas as pd
df = pd.read_csv('all_names.csv')
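A quick illustration of the pandas route (my addition, using an in-memory CSV via `io.StringIO` so it runs standalone without the file):

```python
import io
import pandas as pd

# io.StringIO stands in for all_names.csv so this runs without the file.
csv_text = 'name\nHiroshi Abe\nJin Akanishi\n'
df = pd.read_csv(io.StringIO(csv_text))
print(df['name'].tolist())
```

From here the data is already in a typed column, ready for the follow-on analysis, which is exactly why pandas is handy.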

Ruby

Use of API

In Ruby I use Mechanize, and parse the JSON it receives with the `json` library. This does the same thing as the Python weather API example above.

require 'mechanize'
require 'json'

citycode = '460010'
agent = Mechanize.new
page = agent.get("http://weather.livedoor.com/forecast/webservice/json/v1?city=#{citycode}")
data = JSON.parse(page.body)

puts '**************************'
puts data['title']
puts '**************************'
puts data['description']['text']

data['forecasts'].each do |forecast|
  puts '**************************'
  puts "#{forecast['dateLabel']}(#{forecast['date']})"
  puts forecast['telop']
end
puts '**************************'

As a bonus: you could also use httparty and the like (jnunemaker/httparty), but Mechanize is sufficient here.

Scraping and CSV

Basically, I think the following article is sufficient. How to use Rails scraping method Mechanize

As shown below, use `get` to fetch the page, the `search` method to extract the relevant part, and `inner_text` or `get_attribute` to pull out text and attributes.

require 'mechanize'

agent = Mechanize.new
page = agent.get("http://qiita.com")
elements = page.search('li a')

elements.each do |ele|
  puts ele.inner_text
  puts ele.get_attribute(:href)
end

This time, I will introduce data acquisition with the `post` method, which the above article does not cover, using a concrete example.

The Oracle of Bacon is a site that returns a "Bacon number" when you enter an actor's name. As a slight digression from this article's topic: the Bacon number indicates how many steps along co-star relationships it takes to reach the actor Kevin Bacon, which is interesting to think about alongside [Six Degrees of Separation](https://ja.wikipedia.org/wiki/%E5%85%AD%E6%AC%A1%E3%81%AE%E9%9A%94%E3%81%9F%E3%82%8A). As of 2011, the average number of steps separating any two Facebook users was reported to be 4.74, which shows that the world is surprisingly small.

Since the Python code above produced a CSV of Japanese actor and actress names, I would like to look up the Bacon number for each of them and write the results to CSV.

The CSV of actors and actresses is as follows.

all_names.csv


name
Hiroshi Abe
Abe Tsuyoshi
Osamu Adachi
Jin Akanishi
...

Below is the code. The point is how to use Mechanize's `post`. Also, the "Bacon number" I wanted could not be picked out of the HTML directly (it sits in untagged text), so I extracted it with a regular expression. Reference: How to use Ruby regular expressions

How to handle CSV is described in Notes for handling Ruby CSV. Since `CSV.open` can be used in the same way as `File.open`, that is what I use here.

require 'mechanize'
require 'csv'
require 'kconv'

def get_bacon_num_to(person)

  agent = Mechanize.new
  page = agent.post('http://oracleofbacon.org/movielinks.php',  { a: 'Kevin Bacon', b: person })
  main_text = page.at('#main').inner_text.toutf8
  match_result = main_text.match(/has a Bacon number of ([0-9]+)/)

  bacon_number = 0

  if match_result.nil?
    puts "#{person}: Not found."
  else
    bacon_number = match_result[1]
    puts "#{person}: #{bacon_number}"
  end

  return bacon_number

end

people = CSV.read('all_names.csv', headers: true)

CSV.open("result.csv", 'w') do |file|
  people.each do |person|
    num = get_bacon_num_to(person['name'])
    file << [person['name'], num]
    sleep(1)
  end

end
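Since this article runs both languages in parallel, here is my rough Python sketch of the same POST-plus-regex pattern (an addition, not code from the original; it assumes the same `a`/`b` form fields the Ruby code posts):

```python
import re
import requests

def extract_bacon_number(main_text):
    # Same pattern the Ruby regex matches; 0 means "not found"
    match = re.search(r'has a Bacon number of ([0-9]+)', main_text)
    return int(match.group(1)) if match else 0

def get_bacon_num_to(person):
    # POST the same form fields as the Mechanize call above
    resp = requests.post('http://oracleofbacon.org/movielinks.php',
                         data={'a': 'Kevin Bacon', 'b': person})
    return extract_bacon_number(resp.text)

# Offline check of the extraction step on sample page text:
print(extract_bacon_number('Hiroshi Abe has a Bacon number of 2.'))  # → 2
```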

At the end

There are of course other approaches, but I think the tools introduced here will cover most cases. By all means, give them a try!
