[Python] Scraping, preprocessing, and writing to PostgreSQL

I'd always wanted to analyze baseball data, but my routine was "copy the relevant part of the site, paste it into Excel, save it as CSV". That got to be a hassle, so this time I tried "scrape, preprocess, and write to PostgreSQL" instead.

Importing the required libraries

For scraping:


import pandas as pd
from bs4 import BeautifulSoup
import datetime

For writing to PostgreSQL:

import psycopg2
from sqlalchemy import create_engine

Data acquisition

This time we'll fetch the Giants' game data (dates, scores, attendance) with "pandas.read_html". It picks up the tables (table tags) on the page in order and returns them as a list of DataFrames.

raw_data = pd.read_html("https://baseball-freak.com/audience/giants.html", flavor="bs4")

Since a list is returned, the result looks like this: (screenshot: scraping result)

The first DataFrame looks like this: (screenshot: the first DataFrame)
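To see how many tables pandas found, and to peek at the first one:

print(len(raw_data))   # number of <table> elements pandas found on the page
raw_data[0].head()     # the first table, as a DataFrame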

The second one should contain the data we want this time, so let's check it.

data_set = raw_data[1]  # There are multiple tables on the page; this time we want the second one, so we specify [1]
data_set

(screenshot: data_set)

Check the contents of the data frame

Check the contents of the acquired data.

pd.concat(
    [data_set.count().rename('number'),
     data_set.dtypes.rename('Data type'),
     data_set.nunique().rename('Unique number'),
     data_set.isnull().sum().rename('Number of nulls'),
     (data_set.isnull().sum() * 100 / data_set.shape[0]).rename('Missing rate (%)').round(2)],
    axis=1
)

(screenshot: data check)

You can see that every column is of type object. Also, the starting pitcher, number of spectators, and game time columns appear to contain nulls.

Of these, we'll convert the types so that the dates, scores, and spectator counts can be analyzed.

Date processing

- Split the date column into Month and Day columns
- Convert the date column to a date type
- Compute the day of the week with "dt.dayofweek"


data_set["Month"] = data_set["date"].replace("Month.*$", "", regex=True) # regex:Regular expressions
data_set["Day"] = data_set["Day付"].replace("Day.*$", "", regex=True).replace("^.*Month", "", regex=True)

for i in range(len(data_set['Month'])):
    data_set['date'] = "2019/" +  data_set["Month"] + "/" + data_set["Day"]
    
data_set['date'] = pd.to_datetime(data_set['date'])
data_set['Day of the week (number)'] = data_set['date'].dt.dayofweek
data_set.head()

(screenshot: date columns)
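Incidentally, "dt.dayofweek" returns 0 for Monday through 6 for Sunday. If you'd rather see a readable label, pandas also has "dt.day_name()" (the extra column name below is just for illustration):

data_set['Day of the week (name)'] = data_set['date'].dt.day_name()  # e.g. 'Friday'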

Processing the number of spectators

--Replace "," and "people" --Fill NaN with 0 with fillna --Change type to "int"

data_set["Number of spectators"] = data_set['Number of spectators'].replace(',', '', regex=True).replace('Man', '', regex=True).fillna('0').astype(int)
data_set

(screenshot: number of spectators)
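To make each step of that chain visible, here it is applied to a tiny hand-made Series, assuming the cells on the site look like "46,256人":

sample = pd.Series(['46,256人', None])
print(sample.replace(',', '', regex=True).replace('人', '', regex=True).fillna('0').astype(int))
# 0    46256
# 1        0
# dtype: int64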

Score processing

- Split the score in two with str.split
- Since this produces two new columns, name them and convert them to int

data_set_score = pd.concat([data_set, data_set['Score'].str.split(' - ', expand=True)], axis=1)
data_set_score_rename = data_set_score.rename(columns={0: 'score', 1: 'Conceded'})
data_set_score_rename['score'] = data_set_score_rename['score'].replace('中止', 0).astype(int)  # 中止 = cancelled game
data_set_score_rename['Conceded'] = data_set_score_rename['Conceded'].fillna(0).astype(int)
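As a quick check that both new columns really ended up numeric:

print(data_set_score_rename[['score', 'Conceded']].dtypes)  # expect int64 for both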

Connect to PostgreSQL


connection_config = {
    'user': '*****',
    'password': '*****',
    'host': '*****',
    'port': '****', # can be omitted
    'database': '*****'
}

engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{database}'.format(**connection_config))
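Before writing anything, it doesn't hurt to sanity-check the connection (this assumes the server is reachable with the credentials above):

with engine.connect() as connection:
    print("Connected:", not connection.closed)  # True if the connection opened successfully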

Save to PostgreSQL

data_set_score_rename.to_sql('giants', schema='baseball_2019', con=engine, if_exists='replace', index=False)
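To confirm the write, one option is to read the table straight back with pandas:

check = pd.read_sql('SELECT * FROM baseball_2019.giants', con=engine)
check.head()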

Personally, this is where I stumbled: I didn't know how to specify the schema name. Reading the official documentation cleared it up right away, but getting there took about an hour. When you don't know something, you really don't know it.
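One thing worth knowing: the "schema" argument of "to_sql" only selects an existing schema; it does not create one. If "baseball_2019" doesn't exist yet, create it first, for example:

from sqlalchemy import text

with engine.begin() as conn:  # begin() opens a transaction and commits on success
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS baseball_2019"))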

That's it! If you find any mistakes in the description, please kindly let me know!
