I had always wanted to analyze baseball data, but my routine was to copy the necessary parts of a site, paste them into Excel, and save the result as a CSV. That got tedious, so this time I tried scraping the data, preprocessing it, and writing it to PostgreSQL instead.
import pandas as pd
from bs4 import BeautifulSoup  # parser used by read_html when flavor="bs4"
import datetime
import psycopg2  # PostgreSQL driver used by SQLAlchemy
from sqlalchemy import create_engine
This time, we will get the Giants' game and attendance data with "pandas.read_html". It fetches the tables (table tags) on the page in order and returns a list of DataFrames.
raw_data = pd.read_html("https://baseball-freak.com/audience/giants.html", flavor="bs4")
Since a list of DataFrames is returned, let's check what came back, starting with the first one.
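As a quick check (a small sketch of my own, not in the original code), you can see how many tables were found and peek at the first one:
print(len(raw_data))  # how many tables read_html found on the page
raw_data[0].head()    # peek at the first table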
The second DataFrame should contain the data we want this time, so let's check it.
data_set = raw_data[1]  # the page has multiple tables; we want the second one, so specify [1]
data_set
Check the contents of the acquired data.
pd.concat(
    [data_set.count().rename('Count'),
     data_set.dtypes.rename('Data type'),
     data_set.nunique().rename('Unique count'),
     data_set.isnull().sum().rename('Number of nulls'),
     (data_set.isnull().sum() * 100 / data_set.shape[0]).rename('Percentage of missing values (%)').round(2)],
    axis=1
)
You can see that every column is of type object. Also, the starting pitcher, number of spectators, and game time columns appear to contain nulls.
Next, we convert types so that the dates, scores, and number of spectators can actually be analyzed.
- Separate the date column into month and day columns
- Convert the date column to a date type
- Calculate the day of the week with "dt.dayofweek"
data_set["Month"] = data_set["date"].replace("Month.*$", "", regex=True) # regex:Regular expressions
data_set["Day"] = data_set["Day付"].replace("Day.*$", "", regex=True).replace("^.*Month", "", regex=True)
for i in range(len(data_set['Month'])):
data_set['date'] = "2019/" + data_set["Month"] + "/" + data_set["Day"]
data_set['date'] = pd.to_datetime(data_set['date'])
data_set['Day of the week (number)'] = data_set['date'].dt.dayofweek
data_set.head()
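If you also want a human-readable label, one option (my own addition; the 'Day of the week' column name is hypothetical) is to map the number returned by dt.dayofweek, where Monday is 0:
weekday_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']  # dt.dayofweek: Monday = 0
data_set['Day of the week'] = data_set['Day of the week (number)'].map(lambda x: weekday_names[x])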
--Replace "," and "people" --Fill NaN with 0 with fillna --Change type to "int"
data_set["Number of spectators"] = data_set['Number of spectators'].replace(',', '', regex=True).replace('Man', '', regex=True).fillna('0').astype(int)
data_set
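As a quick sanity check (not in the original), describe() should now show numeric statistics; note that games with missing attendance are counted as 0 here:
data_set['Number of spectators'].describe()  # column should now be int, with missing values filled as 0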
- Split the score into two columns with str.split
- Since this creates two new columns, give them names and convert them to int
data_set_score = pd.concat([data_set, data_set['Score'].str.split(' - ', expand=True)], axis=1)
data_set_score_rename = data_set_score.rename(columns={0: 'score', 1: 'Conceded'})
data_set_score_rename['score'] = data_set_score_rename['score'].replace('中止', 0).astype(int)  # "中止" = canceled game, so treat it as 0
data_set_score_rename['Conceded'] = data_set_score_rename['Conceded'].fillna(0).astype(int)
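Just as a check (my own example; the 'Run difference' column is hypothetical), the two new int columns can now be used in calculations, for instance a run differential:
data_set_score_rename['Run difference'] = data_set_score_rename['score'] - data_set_score_rename['Conceded']
data_set_score_rename[['Score', 'score', 'Conceded', 'Run difference']].head()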
connection_config = {
'user': '*****',
'password': '*****',
'host': '*****',
'port': '****', # can be omitted if you use the default port
'database': '*****'
}
engine = create_engine('postgresql://{user}:{password}@{host}:{port}/{database}'.format(**connection_config))
data_set_score_rename.to_sql('giants',schema='baseball_2019', con=engine, if_exists='replace', index=False)
Personally, this is where I stumbled. What tripped me up was that I didn't know how to specify the schema name. Reading the official documentation solved it quickly, but it still took about an hour. When you don't know something, you really don't know.
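To confirm the table really ended up under the intended schema, one way (a sketch, assuming the same engine and names as above) is to read a few rows straight back:
check = pd.read_sql('SELECT * FROM baseball_2019.giants LIMIT 5', con=engine)  # read back to verify the write
check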
That's it! If you find any mistakes in the description, please kindly let me know!