[PYTHON] Solving Exercises in GCI Data Scientist Training Course Chapter7

GCI Data Scientist Training Course

The "GCI Data Scientist Training Course" is offered by the University of Tokyo (Matsuo Laboratory), and the exercise parts of its "Practical Data Scientist Training Course and Deep Learning Course" are published in Jupyter Notebook format (CC-BY-NC-ND). Chapter 7 is "** Data Visualization Using Matplotlib **", which covers bar charts, pie charts, bubble charts, and candlestick charts in addition to the scatter plots, line charts, and histograms learned in Chapter 2. In place of pressing a "Like" button for this valuable teaching material that can be studied in Japanese, I am posting my solutions to the exercises. Please point out any mistakes.

The comprehensive problem at the end of the chapter was extremely difficult. I am not even sure I understood the intent of the problems in the first place ...

Chapter7 Data visualization using Matplotlib

7.1 Data visualization

7.1.1 Basics of data visualization

** <Practice 1> ** Using the student data (student-mat.csv) handled in earlier chapters, draw a pie chart of the reasons for choosing a school and calculate the percentage of each reason.

import pandas as pd
import matplotlib.pyplot as plt

student_data_math = pd.read_csv("student-mat.csv", sep=";")

allCount = len(student_data_math["reason"])
reasonCount = student_data_math["reason"].value_counts()
reasonRatio = reasonCount / allCount
plt.pie(reasonRatio.values, labels=reasonRatio.index, autopct='%1.1f%%', startangle=90, counterclock=False)
plt.axis('equal')

[Output image: 711-1]
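As an aside, the manual division by the total count can be replaced by value_counts(normalize=True), which returns proportions directly. A minimal sketch on hypothetical toy data (the reason values below are made up):

```python
import pandas as pd

# Toy data mimicking the "reason" column (values here are hypothetical)
s = pd.Series(["course", "home", "course", "reputation"])

# normalize=True returns proportions directly,
# replacing the manual division by len()
ratio = s.value_counts(normalize=True)
print(ratio["course"])  # 0.5
```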

** <Practice 2> ** With the same data, display the average of the final math grade G3 in a bar graph, grouped by higher (whether the student wants to go on to higher education; binary: yes or no). Can you infer anything from the result?

import pandas as pd
import matplotlib.pyplot as plt

student_data_math = pd.read_csv("student-mat.csv", sep=";")

x = student_data_math["higher"].unique()
y = []
for x_ in x:
    y.append(student_data_math["G3"][student_data_math["higher"] == x_].mean())

plt.bar(x, y)

[Output image: 711-2]

** <Practice 3> ** With the same data, display the average of the final math grade G3 in a horizontal bar graph with traveltime on the axis. Can you infer anything?

import pandas as pd
import matplotlib.pyplot as plt

student_data_math = pd.read_csv("student-mat.csv", sep=";")

x = student_data_math["traveltime"].unique()
y = []
for x_ in x:
    y.append(student_data_math["G3"][student_data_math["traveltime"] == x_].mean())

plt.barh(x,y)

[Output image: 711-3]

7.1.2 Application: Visualization of financial data

7.2 Let's think about how to show the analysis results

7.2.1 Points for creating materials

7.3 Comprehensive problem

** 7.3.1 Time Series Data Analysis ** Here, let's work with time series data using the pandas and scipy techniques learned so far.

(1) (Acquiring and checking the data) Download dow_jones_index.zip from the site below, read the data from the dow_jones_index.data file inside, and display the first 5 rows. Also check the column information of the data and whether it contains any nulls.

https://archive.ics.uci.edu/ml/machine-learning-databases/00312/dow_jones_index.zip

# (1)
# import requests, zipfile
# from io import StringIO
# import io
# zip_file_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00312/dow_jones_index.zip"
# r = requests.get(zip_file_url, stream=True)
# z = zipfile.ZipFile(io.BytesIO(r.content))
# z.extractall()

import pandas as pd

data = pd.read_csv("dow_jones_index.data")
data.info()  # column dtypes and non-null counts (null check)
data[:5]

[Output image: 731-1]

(2) (Data preparation) The open, high, low, and close columns have a $ sign in front of the number, so remove it. Also, if the date column was not read as a datetime type, convert it.

# (2)
data.open = data.open.str.strip("$").astype(float)
data.high = data.high.str.strip("$").astype(float)
data.low = data.low.str.strip("$").astype(float)
data.close = data.close.str.strip("$").astype(float)
data.date = pd.to_datetime(data.date)
data[:5]

[Output image: 731-2]
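The $-stripping step above can be sketched on toy data; the prices below are made up for illustration:

```python
import pandas as pd

# Toy column with the same "$" prefix as the open/high/low/close columns
prices = pd.Series(["$15.82", "$16.72", "$15.78"])

# str.strip("$") removes the dollar sign; astype(float) converts the rest
cleaned = prices.str.strip("$").astype(float)
print(cleaned.tolist())  # [15.82, 16.72, 15.78]
```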

(3) For the close column, calculate summary statistics for each stock.

# (3)
data.groupby("stock").close.describe()

[Output image: 731-3]

(4) For the close column, compute a correlation matrix of the stocks against one another. Also, draw a heatmap of the correlation matrix using seaborn's heatmap. (Hint: use pandas corr().)

# (4)
# The wording is a bit terse here. I interpreted it as: pivot the data so
# that each stock's "close" series becomes a column indexed by "date",
# then look at how the stocks' "close" series correlate with one another.
import seaborn as sns

pv = data.pivot(index="date", columns="stock", values="close")
cr = pv.corr()
sns.heatmap(cr, square=True, vmax=1, vmin=-1, center=0)

[Output image: 731-4]

(5) From the correlation matrix computed in (4), extract the combination of stocks with the highest correlation coefficient, and draw a time series graph of each stock in that pair.

# (5)
# "Combination" and "pair" appear to mean the same thing here.
cr = data.pivot(index="date", columns="stock", values="close").corr()
# Exclude the diagonal (self-correlation = 1), then locate the maximum
cr[cr == cr[cr < 1].max().max()]
# --
plt.plot(pv.index, pv.CSCO, label="CSCO")
plt.plot(pv.index, pv.MSFT, label="MSFT")
plt.xlabel("date")
plt.ylabel("close")
plt.legend()

[Output image: 731-5-1]

[Output image: 731-5-2]
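The most-correlated pair could also be extracted programmatically instead of being read off the matrix. A sketch on a toy correlation matrix (the stock names and prices below are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy close prices for three hypothetical stocks
pv_toy = pd.DataFrame({
    "AAA": [1.0, 2.0, 3.0, 4.0],
    "BBB": [2.0, 4.1, 5.9, 8.0],   # nearly proportional to AAA
    "CCC": [4.0, 1.0, 3.0, 2.0],
})
cr_toy = pv_toy.corr()

# Mask the diagonal (self-correlation = 1), then take the arg-max;
# stack() drops the NaN diagonal and returns (row, column) labels
masked = cr_toy.where(~np.eye(len(cr_toy), dtype=bool))
pair = masked.stack().idxmax()
print(pair)  # ('AAA', 'BBB')
```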

(6) Using pandas rolling (a window function), calculate the moving average time series of close over the last 5 periods (5 weeks) for each stock used above.

# (6)
pv.rolling(5).mean().dropna()

[Output image: 731-6]
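What rolling(5).mean() does can be seen on a toy series; the numbers below are made up:

```python
import pandas as pd

# Toy weekly closes; rolling(5).mean() averages the last 5 observations,
# and the first 4 positions become NaN (not enough history), so dropna()
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
ma = s.rolling(5).mean().dropna()
print(ma.tolist())  # [3.0, 4.0]
```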

(7) Using pandas shift(), calculate the logarithmic time series of the ratio of close to the previous period (1 week earlier) for each stock used above. Then extract the stocks with the highest and lowest volatility (standard deviation) and draw a graph of their logarithmic rates of change.

# (7)
import numpy as np

pv_lgrt = np.log(pv / pv.shift()).dropna()
print("Max stock")
print(pv_lgrt.std()[pv_lgrt.std() == pv_lgrt.std().max()])
print("Min stock")
print(pv_lgrt.std()[pv_lgrt.std() == pv_lgrt.std().min()])
# --
plt.plot(pv_lgrt.index, pv_lgrt.CSCO, label="CSCO")
plt.plot(pv_lgrt.index, pv_lgrt.KO, label="KO")
plt.legend()
> Max stock
> CSCO    0.041205
> dtype: float64
> Min stock
> KO    0.01623
> dtype: float64

[Output image: 731-7]
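The shift() mechanics behind the log-return calculation can be sketched on a toy price series (the prices below are made up):

```python
import numpy as np
import pandas as pd

# shift() moves values down one period, so p / p.shift() is the
# period-over-period ratio; its log is the log rate of change
p = pd.Series([100.0, 110.0, 99.0])
lgrt = np.log(p / p.shift()).dropna()  # first entry is NaN, so drop it
print(lgrt.round(4).tolist())  # [0.0953, -0.1054]
```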

** 7.3.2 Marketing Analysis ** The following is purchasing data often used in marketing analysis. Although it is corporate purchasing data rather than data from general consumers, the axes of analysis are basically the same.

(1) Read the data with pandas from the URL below (this takes some time, as the data is relatively large at over 500,000 rows).

"http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"

(Hint) Use pd.ExcelFile and specify the sheet with .parse('Online Retail').

Also, since the analysis targets only records that have a CustomerID, filter accordingly. In addition, an InvoiceNo with a C in front of the number indicates a cancellation, so remove those records as well. If anything else needs to be removed, handle it appropriately. The analysis below is based on this cleaned data.

# (1)
import requests
import pandas as pd

# xlsx_file_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
# r = requests.get(xlsx_file_url, stream=True)
# with open("Online%20Retail.xlsx", 'wb') as saveFile:
#     saveFile.write(r.content)

book = pd.ExcelFile("Online%20Retail.xlsx")
data = book.parse("Online Retail")
# --
data_ = data.dropna(subset = ["CustomerID"])
mask = data_['InvoiceNo'].str.startswith('C', na=False)
data_ = data_[~mask]

data_.head()

[Output image: 732-1]
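The cancellation filter above relies on str.startswith with na=False, which treats missing values as non-matches. A sketch on toy invoice numbers (the values below are made up):

```python
import pandas as pd

# Toy invoice numbers; a leading "C" marks a cancellation.
# na=False makes missing values evaluate to False instead of NaN,
# so ~mask keeps them rather than raising or dropping silently.
inv = pd.Series(["536365", "C536379", "536380", None])
mask = inv.str.startswith("C", na=False)
kept = inv[~mask]
print(kept.tolist())  # ['536365', '536380', None]
```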

(2) The columns of this data include the purchase date and time, product name, quantity, unit price, purchaser ID, and so on. Here, find the number of unique purchasers (CustomerID), the number of baskets (unique InvoiceNo), and the number of product types (unique counts based on both StockCode and Description).

# (2)
print("CustomerID unique:", data_["CustomerID"].nunique())
print("InvoiceNo unique:", data_["InvoiceNo"].nunique())
print("StockCode unique:", data_["StockCode"].nunique())
print("Description unique:", data_["Description"].nunique())
> CustomerID unique: 4339
> InvoiceNo unique: 18536
> StockCode unique: 3665
> Description unique: 3877

(3) This data has a Country column. Using it as the axis, calculate the total purchase amount (unit price x quantity) of each country, sort in descending order, and display the top 5 countries.

# (3)
data_["price"] = data_["Quantity"] * data_["UnitPrice"]
data_[["price", "Country"]].groupby("Country").sum().sort_values("price").iloc[-1:-6:-1]

[Output image: 732-3]
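The groupby-sum-sort pattern above can be sketched on toy data (countries and amounts below are made up):

```python
import pandas as pd

# Toy purchase lines: price = Quantity * UnitPrice, summed per country
df = pd.DataFrame({
    "Country": ["UK", "France", "UK", "EIRE"],
    "Quantity": [2, 1, 3, 4],
    "UnitPrice": [5.0, 10.0, 2.0, 1.0],
})
df["price"] = df["Quantity"] * df["UnitPrice"]

# sort_values(ascending=False) is an alternative to the reversed iloc slice
top = df.groupby("Country")["price"].sum().sort_values(ascending=False)
print(top.index.tolist())  # ['UK', 'France', 'EIRE']
```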

(4) For the top 5 countries above, graph the monthly trend of product sales (total amount) in each country, with a separate graph per country.

# (4)
top5countries = data_[["price", "Country"]].groupby("Country").sum().sort_values("price").iloc[-1:-6:-1].index.to_list()
data_top5countries = data_[data_["Country"].isin(top5countries)]

data_dict = {}
i = 0
f, axs = plt.subplots(5, 1, figsize=(6, 10))
for country, df in data_top5countries.groupby("Country"):
    df.index = df.InvoiceDate
    df_ = df.resample("M").sum()  # monthly totals
    axs[i].plot(df_.index, df_.price, label=country)
    axs[i].legend(loc="best")
    data_dict[country] = df  # keep per-country data for use in (5)
    i += 1
plt.tight_layout()

[Output image: 732-4]
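The resample("M") step buckets a DatetimeIndex into month-end periods. A sketch on toy dated sales (the dates and amounts below are made up):

```python
import pandas as pd

# Toy sales with a DatetimeIndex; resample("M") groups by month-end
idx = pd.to_datetime(["2011-01-05", "2011-01-20", "2011-02-03"])
sales = pd.Series([10.0, 5.0, 7.0], index=idx)
monthly = sales.resample("M").sum()
print(monthly.tolist())  # [15.0, 7.0]
```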

(5) For each of the top 5 countries above, extract the top 5 products by sales and make a pie chart per country. Total the products based on "Description".

# (5)
i=0
f, axs = plt.subplots(5,1,figsize=(5,15))
for country in top5countries:
    top5descriptionData = data_dict[country][["price", "Description"]].groupby("Description").sum().sort_values("price").iloc[-1:-6:-1]
    print(country, "\n", top5descriptionData, "\n")
    axs[i].pie(top5descriptionData["price"], labels=top5descriptionData.index.to_list(), autopct='%1.1f%%', startangle=90, counterclock=False)
    axs[i].title.set_text(country)
    i+=1
> United Kingdom 
>                                          price
> Description                                  
> PAPER CRAFT , LITTLE BIRDIE         168469.60
> REGENCY CAKESTAND 3 TIER            110990.20
> WHITE HANGING HEART T-LIGHT HOLDER   94858.60
> MEDIUM CERAMIC TOP STORAGE JAR       80291.44
> JUMBO BAG RED RETROSPOT              77371.57 
> 
> Netherlands 
>                                         price
> Description                                 
> RABBIT NIGHT LIGHT                   9568.48
> ROUND SNACK BOXES SET OF4 WOODLAND   7991.40
> SPACEBOY LUNCH BOX                   7485.60
> DOLLY GIRL LUNCH BOX                 6828.60
> ROUND SNACK BOXES SET OF 4 FRUITS    4039.20 
> 
> EIRE 
>                                   price
> Description                           
> REGENCY CAKESTAND 3 TIER       7388.55
> CARRIAGE                       4875.00
> 3 TIER CAKE TIN RED AND CREAM  4235.65
> Manual                         3374.34
> JAM MAKING SET WITH JARS       2976.00 
> 
> Germany 
>                                          price
> Description                                  
> POSTAGE                              21001.00
> REGENCY CAKESTAND 3 TIER              9061.95
> ROUND SNACK BOXES SET OF4 WOODLAND    3598.95
> Manual                                2296.25
> ROUND SNACK BOXES SET OF 4 FRUITS     1982.40 
> 
> France 
>                                    price
> Description                            
> POSTAGE                        15454.00
> Manual                          9492.37
> RABBIT NIGHT LIGHT              7234.24
> REGENCY CAKESTAND 3 TIER        2816.85
> RED TOADSTOOL LED NIGHT LIGHT   2130.15 

[Output image: 732-5]
