I tried to analyze J League data with Python

This time, I downloaded J League data from a soccer data site called FootyStats and examined it with Python. Since the CSV data for the 2020 season was not yet available for download, I am using the 2019 data. The site covers not only the J League but also leagues around the world, so it is interesting just to browse.

For the J League (J1, J2, J, Cup match), match data, team data, and player data can each be downloaded. (The standings are taken from the J.LEAGUE Data Site.)

#Import various libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

2019 standings

Since the FootyStats data has no points column, I read the standings table directly from the J.LEAGUE Data Site URL and displayed it.

#Capture data
j1_rank = 'https://data.j-league.or.jp/SFRT01/?search=search&yearId=2019&yearIdLabel=2019%E5%B9%B4&competitionId=460&competitionIdLabel=%E6%98%8E%E6%B2%BB%E5%AE%89%E7%94%B0%E7%94%9F%E5%91%BD%EF%BC%AA%EF%BC%91%E3%83%AA%E3%83%BC%E3%82%B0&competitionSectionId=0&competitionSectionIdLabel=%E6%9C%80%E6%96%B0%E7%AF%80&homeAwayFlg=3'
j1_rank = pd.read_html(j1_rank)
df_rank = pd.DataFrame(j1_rank[0])
df_rank.index = df_rank.index + 1
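#Columns: team, points, wins, draws, losses, goals scored, goals conceded, goal difference
#(assumes the table keeps the site's Japanese headers)
df_rank[['チーム', '勝点', '勝', '分', '敗', '得点', '失点', '得失点差']]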

[Screenshot: 2019 J1 standings table]
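
As an aside, the missing points column could also be derived from the results themselves, since the J League awards 3 points for a win and 1 for a draw. The following is only a minimal sketch: the function name `league_points` and the sample record are made up for illustration and are not read from the FootyStats CSV.

```python
def league_points(wins: int, draws: int) -> int:
    """Return league points under the 3-1-0 rule (win=3, draw=1, loss=0)."""
    return 3 * wins + draws


# Illustrative record (not read from the data): 22 wins and 4 draws
print(league_points(22, 4))  # 70
```

With a real DataFrame the same formula could be applied to whole columns at once, assuming the CSV has win and draw columns.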

Take a quick look at the basic statistics

df_rank.describe()

The mean is 47 points and the median is 46.5 points, so the two are almost identical. In the standings, the teams with 46 to 47 points sit in the middle of the table or slightly above it, around 6th to 10th place.

[Screenshot: output of df_rank.describe()]
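
To make the mean and median comparison concrete, here is a small sketch using made-up points totals (they are not the 2019 table). It also shows that the `50%` row reported by `describe()` is the median.

```python
import pandas as pd

# Made-up points totals for six teams (illustrative only)
points = pd.Series([70, 63, 47, 46, 36, 31])

mean_points = points.mean()
median_points = points.median()
summary = points.describe()

print(mean_points, median_points)
# The '50%' entry of describe() is the same value as median()
print(summary['50%'] == median_points)
```

When the mean and the median are close, as in the real table, the distribution is not being pulled strongly to one side by a few very high or very low totals.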

Read FootyStats data

Load the CSV data for this analysis. The team data alone has 293 columns, so it is quite hard to check what kind of data is available.

df_team = pd.read_csv('j1_team_2019.csv')
pd.set_option('display.max_columns', None)
df_team.head(6)

[Screenshot: first six rows of df_team]

len(df_team.columns)
#Number of columns: 293

Take a quick look at what kind of data you have

You can also display the names with df.columns, but I personally find them easier to read when printed one per line with a for loop.

#Iterating over a DataFrame yields its column names
for column in df_team:
    print(column)

[Screenshots: printed list of column names]

Not all columns are shown here, but the data appears to include very detailed items, such as the scoring rate in the first and second halves of a game.
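
With this many columns, filtering the names by keyword can be easier than scrolling through all of them. The sketch below works on a plain list of names; `columns_containing` is a hypothetical helper, and `goals_scored_home` is an assumed name used only for illustration (the other names appear in this post).

```python
# Sample column names; 'goals_scored_home' is assumed for illustration
columns = ['team_name', 'wins', 'wins_home', 'goals_scored',
           'goals_conceded', 'goals_scored_home']


def columns_containing(names, keyword):
    """Return the column names that contain the given keyword."""
    return [name for name in names if keyword in name]


print(columns_containing(columns, 'home'))
# ['wins_home', 'goals_scored_home']
```

With the real data, `df_team.columns` could be passed in place of the sample list.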

Win rate

I summarized the win rates of the J1 teams. To see which clubs are strong at home, I wanted to know how many of each team's wins came at home, so I added a "home share of wins" column and sorted the table by it in descending order.

#Win rate (34 league games per team)
df_team['wins_rate'] = df_team['wins'] / 34
#Home win rate (17 home games per team)
df_team['home_wins_rate'] = df_team['wins_home'] / 17
#Share of wins that came at home
df_team['wins_at_home'] = df_team['wins_home'] / df_team['wins']
#Keep the summary in its own DataFrame so that df_team retains all columns for the later plots
df_home = df_team[['team_name', 'wins', 'wins_home', 'wins_rate', 'home_wins_rate', 'wins_at_home']].sort_values('wins_at_home', ascending=False).reset_index(drop=True)
df_home.index = df_home.index + 1
df_home.rename(columns={'team_name': 'Club name', 'wins': 'Wins', 'wins_home': 'Home wins', 'wins_rate': 'Win rate', 'home_wins_rate': 'Home win rate', 'wins_at_home': 'Home share of wins'})

[Screenshot: win rate table sorted by home share of wins]

Nagoya won 9 games over the season, and 7 of them (about 78%) were at home. Sendai's home share is also high. Further down the table, Kawasaki, who have been competing for the title in recent years, were surprisingly low.
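
The Nagoya figure can be checked by hand. The sketch below recomputes the three rates for a single team, using the 34 games and 17 home games assumed in the code above; `win_rates` is a hypothetical helper name, not part of the original notebook.

```python
def win_rates(wins, wins_home, games=34, home_games=17):
    """Return overall win rate, home win rate, and home share of wins."""
    return {
        'wins_rate': wins / games,
        'home_wins_rate': wins_home / home_games,
        # A team with no wins has no meaningful home share
        'wins_at_home': wins_home / wins if wins else float('nan'),
    }


# Nagoya in this post: 9 wins, 7 of them at home
rates = win_rates(9, 7)
print(round(rates['wins_at_home'], 2))  # 0.78
```

Note that the home share of wins says nothing about how often a team wins overall: a club with few wins, most of them at home, still ranks high on this measure.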

Correlation between goals scored and the number of wins

Obviously, the more goals a team scores, the more games it tends to win, but let's look at the correlation between goals scored and the number of wins. Goals scored are plotted on the horizontal axis and wins on the vertical axis. As expected, there is a positive correlation.

df = df_team
plt.scatter(df['goals_scored'], df['wins'])
plt.xlabel('goals_scored')
plt.ylabel('wins')

[Screenshot: scatter plot of goals scored vs. wins]

Display the team name and take a look

for i, txt in enumerate(df.team_name):
    plt.annotate(txt, (df['goals_scored'].values[i], df['wins'].values[i]))
    print(txt)

plt.scatter(df['goals_scored'], df['wins'])
plt.xlabel('goals_scored')
plt.ylabel('wins')
plt.show()

[Screenshot: scatter plot of goals scored vs. wins with team names]

Correlation between the number of goals conceded and the number of wins

Conversely, let's look at the correlation between goals conceded and the number of wins. There appears to be a (negative) correlation here as well, but it does not look as strong as the one between goals scored and wins.

df = df_team
plt.scatter(df['goals_conceded'], df['wins'])
plt.xlabel('goals_conceded')
plt.ylabel('wins')

[Screenshot: scatter plot of goals conceded vs. wins]

for i, txt in enumerate(df.team_name):
    plt.annotate(txt, (df['goals_conceded'].values[i], df['wins'].values[i]))
    print(txt)

plt.scatter(df['goals_conceded'], df['wins'])
plt.xlabel('goals_conceded')
plt.ylabel('wins')
plt.show()

[Screenshot: scatter plot of goals conceded vs. wins with team names]

Correlation coefficient

Let's compute the correlation coefficient between goals scored and wins, and between goals conceded and wins.

Correlation coefficient between goals scored and wins

wins = df['wins']
goals_scored = df['goals_scored']
r = np.corrcoef(wins, goals_scored)
r
#Correlation coefficient (off-diagonal element of the returned 2x2 matrix): 0.7184946

Correlation coefficient between goals conceded and wins

wins = df['wins']
goals_conceded = df['goals_conceded']
r = np.corrcoef(wins, goals_conceded)
r
#Correlation coefficient (off-diagonal element of the returned 2x2 matrix): -0.58795491

The correlation coefficient between goals scored and wins is fairly high at about 0.72. The correlation coefficient between goals conceded and wins is about -0.59 (absolute value 0.59), so the two are correlated, but not as strongly as goals scored and wins.
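
For reference, `np.corrcoef` returns a 2x2 correlation matrix rather than a single number, and the coefficient itself is the off-diagonal element. The sketch below uses made-up values (not the 2019 J1 data) and checks the NumPy result against the definition of Pearson's r.

```python
import numpy as np

# Made-up values for illustration (not the 2019 J1 data)
goals_scored = np.array([68, 57, 54, 45, 39, 31])
wins = np.array([22, 18, 16, 12, 10, 7])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_matrix = np.corrcoef(goals_scored, wins)
r_numpy = r_matrix[0, 1]

# Same value from the definition: sum of products of deviations
# divided by the square root of the product of summed squared deviations
dx = goals_scored - goals_scored.mean()
dy = wins - wins.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_numpy, r_manual)
```

Because r is symmetric, the order of the two arguments does not change the coefficient.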

I am analyzing various other things as well, so I may add to this post. Also, once the data for the 2020 season becomes available for download, I plan to look at that season too. Due to the impact of COVID-19, the schedule became congested and the substitution rules changed, so I would like to compare how things differed from a normal season.
