[Python] [Spotify] Looking back on 2020 with playlists - Part 2: EDA (basic statistics) and data preprocessing

Introduction

The Spotify API returns analyzed musical parameters for each track. We will fetch a list of tracks that includes those data and analyze the trends of the songs I listened to in 2020. (Even though the new year has already arrived...)


Operating environment

Things to prepare in advance

- Spotify playlist CSV: see the Part 1 article
- Exploratory installed: you can install it for free

things to do

  1. EDA: understanding the data
  2. Data preprocessing ①: add a new column holding label values
  3. Data preprocessing ②: type conversion (duration in milliseconds → h:M:S)

What is EDA

EDA stands for Exploratory Data Analysis. It is the very first phase of data analysis: the goal is to touch the data, visualize it, look for patterns, and understand the features, the target, and the relationships and correlations between them.

Why do you need it?

Before starting the analysis, it is important to first understand "what kind of data set you are dealing with".

Feature engineering is often required to build more advanced machine learning models and to solve difficult problems, and that requires deep knowledge and understanding of the data. This is also the stage at which you learn which columns need preprocessing.

How to do

The next step can be done with pandas, but if you are a Python beginner you may get stuck writing code. Since it helps to first get a feel for the data visually, I will introduce a method that works without writing any code.
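For reference, if you do want to try the pandas route, the same first look takes only a few lines. A minimal sketch on a toy DataFrame (the column names follow the Spotify audio features; the values are made up):

```python
import pandas as pd

# Hypothetical playlist data with a few Spotify audio-feature columns
df = pd.DataFrame({
    "danceability": [0.70, 0.55, 0.80],
    "energy": [0.90, 0.40, 0.75],
    "loudness": [-5.2, -10.1, -6.3],
})

# Summary statistics per column: count, mean, std, min, quartiles, max
print(df.describe())

# Number of missing values per column
print(df.isnull().sum())
```

`describe()` and `isnull().sum()` together give roughly the same overview that the GUI tool displays on import.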

Try to capture the data

The tool to use is Exploratory.

Simply import the CSV data and the summary statistics for each column are displayed, including the number of missing values per item, as shown below. Very convenient!


Correlations between column values can also be visualized through the GUI. You can see the positive correlation between loudness and energy.

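The same correlation can be computed numerically in pandas. A sketch on made-up values that mimic the loudness/energy relationship (louder tracks tend to be more energetic):

```python
import pandas as pd

# Toy data: loudness in dB (higher = louder) vs. energy (0-1)
df = pd.DataFrame({
    "loudness": [-12.0, -9.0, -6.0, -4.0, -2.0],
    "energy":   [0.20, 0.35, 0.60, 0.75, 0.90],
})

# Pearson correlation coefficient between the two columns
corr = df["loudness"].corr(df["energy"])
print(corr)  # close to 1.0: strong positive correlation
```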

What is preprocessing

The reasons why preprocessing is necessary include the following:

- Machine learning models must be given numerical data rather than string data
- Similarly, data with missing values (null) cannot be passed to a machine learning model without conversion
- Excluding outlier records improves accuracy, etc.

For example, what?

- Machine learning models must be given numerical data rather than string data
  - Example: numeric values (0, 1, 2, ...) instead of strings (day of the week: Mon, Tue, Wed, ...)

- Exclude outlier records to improve accuracy
  - Example: check whether there is a long, silent hidden track
  - Example: check whether any song's tempo (BPM) was detected at double speed
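The outlier checks above can be sketched as simple boolean filters. The threshold (10 minutes) and the data are assumptions for illustration:

```python
import pandas as pd

# Hypothetical track table; column names follow the Spotify audio features
df = pd.DataFrame({
    "name": ["Track A", "Track B", "hidden silence"],
    "duration_ms": [210_000, 185_000, 1_200_000],  # last one is ~20 minutes
    "tempo": [120.0, 95.0, 60.0],
})

# Flag unusually long tracks (over 10 minutes) as outlier candidates
long_tracks = df[df["duration_ms"] > 10 * 60 * 1000]
print(long_tracks["name"].tolist())

# Keep only the rows within the threshold
cleaned = df[df["duration_ms"] <= 10 * 60 * 1000]
```

For the double-tempo check you would inspect the `tempo` distribution the same way and decide case by case, since a genuinely fast song and a mis-detected one look identical in the numbers.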

That said

The data available from this API has no missing values and is already in a form that a machine learning model can readily consume. Therefore it is not very suitable as study material for preprocessing.

Also, tempo (BPM), key, and time_signature (meter) are not necessarily constant within a single song. It is worth considering in the first place whether they should be analyzed at all. This is a point to check during EDA.

I would like to cover concrete examples of preprocessing in a separate article. In this article, instead of preprocessing proper, we add human-readable label values to new columns as part of EDA.

python


# Mode as a label value in a new column: major is 1, minor is 0
tracks_with_features_df.loc[tracks_with_features_df['mode'] == 1, 'a_mode'] = 'major'
tracks_with_features_df.loc[tracks_with_features_df['mode'] == 0, 'a_mode'] = 'minor'

# Key as a label value in a new column: 0 is C, 1 is C#, 2 is D, ...
tracks_with_features_df.loc[tracks_with_features_df['key'] == 0, 'a_key'] = 'C'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 1, 'a_key'] = 'C#'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 2, 'a_key'] = 'D'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 3, 'a_key'] = 'D#'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 4, 'a_key'] = 'E'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 5, 'a_key'] = 'F'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 6, 'a_key'] = 'F#'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 7, 'a_key'] = 'G'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 8, 'a_key'] = 'G#'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 9, 'a_key'] = 'A'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 10, 'a_key'] = 'A#'
tracks_with_features_df.loc[tracks_with_features_df['key'] == 11, 'a_key'] = 'B'

# Duration conversion: milliseconds → seconds
tracks_with_features_df['a_second'] = tracks_with_features_df['duration_ms'] / 1000
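As an aside, the chained `.loc` assignments above can be written more compactly with `map()`, and `pd.to_timedelta` handles the h:M:S conversion from the to-do list. A sketch on a tiny toy DataFrame standing in for `tracks_with_features_df`:

```python
import pandas as pd

# Toy stand-in for tracks_with_features_df: mode, key, duration in ms
df = pd.DataFrame({
    "mode": [1, 0],
    "key": [0, 9],
    "duration_ms": [225_000, 3_723_000],
})

# map() replaces the chained .loc assignments: one dict per mapping
key_names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
df["a_mode"] = df["mode"].map({1: "major", 0: "minor"})
df["a_key"] = df["key"].map(dict(enumerate(key_names)))

# milliseconds → h:M:S string via Timedelta (e.g. "0 days 00:03:45")
df["a_length"] = pd.to_timedelta(df["duration_ms"], unit="ms").astype(str)
print(df[["a_mode", "a_key", "a_length"]])
```

`map()` also has the nice property of producing NaN for any value outside the mapping, which makes unexpected codes easy to spot.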

Summary

Using data from the Spotify API, we walked through visualization and light preprocessing of a song's audio features. Next time, we will visualize the similarity between songs based on their audio features. See you then.
