[PYTHON] How to read your own data or external data from the Internet with scikit-learn instead of bundled datasets such as iris

Many books and scikit-learn tutorials on the Internet use bundled datasets such as iris and cancer. Reproducing the same results easily is certainly reassuring, but I suspect many people feel it is hard to gain deeper learning from those datasets alone. In this article, I will introduce how to read your own data, or external data from the net, and analyze it with scikit-learn. (Verification environment: Windows 10, Anaconda3, Python 3.7.6, Jupyter Notebook 6.0.3. First draft released 2020/3/23.)

CSV file preparation

In this article, as an example, I use the World Happiness Report dataset published on Kaggle, a machine learning / data science community (https://www.kaggle.com/unsdsn/world-happiness). Kaggle requires user registration, but I chose it because it hosts many datasets that are easy to use for machine learning. Download the data from the Download (79 KB) button. Unzipping the zip file yields 5 CSV files; here we will use 2019.csv.

When using other files

- 2019.csv is laid out as "feature names on the first line, data on the second and subsequent lines" so that it can be read easily by the Python data analysis library **pandas**. If your data is laid out differently, reshape it by deleting the extra rows in Excel or a similar tool.

- If the file is in a different format, such as an Excel file (.xls), open it in Excel or similar, do "File - Save As", and choose the CSV file format. If a delimiter can be selected, leave it as , (comma). (A pandas-based alternative is sketched after this list.)

- It is easiest to save the file in the folder where the Python executable file (.py or .ipynb file) is located.
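As a minimal sketch of the pandas alternative mentioned above (the filename data.xls is hypothetical; reading .xls requires the xlrd package, and .xlsx requires openpyxl):

import pandas as pd

# Hypothetical filename; replace with your own Excel file
df_excel = pd.read_excel('data.xls')
# index=False avoids writing the row index as an extra column;
# the output is comma-separated by default
df_excel.to_csv('data.csv', index=False)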

Read CSV file

You can load a CSV file directly without any library, but to make the rest of the process easier, this article uses pandas. (If pandas is not installed yet, please refer to [this article](https://www.sejuku.net/blog/75508) or similar.)

import pandas as pd
df = pd.read_csv('2019.csv')

If the delimiter is a tab, add the argument sep='\t'; if the file contains Japanese, add the argument encoding='shift_jis'.

df = pd.read_csv('filename.csv', sep='\t', encoding='shift_jis')

If you want to put the data file in a different location from the executable file, specify a relative path, e.g. df = pd.read_csv('data/2019.csv'). Reference → Mutual conversion / judgment of absolute path and relative path with Python, pathlib
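As a minimal sketch of handling such paths with pathlib (assuming the file sits in a data subfolder next to the notebook):

import pandas as pd
from pathlib import Path

csv_path = Path('data') / '2019.csv'  # relative path to the data folder
print(csv_path.resolve())             # check the corresponding absolute path
df = pd.read_csv(csv_path)            # pandas accepts Path objects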

Confirmation of feature names and data count

print("Confirmation of dataset key (feature amount name)==>:\n", df.keys())
print('Check the number of rows and columns in the dataframe==>\n', df.shape)

When you run the above command,

Check the dataset keys (feature names) ==>:
 Index(['Overall rank', 'Country or region', 'Score', 'GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption'],
      dtype='object')
Check the number of rows and columns in the dataframe ==>
 (156, 9)

This confirms that 156 samples with 9 features were read.

Process missing values, etc. (feature engineering)

Check whether the data contains any missing values (null), and check each column's data type: integers only (int), numbers including decimals (float), or strings / a mixture of strings and numbers (object).

# Check the number of non-null entries and the data type of each column in the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 9 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Overall rank                  156 non-null    int64  
 1   Country or region             156 non-null    object 
 2   Score                         156 non-null    float64
 3   GDP per capita                156 non-null    float64
 4   Social support                156 non-null    float64
 5   Healthy life expectancy       156 non-null    float64
 6   Freedom to make life choices  156 non-null    float64
 7   Generosity                    156 non-null    float64
 8   Perceptions of corruption     156 non-null    float64
dtypes: float64(7), int64(1), object(1)
memory usage: 11.1+ KB

Every column has 156 non-null entries ⇒ **no missing values**. Overall rank is an integer, Country or region is a string, and the rest are numbers including decimals, so the data types are as intended. If a column that should contain numbers has become an object, run the following (replace 'feature name here' with the column name):

# Extract elements that cannot be converted to numbers
objectlist = df[['feature name here']][df['feature name here'].apply(lambda s: pd.to_numeric(s, errors='coerce')).isnull()]
objectlist

Running the above extracts the entries that are being treated as strings.

This dataset contained no missing values and no mixtures of strings and numbers, but real-world data often contains values that are unsuitable for analysis as-is: blanks, characters or symbols other than "0" used to mean zero, numbers with units attached, and so on.

Please refer to this article etc. and perform the appropriate processing (feature engineering).
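As a minimal sketch of such processing (the column name 'Value' and the replacement rules are hypothetical examples, not columns of this dataset):

# Hypothetical column containing strings such as '12kg', '-', or blanks
df['Value'] = (df['Value'].astype(str)
               .str.replace('kg', '', regex=False)  # strip units
               .replace({'-': None, '': None}))     # symbols/blanks standing in for zero or missing
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')  # anything non-numeric becomes NaN
df['Value'] = df['Value'].fillna(df['Value'].median())     # fill missing values, e.g. with the median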

Create an object (empty dataset) with scikit-learn's data class

import sklearn
worldhappiness = sklearn.utils.Bunch()

Change the worldhappiness part to a name that represents your dataset.

Put data in the dataset

# Put 'Score' (the happiness score) into the objective variable 'target'
worldhappiness['target'] = df['Score']
# Put the explanatory variables into 'data'
worldhappiness['data'] = df.loc[:, ['GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']]

↑ This specifies the 6 columns other than the first 3 (the objective variable and ID-like features not used in the analysis). It is easiest to copy and paste from the output of "Confirmation of feature names and data count" above.
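As an equivalent alternative (optional, assuming the column order shown above), the same columns can be selected by position or by dropping the unused ones:

worldhappiness['data'] = df.iloc[:, 3:]  # every column from the 4th onward
# or drop the objective variable and ID-like columns explicitly:
# worldhappiness['data'] = df.drop(columns=['Overall rank', 'Country or region', 'Score'])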

# Including the feature names is optional; they can be used for graph legends
worldhappiness['feature_names'] = ['GDP per capita',
       'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption']
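Incidentally, sklearn.utils.Bunch is a dictionary subclass that also supports attribute access, matching the interface of the bundled datasets such as load_iris():

# Both access styles return the same object
print(worldhappiness['target'].equals(worldhappiness.target))  # True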

Split into training and test sets

# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    worldhappiness['data'], worldhappiness['target'], random_state=0)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
X_train shape: (117, 6)
X_test shape: (39, 6)

The data was split into 117 training samples and 39 test samples. (The 6 is the number of explanatory variables.)
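As a minimal sketch of a possible next step (linear regression is just one choice here, not something this article prescribes):

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)                      # train on the training set
print('Test R^2:', model.score(X_test, y_test))  # coefficient of determination on the test set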

Conclusion

With this, you should be able to move on to analysis with machine learning. If you find any mistakes or have questions, please feel free to comment.
