[PYTHON] Studying Machine Learning-Pandas Edition-

Continuing from last time, this time I studied Pandas, so I am posting my notes as output.

1. What is Pandas?

Pandas is an external library for organizing data in tabular form, and it is also useful for visualization and preprocessing. Like NumPy, it comes pre-installed with Anaconda.

2. Import statement

test.ipynb


import pandas as pd

Typing the full library name every time is tedious, so import pandas under the alias pd.

3. When do you use it?

  1. Data organization
  2. Data visualization
  3. Data preprocessing

All are important elements in machine learning.

4. Main functions

DataFrame function

As the name implies, pd.DataFrame creates a data frame.

test.ipynb


# Build a data frame by hand from a dictionary
test = pd.DataFrame({'culumn1':[1,2,3,4],
                     'culumn2':[3,4,5,6]})

# Load existing data from a CSV file
data = pd.read_csv('csv file path')

DataFrame accepts a dictionary: each key becomes a column name and each list becomes that column's values. The read_csv function takes the path to a CSV file, reads the file, and returns a data frame.
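As a small sketch of both ways of creating a data frame, the example below builds one from a dictionary and reads another from CSV text. The column names and CSV contents are made up for illustration, and io.StringIO stands in for a real file path so the snippet runs without any file on disk.

```python
import io

import pandas as pd

# Dictionary keys become column names; each list becomes one column.
scores = pd.DataFrame({"math": [80, 65, 90], "english": [70, 85, 60]})

# Hypothetical CSV content; io.StringIO stands in for a real file path.
csv_text = "name,age\nalice,30\nbob,25\n"
people = pd.read_csv(io.StringIO(csv_text))

print(scores.shape)          # (3, 2)
print(list(people.columns))  # ['name', 'age']
```

With a real file, you would pass its path as a string to read_csv instead of the StringIO object.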

Check data frame

You can inspect a data frame with the following methods and attributes.

test.ipynb


# Check the overview (column types, non-null counts)
test.info()
# Check the first 3 rows
test.head(3)
# Check the last 3 rows
test.tail(3)
# Check all column names
test.columns
# Check the index
test.index

head and tail display as many rows as the number passed as the argument. If no argument is given, 5 rows are displayed by default.
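To make the row counts concrete, here is a minimal sketch using a made-up 10-row frame (the name numbers and the column n are only for illustration).

```python
import pandas as pd

# Hypothetical 10-row frame, just to make the row counts visible.
numbers = pd.DataFrame({"n": range(10)})

first_default = numbers.head()  # no argument: first 5 rows
last_three = numbers.tail(3)    # last 3 rows

print(len(first_default))     # 5
print(list(last_three["n"]))  # [7, 8, 9]
```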

Delete and add columns

test.ipynb


# Add a column
test['column3'] = [5,6,7,8]
# Delete a column (drop returns a new data frame, so assign it back)
test = test.drop(columns='column3')

To add a column, assign data to a column name that does not yet exist in the data frame. To delete a column, pass its name to the columns argument of the drop function.
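One detail that is easy to miss: assigning a new column changes the data frame in place, while drop returns a new data frame and leaves the original alone. The sketch below shows the difference with made-up column names a, b, and c.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3, 4], "b": [3, 4, 5, 6]})

# Assigning to a new column name adds the column in place.
frame["c"] = [5, 6, 7, 8]

# drop returns a new frame; the original keeps column "c"
# unless the result is assigned back to it.
without_c = frame.drop(columns="c")

print(list(frame.columns))      # ['a', 'b', 'c']
print(list(without_c.columns))  # ['a', 'b']
```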

Data reference

test.ipynb


# Name the index labels because the default numbers are hard to read
test.index = ['test1','test2','test3','test4']
# Extract a specific value (row label, column name)
test.loc['test1','culumn1']
# Get a specific column (":" selects every row)
test.loc[:,'culumn1']

The default numeric index makes it hard to tell the rows apart, so I assigned named index labels. Passing an index label and a column name to loc returns the corresponding value. To get an entire column, pass : in place of the index label together with the column name.
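The three common loc patterns (one cell, one column, one row) can be sketched as follows. The row labels r1 to r4 and the column names are assumptions made up for this example.

```python
import pandas as pd

frame = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [3, 4, 5, 6]},
    index=["r1", "r2", "r3", "r4"],
)

single_value = frame.loc["r1", "a"]  # one cell: row "r1", column "a"
whole_column = frame.loc[:, "b"]     # ":" selects every row
whole_row = frame.loc["r2"]          # one row, returned as a Series

print(single_value)        # 1
print(list(whole_column))  # [3, 4, 5, 6]
print(whole_row["b"])      # 4
```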

Missing value processing

Handling missing values is especially important in machine learning.

test.ipynb


# Import numpy to use nan
import numpy as np

# Create data with missing values
test = pd.DataFrame({'column1':[1,2,np.nan,4],
                     'column2':[5,np.nan,7,np.nan]})

# Display the data frame
print(test)
# Count the missing values in each column
test.isnull().sum()

Missing data is displayed as NaN. The isnull function marks each missing value as True, and since True counts as 1, chaining sum gives the number of missing values in each column. Missing values cause trouble when graphing, so next I will write how to deal with them.
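As a quick check of how the count comes about, the sketch below splits isnull and sum into two steps, using the same values as the data above (the variable names are just for illustration).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, np.nan, 4], "b": [5, np.nan, 7, np.nan]})

mask = frame.isnull()            # True where a value is missing
missing_per_column = mask.sum()  # True counts as 1 when summed

print(missing_per_column["a"])  # 1
print(missing_per_column["b"])  # 2
```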

test.ipynb


# Drop rows that contain missing data
test_dropna = test.dropna()
# Replace missing data with each column's mean
test_fillna = test.fillna(test.mean())

The dropna function drops every row that contains NaN. The fillna function replaces NaN with another value; this time I used the mean of each column. In practice, it seems that missing values are usually filled in with fillna rather than deleted, unless the training data is so large that losing a few rows causes no problem.
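To see what each approach costs, here is a small sketch on the same values as above. With this particular data only one row has no NaN, so dropna discards three of the four rows, while fillna keeps them all.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, np.nan, 4], "b": [5, np.nan, 7, np.nan]})

dropped = frame.dropna()             # keeps only rows with no NaN
filled = frame.fillna(frame.mean())  # NaN -> mean of that column

print(len(dropped))        # 1
print(filled.loc[1, "b"])  # 6.0, the mean of 5 and 7
print(filled.loc[2, "a"])  # the mean of 1, 2 and 4
```

The mean is only one possible fill value; which value is appropriate depends on the data.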

5. Summary

I felt that Pandas lets me reshape and organize data however I like. I was surprised that it can do so much more than NumPy, though that makes sense because Pandas is built on top of NumPy.

NumPy is the foundation for all of this, so it is worth learning both Pandas and NumPy well.

That's all.
