[PYTHON] Make a note of the list of basic Pandas usage

Pandas is indispensable for machine-learning work in Python, but I often forget how to use it, so I wrote down the functions I use most frequently. I plan to cover further operations I learn in separate articles. I hope this is helpful both for people who have just started using Pandas and for those who want a quick reminder of how an operation works.

Since this is a beginner's memorandum, some of the content may be incorrect. If you find any mistakes, I would appreciate it if you could let me know.

Operating environment

Operation contents introduced in the article

The following operation methods are described in this article.

  1. Basic operation of Pandas
     - Import Library
     - Read csv file
     - Export to csv file
     - Check data type
     - Display the number of data
     - Check the number of missing data
     - Check the basic statistics of the data
     - Perform one-hot-encoding for category data
     - Add label and data
     - Delete Label
     - Fill missing data with specified value
  2. Use Pandas conveniently
     - Extract data by specifying conditions
     - Change data by specifying conditions (a warning is issued, so improvement is required)
     - Perform processing for groups with groupby

Summary

Here are some of the features I use most often in Pandas. The following two operations work for the time being, but I am still using them with only a vague understanding, so I will investigate and summarize them another time.
  - Change data by specifying conditions
  - Perform processing for groups with groupby

1. Basic operation of Pandas

Import the Pandas library.

import pandas as pd 

Use the read_csv function to read a csv file as a DataFrame object. Here, I read a file called "student.csv" in the working directory.

data = pd.read_csv("student.csv")
display(data.head(5))

[**Supplement 1: Reading a csv without a header**] If the csv file does not contain a header row ("sex", "age", "height", "weight"), the first data row (NaN, 13, 151.7, 59.1) is read as the header. To prevent this, specify *header=None*.
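
The effect of the header argument can be checked with a small sketch like the one below. The csv content is made-up sample data held in a string (via io.StringIO) instead of a real file, and the names argument is an addition that is not used in this article; it simply supplies labels for the headerless data.

```python
import io
import pandas as pd

# Hypothetical headerless csv content standing in for a file on disk.
csv_text = "male,13,151.7,59.1\nfemale,14,155.2,48.3\n"

# Without header=None, the first data row is consumed as the column names.
wrong = pd.read_csv(io.StringIO(csv_text))

# With header=None every row is kept as data; names= supplies the labels.
right = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["sex", "age", "height", "weight"],
)

print(len(wrong))            # number of data rows when the first row became the header
print(len(right))            # number of data rows with header=None
print(list(right.columns))
```

If names is omitted, header=None labels the columns with the integers 0, 1, 2, and so on.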

To write the DataFrame object to a csv file, use the to_csv method. In the example, it is saved in the working directory with the file name "student_out.csv".

data.to_csv("student_out.csv", index=False)

Specify *index=False* so that the index (the row labels of the data) is not written to the file. If you are not sure what the index is, generate the csv file without *index=False* and look at the extra first column.
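
To see the difference without creating files, the following sketch relies on the fact that to_csv returns the csv text as a string when no path is given. The small DataFrame is made-up sample data, not the student.csv used in this article.

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female"], "age": [13, 14]})

# With no path argument, to_csv returns the csv text instead of writing a file.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)      # header starts with an empty name for the index column
print(without_index)   # only the data columns are written
```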

To see the types of data contained in a DataFrame, look at the dtypes attribute of the DataFrame object.

display(data.dtypes)

To get the data type of a single label, do as follows.

display(data["age"].dtype)

Use the count method to display the number of data items for each label. There are 1000 rows in total, but the count for "sex" is less than 1000 because missing data is not counted.

display(data.count())

[**Supplement 1: Get the number of data for a single label**] To get the number of data for a single label, do as follows.

display(data["sex"].count())
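
The difference between the number of rows and the value returned by count can be illustrated as follows. This is a minimal sketch on made-up data; len is used here only for comparison and does not appear elsewhere in this article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex": ["male", np.nan, "female"], "age": [13, 14, 15]})

total_rows = len(df)            # counts every row, missing or not
sex_count = df["sex"].count()   # counts only the non-missing values

print(total_rows, sex_count)
```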

To check the number of missing data, use the isnull method and the sum method.

display(data.isnull().sum())

[**Supplement 1: Behavior of the isnull method**] According to the official documentation, the isnull method returns a DataFrame object of the same size as the original DataFrame, in which None and numpy.NaN are mapped to True and everything else to False.

display(data.isnull().head(5))

[**Supplement 2: Behavior of the sum method**] The sum method returns the sum along the specified axis. In Python, True is treated as 1 and False as 0, so the total is the number of True values, that is, the number of missing data. The following is a quote from Reference 4. The quote is from the Python 3.8.1 documentation; I think the same applies to other versions, but I have not confirmed it.

Boolean values are the two constant objects False and True. They are used to represent truth values (although other values can also be considered false or true). In numeric contexts (for example, when used as the argument to an arithmetic operator), they behave like the integers 0 and 1, respectively. The built-in function bool() can be used to convert any value to a Boolean, if the value can be interpreted as a truth value (see the Truth Value Testing section above).
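
The two steps described in the supplements can be traced on a tiny example. This is only a sketch on made-up data; it shows that isnull keeps the shape of the DataFrame and that summing the resulting True/False values gives the number of missing entries per label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex": ["male", None, np.nan], "age": [13, 14, np.nan]})

mask = df.isnull()     # same shape as df, True where a value is missing
missing = mask.sum()   # label-wise sum: each True counts as 1

print(True + True)     # plain Python booleans also add like integers
print(mask)
print(missing)
```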

[reference]

  1. Pandas official documentation: isnull()
  2. Pandas official documentation: sum()
  3. note.nkmk.me: Count the number of elements that meet certain conditions with pandas
  4. Python official documentation: Built-in Types

To see summary statistics of the data contained in the DataFrame, use the describe method. The describe method ignores NaN when computing the statistics.

display(data.describe(include="all"))

[**Supplement 1: Aggregate data other than numerical data**] By default, only numerical data is aggregated, so *include="all"* is specified here so that "sex" is aggregated as well. Note that the statistics reported differ between numerical data and other data.
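
The following sketch, on made-up data, compares the default call with include="all". For numerical labels describe reports statistics such as mean, std, min and max, while for other labels it reports unique, top and freq; rows that do not apply to a label are filled with NaN.

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "male"], "age": [13, 14, 15]})

numeric_only = df.describe()              # default: numerical labels only
everything = df.describe(include="all")   # also summarizes "sex"

print(list(numeric_only.columns))
print(everything)
```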

Reference 1: pandas official document describe ()

Use the get_dummies method to perform one-hot-encoding. The following is an example of performing one-hot-encoding for "sex".

# Perform one-hot encoding on "sex".
dummydf_sex = pd.get_dummies(data, columns=["sex"], dummy_na=True)

# Original data
display(data.head(5))
# One-hot encoded data
display(dummydf_sex.head(5))

In this way, when one-hot encoding is performed, a new label is created for each value of "sex" ("male", "female", NaN), namely "sex_male", "sex_female" and "sex_nan", and the 1 and 0 in those labels show what the original value was.

[**Supplement 1: Treat missing data as a label**] By default, missing data (NaN) is ignored, but "the data is missing" is also useful information, so *dummy_na=True* is passed to get_dummies so that NaN is one-hot encoded as a value of its own.
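
The effect of dummy_na can be compared side by side with a sketch like this one, again on made-up data. The dtype=int argument is an addition that is not used in this article: recent pandas versions return True/False by default, and dtype=int asks for 1/0 instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", np.nan], "age": [13, 14, 15]})

# dtype=int requests 1/0 values instead of True/False.
without_nan = pd.get_dummies(df, columns=["sex"], dtype=int)
with_nan = pd.get_dummies(df, columns=["sex"], dummy_na=True, dtype=int)

print(list(without_nan.columns))
print(list(with_nan.columns))
print(with_nan)
```

Without dummy_na=True, the row whose "sex" is missing simply has 0 in every "sex_" label, so it cannot be told apart from the encoded values themselves.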

[reference]

  1. Pandas official document get_dummies ()

Try adding new labels and data to the DataFrame object. Let's create a BMI label as an example. An easy way is to (step 1) create a list of data and (step 2) add it as a new label, as shown below.

# Step 1: Create a list of BMI values (weight in kg divided by the square of height in m). This step does not use Pandas.
bmi = [w / (h / 100) ** 2 for w, h in zip(data["weight"], data["height"])]

# Step 2: Add the list to the DataFrame as the data of a new "bmi" label.
data["bmi"] = bmi

Let's display the result.

display(data.head(5))

You can see that the "bmi" label and data have been added to the DataFrame.
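
As an alternative to building a list first, the same label can be computed directly with column arithmetic. The sketch below uses made-up data and assumes, as in this article, that height is in centimeters and weight in kilograms, with BMI defined as weight divided by the square of height in meters.

```python
import pandas as pd

df = pd.DataFrame({"height": [150.0, 160.0], "weight": [45.0, 64.0]})

# Column arithmetic is applied element by element, so no explicit loop is needed.
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2

print(df)
```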

[**Supplement 1: How to use the assign method**] You can also add a label with the assign method. The assign method also accepts a function that computes the new data from the DataFrame. Let's add a label for the proper weight (proper_weight).

data = data.assign(proper_weight=lambda x: (x.height / 100.0) ** 2 * 22)
display(data.head(5))

[reference]

  1. Pandas Official Document Tutorial Setting
  2. Pandas Official Document assign ()

Delete a label and the data it contains. To delete the "bmi" label added above, do as follows.

# Remove the "bmi" label
data.drop(columns=["bmi"], inplace=True)

Displaying the DataFrame again confirms that the "bmi" label has been removed.

[**Supplement 1: Reflect changes in the original DataFrame**] By default, the drop method returns a new DataFrame object and makes no changes to the original DataFrame object. You can apply the change to the original DataFrame object by specifying *inplace=True*.
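
The two behaviors can be compared with a small sketch on made-up data. The variable names are hypothetical and exist only to record the labels at each step.

```python
import pandas as pd

df = pd.DataFrame({"age": [13, 14], "bmi": [20.0, 25.0]})

# Default: a new DataFrame is returned and df itself is untouched.
dropped = df.drop(columns=["bmi"])
cols_after_default = list(df.columns)

# inplace=True: df itself is modified.
df.drop(columns=["bmi"], inplace=True)
cols_after_inplace = list(df.columns)

print(cols_after_default, list(dropped.columns))
print(cols_after_inplace)
```

Reassigning the result, as in data = data.drop(columns=["bmi"]), has the same visible effect as inplace=True and is another common way to write it.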

[reference]

  1. Pandas Official Document drop ()

Use the fillna method to fill the missing data with the specified value. Let's fill in the missing part of "sex" with "unknown".

data.fillna(value={"sex": "unknown"}, inplace=True)

Let's check the result.

display(data.head(5))

This confirms that the missing values of "sex" have been replaced with "unknown".
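
The dictionary passed to fillna only affects the labels it names. The following sketch on made-up data shows that a missing value in another label is left as it is; unlike the code above, it does not use inplace=True, so the original DataFrame is also left unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"sex": ["male", np.nan], "age": [13.0, np.nan]})

# The dict maps each label to its own fill value; labels not listed stay as they are.
filled = df.fillna(value={"sex": "unknown"})

print(filled)
```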

[reference]

  1. Pandas Official Document fillna ()

2. Use Pandas conveniently

Specify a condition and extract the data that matches it. As an example, let's create a new DataFrame object (data_over) by extracting the rows whose weight exceeds the proper weight (proper_weight).

data_over = data[data.weight > data.proper_weight]
display(data_over.head(5))
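
What happens inside the square brackets can be broken into steps, as in the sketch below on made-up data. The second extraction, which combines two conditions with &, is an addition that does not appear in this article.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0],
    "weight": [60.0, 50.0, 80.0],
})
df = df.assign(proper_weight=lambda x: (x.height / 100.0) ** 2 * 22)

# The comparison yields a boolean Series; indexing with it keeps the True rows.
mask = df["weight"] > df["proper_weight"]
over = df[mask]

# Conditions are combined with & (and) or | (or), each wrapped in parentheses.
tall_and_over = df[(df["height"] >= 165) & mask]

print(mask.tolist())
print(over)
print(tall_and_over)
```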

[reference]

  1. Pandas Official Document Tutorial Boolean Indexing
  2. Pandas Official Document Indexing and selecting data

Specify a condition to change only the data that matches the condition. As an example, let's set the weight of data with a height of 150 or less to 0.

data_over["weight"][data_over.height <= 150] = 0
display(data_over.head(5))

This does what I want for the time being, but it raises a warning. I looked into it briefly but could not fully understand it, so I would like to investigate it and write an article at a later date.
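
The warning is probably pandas' SettingWithCopyWarning, which is issued when a value is assigned through two chained indexing steps on data that was itself extracted from another DataFrame. One commonly recommended way to avoid it is to take an explicit copy and assign with .loc in a single step. The following is only a sketch of that approach on made-up data, not a verified fix for the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "height": [148.0, 160.0, 150.0],
    "weight": [60.0, 70.0, 55.0],
})

# Take an explicit copy so the extracted rows are independent of df.
subset = df[df["weight"] > 50].copy()

# .loc[row_condition, label] selects and assigns in a single step.
subset.loc[subset["height"] <= 150, "weight"] = 0

print(subset)
print(df)   # the original DataFrame is unchanged
```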

Use the groupby method when you want to process the data group by group, where the rows that share the same value form one group. As an example, let's output the average value for each sex.

display(data.groupby("sex").mean())
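
The grouping can be followed on a small made-up DataFrame. The agg call, which applies a different aggregation to each label, is an addition that does not appear in this article.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "height": [160.0, 150.0, 170.0, 154.0],
    "weight": [60.0, 45.0, 70.0, 47.0],
})

# Rows with the same "sex" value form one group; mean() is computed per group.
means = df.groupby("sex").mean()

# agg() allows a different aggregation for each label.
summary = df.groupby("sex").agg({"height": "max", "weight": "mean"})

print(means)
print(summary)
```

The grouping label ("sex" here) becomes the index of the result, so a single value can be read with, for example, means.loc["male", "height"].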

[reference]

  1. Pandas Official Document groupby ()
  2. Pandas Official Document GroupBy Object
  3. Qiita: How to use Pandas groupby

Reference: Pandas Official Document
