[PYTHON] Getting Started with pandas: Basic Knowledge to Remember First

pandas has many operations to learn, so I put together a personal summary of the ones worth remembering first.

If you are about to start using pandas, I hope this serves as a reference.

Series / DataFrame / Panel

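- Series: Represents one-dimensional data.
- DataFrame: Represents two-dimensional data.
- Panel: Represents three-dimensional data.

Note that Panel was removed in pandas 0.25, so current versions provide only Series and DataFrame.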

In mathematical terms, a Series corresponds to a vector and a DataFrame to a matrix.

If you see pd.DataFrame() in a program, you can read it as "two-dimensional data is being created here."
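As a quick check of the dimensions, here is a minimal sketch that builds one of each and looks at ndim and shape. The variable names s and df and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
s = pd.Series([100, 200, 300])                        # one-dimensional
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})   # two-dimensional

print(s.ndim, s.shape)    # 1 (3,)
print(df.ndim, df.shape)  # 2 (3, 2)
```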

Basic creation operations

Import pandas as follows.

test.py


import pandas as pd

All of the code below omits import pandas as pd; assume it has already been imported.

Let's create a Series first.

test.py


cp = [100,200,300,400]
print(pd.Series(cp))
# 0    100
# 1    200
# 2    300
# 3    400
# dtype: int64

Unlike a list, a Series shows an index alongside the values when printed. The index is assigned automatically, but you can also define it yourself.

test.py


cp = [100,200,300,400]
cp_index = ["Jan","Feb","Mar","Apr"]
print(pd.Series(cp, index=cp_index))
# Jan    100
# Feb    200
# Mar    300
# Apr    400
# dtype: int64
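Once the index is defined, it can be used to look up values. The following is a minimal sketch using the same data as above; the variable name s is introduced here only for illustration.

```python
import pandas as pd

cp = [100, 200, 300, 400]
cp_index = ["Jan", "Feb", "Mar", "Apr"]
s = pd.Series(cp, index=cp_index)

print(s["Feb"])       # look up by label -> 200
print(s.iloc[1])      # look up by position -> 200
print(list(s.index))  # ['Jan', 'Feb', 'Mar', 'Apr']
```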

Next, create a DataFrame.

test.py


cp = {
	"a": [400,300,200,100],
	"b": [100,200,300,400]
}
cp_index = ["Jan","Feb","Mar","Apr"]
print(pd.DataFrame(cp, index=cp_index))
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300
# Apr  100  400

The DataFrame takes the tabular form familiar from Excel.

A DataFrame has column labels (a and b) in addition to the index. That is why cp is defined as a dictionary: its keys become the column names.

You can also build the same DataFrame from a list of rows:

test.py


df = pd.DataFrame([[400, 100],
                   [300, 200],
                   [200, 300],
                   [100, 400]],
                  index=["Jan", "Feb", "Mar", "Apr"],
                  columns=["a", "b"])
print(df)
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300
# Apr  100  400
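To confirm that the two ways of writing it really produce the same table, here is a small sketch that builds both and compares them with DataFrame.equals. The names from_dict and from_rows are chosen here for illustration.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr"]

# Column-oriented: dictionary keys become the column labels.
from_dict = pd.DataFrame({"a": [400, 300, 200, 100],
                          "b": [100, 200, 300, 400]},
                         index=months)

# Row-oriented: each inner list is one row, so columns are named explicitly.
from_rows = pd.DataFrame([[400, 100], [300, 200], [200, 300], [100, 400]],
                         index=months, columns=["a", "b"])

print(from_dict.equals(from_rows))  # True
print(list(from_rows.columns))      # ['a', 'b']
```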

In practice, however, you will usually load data from a CSV file rather than type it in by hand.

So here is how to read data from a CSV file.

sample.csv


,a,b
Jan,400,100
Feb,300,200
Mar,200,300
Apr,100,400

Use read_csv() to read the CSV file.

test.py


df = pd.read_csv("sample.csv", index_col=[0])
print(df)
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300
# Apr  100  400

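index_col takes the position(s) of the column(s) to use as the index. Here it is given as a list, [0], meaning the first column; a single integer such as index_col=0 also works.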
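The effect of index_col is easier to see by comparing it with a read that does not set it. The sketch below keeps the contents of sample.csv in a string and reads it through io.StringIO so that it runs without a file on disk; the variable names are chosen for illustration.

```python
import io
import pandas as pd

# Same contents as sample.csv, held in memory instead of on disk.
csv_text = ",a,b\nJan,400,100\nFeb,300,200\nMar,200,300\nApr,100,400\n"

with_list = pd.read_csv(io.StringIO(csv_text), index_col=[0])
with_int = pd.read_csv(io.StringIO(csv_text), index_col=0)
no_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_list.index))       # ['Jan', 'Feb', 'Mar', 'Apr']
print(with_list.equals(with_int))  # True
print(list(no_index.columns))      # ['Unnamed: 0', 'a', 'b']
```

Without index_col, the month names stay in an ordinary column and pandas assigns the default 0, 1, 2, 3 index.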

Basic extraction operations

The code below uses the same df as above.

↓ Extract the first 3 rows

test.py


print(df.head(3))
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300

↓ Extract by specifying a column

test.py


print(df["a"])
# Jan    400
# Feb    300
# Mar    200
# Apr    100
# Name: a, dtype: int64

↓ Extract rows by specifying a slice of row positions

test.py


print(df[1:3])
#        a    b
# Feb  300  200
# Mar  200  300

Difference between loc, iloc and ix

When extracting from a DataFrame, loc, iloc, and ix often come up. All three serve the same purpose: extracting data by specifying rows and columns.

So what's the difference?

loc

loc specifies rows and columns by label name.

test.py


print(df.loc[["Feb","Mar"]])
#        a    b
# Feb  300  200
# Mar  200  300

print(df.loc[:,["a"]])
#        a
# Jan  400
# Feb  300
# Mar  200
# Apr  100

By the way, the colon in df.loc[:, ["a"]] means "select all rows".
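loc can also take a row label and a column label together, and it accepts label slices. Here is a small sketch with the same df; note that a label slice includes its end label.

```python
import pandas as pd

df = pd.DataFrame({"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
                  index=["Jan", "Feb", "Mar", "Apr"])

# One row label and one column label -> a single value.
print(df.loc["Feb", "a"])                              # 300

# Label slice: both "Feb" and "Mar" are included.
print(df.loc["Feb":"Mar", "a"].tolist())               # [300, 200]

# Lists of labels on both axes -> a smaller DataFrame.
print(df.loc[["Jan", "Apr"], ["b"]].values.tolist())   # [[100], [400]]
```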

iloc

iloc specifies rows and columns by integer position.

test.py


print(df.iloc[[1,3]])
#        a    b
# Feb  300  200
# Apr  100  400

print(df.iloc[:,[0]])
#        a
# Jan  400
# Feb  300
# Mar  200
# Apr  100

print(df.iloc[[1,3],[0]])
#        a
# Feb  300
# Apr  100

print(df.iloc[1:3])
#        a    b
# Feb  300  200
# Mar  200  300

The i in iloc stands for integer: the pandas documentation describes it as integer-location based indexing.
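Because iloc works on positions, it behaves like ordinary Python indexing: negative numbers count from the end, and a slice excludes its end position. A minimal sketch with the same df:

```python
import pandas as pd

df = pd.DataFrame({"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
                  index=["Jan", "Feb", "Mar", "Apr"])

# Row position 1, column position 0 -> a single value.
print(df.iloc[1, 0])                  # 300

# Negative positions count from the end, so -1 is the last row (Apr).
print(df.iloc[-1].tolist())           # [100, 400]

# Position slice: the end position 3 is excluded.
print(df.iloc[1:3].index.tolist())    # ['Feb', 'Mar']
```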

ix

ix accepted either label names or integer positions for rows and columns.

However, when the index or columns are themselves integers, it is ambiguous which one is meant, so it is better to use loc and iloc for their respective purposes instead of ix. In fact, ix was deprecated in pandas 0.20 and removed in pandas 1.0, so it is no longer available in current versions.
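When you want to mix the two styles, for example rows by position and a column by label, one possible approach is to convert one side first and then use loc or iloc. The sketch below shows both directions; the names rows and col_pos are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
                  index=["Jan", "Feb", "Mar", "Apr"])

# Convert row positions to labels, then use loc.
rows = df.index[[1, 3]]                      # labels 'Feb' and 'Apr'
print(df.loc[rows, "a"].tolist())            # [300, 100]

# Or convert the column label to a position, then use iloc.
col_pos = df.columns.get_loc("a")            # 0
print(df.iloc[[1, 3], col_pos].tolist())     # [300, 100]
```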

Other operations you are likely to use

Here are a few more operations that I expect to use often.

test.py


# Compare every element against a condition to get True/False
print(df >= 300)
#          a      b
# Jan   True  False
# Feb   True  False
# Mar  False   True
# Apr  False   True

# Replace values less than 300 with NaN (the remaining values become floats)
print(df[df >= 300])
#          a      b
# Jan  400.0    NaN
# Feb  300.0    NaN
# Mar    NaN  300.0
# Apr    NaN  400.0

# Fill the missing values (NaN) with 0
print(df[df >= 300].fillna(0))
#          a      b
# Jan  400.0    0.0
# Feb  300.0    0.0
# Mar    0.0  300.0
# Apr    0.0  400.0

# Fill NaN with the mean of each column of the original df (250 for both)
print(df[df >= 300].fillna(df.mean()))
#          a      b
# Jan  400.0  250.0
# Feb  300.0  250.0
# Mar  250.0  300.0
# Apr  250.0  400.0
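Two related variations may also be useful. A condition on a single column keeps or drops whole rows instead of masking individual cells, and the mean used for filling can be taken from the masked data rather than from the original df. This is only a sketch; the names rows, masked, and filled are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
                  index=["Jan", "Feb", "Mar", "Apr"])

# A condition on one column selects whole rows.
rows = df[df["a"] >= 300]
print(rows.index.tolist())     # ['Jan', 'Feb']

# Mean of the values that remain after masking (NaN is skipped).
masked = df[df >= 300]
filled = masked.fillna(masked.mean())
print(filled["a"].tolist())    # [400.0, 300.0, 350.0, 350.0]
print(filled["b"].tolist())    # [350.0, 350.0, 300.0, 400.0]
```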

In closing

This is everything I wanted to cover before studying pandas in earnest.

With this much memorized, I feel I have finally reached the starting line.

I'm a pandas beginner myself, so I'll keep studying.
