pandas has a lot of operations to remember, so this is my personal summary of the ones I want to remember first.

If you are about to start using pandas, I hope it serves as a reference.

Series / DataFrame / Panel

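- **Series**: Represents one-dimensional data.
- **DataFrame**: Represents two-dimensional data.
- **Panel**: Represents three-dimensional data.

Note that Panel was deprecated in pandas 0.20 and has been removed in later versions, so current pandas only offers Series and DataFrame.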

In mathematical terms, a Series corresponds to a vector and a DataFrame corresponds to a matrix.

When you see pd.DataFrame() in a program, you can read it as "this creates two-dimensional data."

Basic creation operations

Import is done as follows.

`test.py`


import pandas as pd

The code below omits import pandas as pd; assume it has already been run.

Let's create a Series first.

`test.py`


cp = [100,200,300,400]
print(pd.Series(cp))
# 0    100
# 1    200
# 2    300
# 3    400
# dtype: int64

Unlike a list, a Series is printed with an index that is assigned automatically. You can also define this index yourself.

`test.py`


cp = [100,200,300,400]
cp_index = ["Jan","Feb","Mar","Apr"]
print(pd.Series(cp, index=cp_index))
# Jan    100
# Feb    200
# Mar    300
# Apr    400
# dtype: int64
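
Once a Series has its own index, the labels can be used to look values up. The following is a minimal sketch of that, reusing the same data as above; the variable name s is only chosen for illustration.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=["Jan", "Feb", "Mar", "Apr"])

# Look up a value by its label.
print(s["Feb"])       # 200

# Look up the same value by its integer position.
print(s.iloc[1])      # 200

# The labels themselves are available as s.index.
print(list(s.index))  # ['Jan', 'Feb', 'Mar', 'Apr']
```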

Next, create a DataFrame.

`test.py`


cp = {
	"a": [400,300,200,100],
	"b": [100,200,300,400]
}
cp_index = ["Jan","Feb","Mar","Apr"]
print(pd.DataFrame(cp, index=cp_index))
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300
# Apr  100  400

The DataFrame takes the tabular shape you often see in Excel.

A DataFrame has column labels (a and b) in addition to the index. That is why the variable cp is defined as a dictionary: each key becomes a column name.
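
To check how the dictionary keys and the index end up in the DataFrame, you can inspect a few attributes. This is only a small sketch using the same data as above; col is a name introduced here for illustration.

```python
import pandas as pd

cp = {"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]}
df = pd.DataFrame(cp, index=["Jan", "Feb", "Mar", "Apr"])

# The dictionary keys become the column labels.
print(list(df.columns))  # ['a', 'b']
print(list(df.index))    # ['Jan', 'Feb', 'Mar', 'Apr']

# 4 rows x 2 columns: two-dimensional data.
print(df.shape)          # (4, 2)

# A single column taken out of the DataFrame is a Series sharing the same index.
col = df["a"]
print(type(col).__name__)          # Series
print(col.index.equals(df.index))  # True
```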

You can also build the same DataFrame from a nested list, passing the index and the columns separately:

`test.py`


df = pd.DataFrame([[400, 100],
                   [300, 200],
                   [200, 300],
                   [100, 400]],
                  index=["Jan", "Feb", "Mar", "Apr"],
                  columns=["a", "b"])
print(df)
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300
# Apr  100  400

In practice, however, you will usually read the data from a CSV file rather than write it out by hand.

So next I will show how to read data from a CSV file.

`sample.csv`


,a,b
Jan,400,100
Feb,300,200
Mar,200,300
Apr,100,400

Use read_csv() to read the CSV file.

`test.py`


df = pd.read_csv("sample.csv", index_col=[0])
print(df)
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300
# Apr  100  400

index_col specifies which column(s) to use as the index, given by column position. Here it is passed as a list.
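
To see what index_col actually changes, it helps to read the same data with and without it. The sketch below is only an illustration: it reads the CSV text from an in-memory string through io.StringIO instead of sample.csv so that it runs on its own, and it passes a single integer to index_col, which read_csv also accepts. The names csv_text, df_default and df_indexed are made up for this example.

```python
import io
import pandas as pd

# Same contents as sample.csv, held in memory (an assumption for this sketch).
csv_text = ",a,b\nJan,400,100\nFeb,300,200\nMar,200,300\nApr,100,400\n"

# Without index_col, the first column is read as ordinary data
# and pandas assigns a default integer index.
df_default = pd.read_csv(io.StringIO(csv_text))
print(list(df_default.columns))  # ['Unnamed: 0', 'a', 'b']
print(list(df_default.index))    # [0, 1, 2, 3]

# With index_col=0, the first column becomes the index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(df_indexed.columns))  # ['a', 'b']
print(list(df_indexed.index))    # ['Jan', 'Feb', 'Mar', 'Apr']
```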

Basic extraction operations

The code below uses the same df as above.

↓ Extract the first 3 rows

`test.py`


print(df.head(3))
#        a    b
# Jan  400  100
# Feb  300  200
# Mar  200  300

↓ Extract by specifying a column

`test.py`


print(df["a"])
# Jan    400
# Feb    300
# Mar    200
# Apr    100
# Name: a, dtype: int64

↓ Extract rows by slicing on row position

`test.py`


print(df[1:3])
#        a    b
# Feb  300  200
# Mar  200  300

Difference between loc, iloc and ix

When extracting from a DataFrame, loc, iloc, and ix often come up. All three serve the same purpose: extracting data by specifying rows and columns.

So what's the difference?

loc

loc specifies rows and columns by **label name**.

`test.py`


print(df.loc[["Feb","Mar"]])
#        a    b
# Feb  300  200
# Mar  200  300

print(df.loc[:,["a"]])
#        a
# Jan  400
# Feb  300
# Mar  200
# Apr  100

By the way, the colon in df.loc[:, ["a"]] means "extract all rows".

iloc

iloc specifies rows and columns by **integer position**.

`test.py`


print(df.iloc[[1,3]])
#        a    b
# Feb  300  200
# Apr  100  400

print(df.iloc[:,[0]])
#        a
# Jan  400
# Feb  300
# Mar  200
# Apr  100

print(df.iloc[[1,3],[0]])
#        a
# Feb  300
# Apr  100

print(df.iloc[1:3])
#        a    b
# Feb  300  200
# Mar  200  300

The i in iloc stands for integer: the pandas documentation describes iloc as integer-location based indexing.
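
One difference that is easy to miss shows up when slicing: a label slice with loc includes its end point, while a position slice with iloc excludes it, like an ordinary Python slice. A minimal sketch with the same df as above (by_label and by_position are names made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
                  index=["Jan", "Feb", "Mar", "Apr"])

# Label slice: both "Feb" and "Mar" are included.
by_label = df.loc["Feb":"Mar"]

# Position slice: positions 1 and 2 are included, 3 is excluded.
by_position = df.iloc[1:3]

print(by_label.equals(by_position))  # True

# A single cell, by label and by position.
print(df.loc["Feb", "a"])  # 300
print(df.iloc[1, 0])       # 300
```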

ix lets you specify rows and columns either by label name or by number.

However, when the index or the columns are integers, it is ambiguous whether a number means a label or a position, so it is better to use loc and iloc for their respective purposes and avoid ix. In fact, ix was deprecated in pandas 0.20 and removed in pandas 1.0.
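
The ambiguity with integer labels is easy to reproduce with loc and iloc alone. The sketch below uses a small hypothetical DataFrame (not the df from above) whose index labels are integers in a shuffled order, so that label 0 and position 0 point at different rows.

```python
import pandas as pd

# Hypothetical data: the index labels are the integers 2, 0, 1 in that order.
df_int = pd.DataFrame({"a": [400, 300, 200]}, index=[2, 0, 1])

# loc treats 0 as a label: the row whose index label is 0.
print(df_int.loc[0, "a"])   # 300

# iloc treats 0 as a position: the first row.
print(df_int.iloc[0, 0])    # 400
```

With loc and iloc the intent is spelled out in the code, which is exactly what a single accessor accepting both kinds of number could not guarantee.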

Other operations you are likely to use

Here are a few more operations that I personally expect to use often.

`test.py`


# Apply a condition to get a DataFrame of True/False values
print(df >= 300)
#          a      b
# Jan   True  False
# Feb   True  False
# Mar  False   True
# Apr  False   True

# Replace values less than 300 with NaN (the columns become float because of NaN)
print(df[df >= 300])
#          a      b
# Jan  400.0    NaN
# Feb  300.0    NaN
# Mar    NaN  300.0
# Apr    NaN  400.0

# Fill the missing values (NaN) with 0
print(df[df >= 300].fillna(0))
#          a      b
# Jan  400.0    0.0
# Feb  300.0    0.0
# Mar    0.0  300.0
# Apr    0.0  400.0

# Fill the missing values (NaN) with each column's mean, computed from the original df
print(df[df >= 300].fillna(df.mean()))
#          a      b
# Jan  400.0  250.0
# Feb  300.0  250.0
# Mar  250.0  300.0
# Apr  250.0  400.0
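
A closely related pattern is to apply the condition to a single column, which selects whole rows instead of masking individual cells. The sketch below shows both side by side on the same df; rows and masked are names introduced for illustration, and the comparison with DataFrame.where is my own addition rather than something from the examples above.

```python
import pandas as pd

df = pd.DataFrame({"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
                  index=["Jan", "Feb", "Mar", "Apr"])

# A condition on one column is a boolean Series, so it filters rows.
rows = df[df["a"] >= 300]
print(list(rows.index))  # ['Jan', 'Feb']

# A condition on the whole DataFrame masks cell by cell,
# which gives the same result as df.where(condition).
masked = df[df >= 300]
print(masked.equals(df.where(df >= 300)))  # True
```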

In closing

This is my summary of the bare minimum to know before studying pandas in earnest.

Once I have these memorized, I think I will finally be at the starting line.

I am a pandas beginner myself, so I will keep doing my best to study.

[PYTHON] Getting Started with pandas: Basic Knowledge to Remember First
