pandas has many operations to remember, so for now I just want to get these basics down. This is my personal summary.
If you are about to start using pandas, I hope it serves as a reference.
- **Series**: Represents one-dimensional data.
- **DataFrame**: Represents two-dimensional data.
- **Panel**: Represents three-dimensional data. (Panel was removed in pandas 0.25, so only Series and DataFrame matter in current versions.)
In mathematical terms, a Series corresponds to a vector and a DataFrame to a matrix.
If a program calls pd.DataFrame(), you can read it as "two-dimensional data is being created here."
Import pandas as follows.
test.py
import pandas as pd
All of the code below omits import pandas as pd, but assume it has been imported.
Let's create a Series first.
test.py
cp = [100,200,300,400]
print(pd.Series(cp))
# 0 100
# 1 200
# 2 300
# 3 400
# dtype: int64
Unlike a list, a Series is printed with an index that is added automatically. You can also define this index yourself.
test.py
cp = [100,200,300,400]
cp_index = ["Jan","Feb","Mar","Apr"]
print(pd.Series(cp, index=cp_index))
# Jan 100
# Feb 200
# Mar 300
# Apr 400
# dtype: int64
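As a small sketch of why a custom index is useful, the labels can then be used to look up values. The Series below is the same as above; the variable name s is only for illustration.

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=["Jan", "Feb", "Mar", "Apr"])

# Look up a value by its index label
print(s["Feb"])

# The same element can still be reached by position via iloc
print(s.iloc[1])

# The index itself is available as an attribute
print(list(s.index))
```

Both lookups return the same element, one addressed by label and one by position.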
Next, create a DataFrame.
test.py
cp = {
"a": [400,300,200,100],
"b": [100,200,300,400]
}
cp_index = ["Jan","Feb","Mar","Apr"]
print(pd.DataFrame(cp, index=cp_index))
# a b
# Jan 400 100
# Feb 300 200
# Mar 200 300
# Apr 100 400
The DataFrame has the table shape you often see in Excel.
A DataFrame has column labels (a and b) in addition to the index. That is why the variable cp is defined as a dictionary: each key becomes a column name, and each list becomes that column's values.
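To check how the dictionary maps onto the table, it may help to inspect a few attributes. This is only a quick sketch using the same data as above.

```python
import pandas as pd

cp = {"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]}
df = pd.DataFrame(cp, index=["Jan", "Feb", "Mar", "Apr"])

print(df.shape)               # (rows, columns)
print(list(df.columns))       # column labels come from the dictionary keys
print(list(df.index))         # row labels come from the index argument
print(type(df["a"]).__name__) # a single column is a Series
```

In other words, a DataFrame can be thought of as several Series that share one index.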
You can also pass the rows as a nested list and specify the index and columns explicitly:
test.py
df=pd.DataFrame([[400,100],
[300,200],
[200,300],
[100,400]],
index=['Jan', 'Feb','Mar','Apr'],
columns=["a","b"])
print(df)
# a b
# Jan 400 100
# Feb 300 200
# Mar 200 300
# Apr 100 400
In practice, however, you will usually read data from a csv file rather than typing it in yourself.
So next I will show how to read data from a csv file.
sample.csv
,a,b
Jan,400,100
Feb,300,200
Mar,200,300
Apr,100,400
Use read_csv() to read the csv file.
test.py
df=pd.read_csv("sample.csv",index_col=[0])
print(df)
# a b
# Jan 400 100
# Feb 300 200
# Mar 200 300
# Apr 100 400
For index_col, specify the position of the column to use as the index. Here it is given as a list, [0], meaning the first column.
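To see what index_col changes, here is a sketch that reads the same csv content with and without it. To stay self-contained it reads from an in-memory string through io.StringIO instead of the sample.csv file; that substitution is my own and not part of the original example.

```python
import io
import pandas as pd

# Same content as sample.csv, kept in memory so the example runs anywhere
csv_text = ",a,b\nJan,400,100\nFeb,300,200\nMar,200,300\nApr,100,400\n"

# With index_col, the first column becomes the row labels
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(with_index.index))
print(list(with_index.columns))

# Without index_col, the first column stays a regular column and
# pandas generates a default integer index
without_index = pd.read_csv(io.StringIO(csv_text))
print(list(without_index.index))
print(list(without_index.columns))
```

Because the header of the first column is empty, pandas names it "Unnamed: 0" when it is not used as the index.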
The code below uses the same df as above.
↓ Extract the first 3 rows
test.py
print(df.head(3))
# a b
# Jan 400 100
# Feb 300 200
# Mar 200 300
↓ Extract a column by name
test.py
print(df["a"])
# Jan 400
# Feb 300
# Mar 200
# Apr 100
# Name: a, dtype: int64
↓ Extract rows with a positional slice
test.py
print(df[1:3])
# a b
# Feb 300 200
# Mar 200 300
When extracting from a DataFrame, loc, iloc, and ix are often used. All three serve the same purpose: extracting data by specifying rows and columns.
So what is the difference?
loc
loc specifies rows and columns by **label name**.
test.py
print(df.loc[["Feb","Mar"]])
# a b
# Feb 300 200
# Mar 200 300
print(df.loc[:,["a"]])
# a
# Jan 400
# Feb 300
# Mar 200
# Apr 100
By the way, the colon in df.loc[:, ["a"]] means "extract all rows".
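One detail worth knowing about loc is that the shape of the result depends on whether you pass a list of labels or a single label. A minimal sketch with the same df:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

as_frame = df.loc[:, ["a"]]  # list of labels: result is a DataFrame
as_series = df.loc[:, "a"]   # single label: result is a Series
value = df.loc["Feb", "a"]   # row label and column label: a single value

print(type(as_frame).__name__)
print(type(as_series).__name__)
print(value)
```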
iloc
iloc specifies rows and columns by **integer position** (counting from 0).
test.py
print(df.iloc[[1,3]])
# a b
# Feb 300 200
# Apr 100 400
print(df.iloc[:,[0]])
# a
# Jan 400
# Feb 300
# Mar 200
# Apr 100
print(df.iloc[[1,3],[0]])
# a
# Feb 300
# Apr 100
print(df.iloc[1:3])
# a b
# Feb 300 200
# Mar 200 300
The i in iloc stands for integer: it selects by integer position.
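A difference between loc and iloc that is easy to trip over is how slices treat the end point. The sketch below, using the same df, selects the same two rows both ways.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

# A label slice with loc includes the end label ("Mar" is part of the result)
by_label = df.loc["Feb":"Mar"]

# A position slice with iloc excludes the stop position (row 3 is not included)
by_position = df.iloc[1:3]

print(by_label.equals(by_position))

# A single cell by position: row 1, column 0
print(df.iloc[1, 0])
```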
ix
With ix, you could specify rows and columns either by label name or by integer position.
However, when the index or the columns are themselves integers, it becomes ambiguous whether a number means a label or a position. For this reason ix was deprecated and then removed in pandas 1.0, so use loc and iloc instead.
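If you want to mix positions and labels without ix, one possible approach is to convert one side first and then use loc or iloc. This is a sketch of that idea, not the only way to do it; the variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

# Rows by position, column by label: turn positions into labels, then use loc
row_labels = df.index[[1, 3]]
via_loc = df.loc[row_labels, "a"]
print(via_loc)

# Same selection the other way: turn the column label into a position, then use iloc
col_position = df.columns.get_loc("a")
via_iloc = df.iloc[[1, 3], col_position]
print(via_iloc)
```

Both selections return the values of column a for Feb and Apr.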
Here are some operations that I expect to use often.
test.py
# Compare every element against a condition to get True/False
print(df >= 300)
# a b
# Jan True False
# Feb True False
# Mar False True
# Apr False True
# Replace values less than 300 with NaN (the columns become float because of NaN)
print(df[df >= 300])
# a b
# Jan 400.0 NaN
# Feb 300.0 NaN
# Mar NaN 300.0
# Apr NaN 400.0
# Fill the missing values (NaN) with 0
print(df[df >= 300].fillna(0))
# a b
# Jan 400.0 0.0
# Feb 300.0 0.0
# Mar 0.0 300.0
# Apr 0.0 400.0
# Fill the missing values (NaN) with the mean of each column of the original df
print(df[df >= 300].fillna(df.mean()))
# a b
# Jan 400.0 250.0
# Feb 300.0 250.0
# Mar 250.0 300.0
# Apr 250.0 400.0
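One thing to watch in the last example: df.mean() is the mean of the original values, not of the values that survived the condition. The sketch below compares the two; it also shows DataFrame.where, which I believe gives the same result as df[df >= 300].

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [400, 300, 200, 100], "b": [100, 200, 300, 400]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

# Boolean indexing and where() both keep matching values and put NaN elsewhere
masked = df[df >= 300]
same = df.where(df >= 300)
print(masked.equals(same))

# Inserting NaN turns the integer columns into float columns
print(masked.dtypes)

# Mean of the original data vs. mean of the values that passed the condition
# (mean() skips NaN by default)
print(df.mean())
print(masked.mean())

# Fill with the mean of the surviving values instead
filled = masked.fillna(masked.mean())
print(filled)
```

Which mean is appropriate depends on what the NaN values are supposed to represent.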
That is everything I wanted to summarize before studying pandas in earnest.
Having learned this much, I think I have finally reached the starting line.
I'm a pandas beginner too, so I'll keep studying.