[PYTHON] Wow Pandas Let's learn a lot


Introduction

Pandas is a library that provides functions to support data analysis in the Python programming language [^ wiki]. I find Pandas complicated even among Python libraries [^ atm]. However, it offers so much freedom that analyzing data without Pandas is unthinkable for a data analyst. So I would like to explain it up to the point where "if you understand this much, you can do anything (with the help of other sites)" [^ title].

[^ wiki]: See https://ja.wikipedia.org/wiki/Pandas

[^ atm]: Although the atmosphere is such that nobody can admit it is difficult.

[^ title]: Pandas is not a language, but the title fits nicely.

How to approach it

1. Preparation

2. Introduction to Pandas

After working through these two steps, you should be on track and able to look things up on your own (for example, you should be able to pick up group by and the like smoothly).

For example, with Numpy

arr = np.arange(12) # arr is a one-dimensional ndarray
arr = arr.reshape(3,4) # arr is a two-dimensional ndarray
# In arr[i, j], the first index is the row and the second is the column
arr[:2] # 2D ndarray
arr[:2, 0] # 1D ndarray
arr[:, arr[0] > 2] # 2D ndarray

With Pandas

pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
       'Ohio' : {2000 : 1.5, 2001 : 1.7}}
df = DataFrame(pop) # DataFrame(2D)
df[df['Nevada'] > 2] # DataFrame(2D)
df.iloc[-1:]['Nevada'] # Series(1D)

What type does each of these expressions return? Once you are conscious of that and understand it, you are arguably halfway there.
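One way to build that awareness is simply to print the type and number of dimensions of each result. The following is a minimal sketch that reuses the arrays and the pop dict from the examples above; the variable names rows_2d, col_1d, filtered and last_nevada are only illustrative.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(3, 4)
rows_2d = arr[:2]        # slice of rows  -> 2D ndarray
col_1d = arr[:2, 0]      # slice + scalar -> 1D ndarray

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7}}
df = pd.DataFrame(pop)
filtered = df[df['Nevada'] > 2]        # boolean row filter -> DataFrame (2D)
last_nevada = df.iloc[-1:]['Nevada']   # one column         -> Series (1D)

# Print the type name and the number of dimensions of each result
for name, obj in [('rows_2d', rows_2d), ('col_1d', col_1d),
                  ('filtered', filtered), ('last_nevada', last_nevada)]:
    print(name, type(obj).__name__, obj.ndim)
```

Both ndarray and the Pandas objects expose ndim, so the same check works on either side.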


So, let's first summarize how indexing behaves for a 2D ndarray, and then move on to Pandas.

Preparation

import

import numpy as np # ndarray
# Needed to display matplotlib plots inline in Jupyter
%matplotlib inline
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
import pandas as pd

Numpy

Let's take a two-dimensional ndarray. To understand Pandas, there are two things to understand here:

Example

arr = np.arange(12).reshape(3,4) # arr is a two-dimensional ndarray (3 rows, 4 columns)
#array([[ 0,  1,  2,  3],
#       [ 4,  5,  6,  7],
#       [ 8,  9, 10, 11]])
# Get a one-dimensional ndarray
arr[1] # Index by a scalar value
## Returns a boolean for each element of row 1 (> 2)
arr[1] > 2 # array([ True,  True,  True,  True], dtype=bool)

# Get a two-dimensional ndarray
arr[0:2] # Slicing: extracts rows 0 and 1 (row 2 is not included)
arr>2 # Boolean array with the same shape as arr
arr[np.array([True, False, True])] # Boolean index: extracts rows 0 and 2
# arr[[True, False, True]] # Warning
arr[[0,2,1]] # Fancy index: index with an array of integers; extracts rows 0, 2, and 1 in that order

It's basically the same as a one-dimensional ndarray, but note only the pitfalls that are easy to fall into:

# To specify only the second argument, you cannot omit the first one. Pass the slice `:` as the first argument instead
arr[:, 1]

# If you pass fancy indexes as both the first and second arguments, the behavior is a little unintuitive.
## (Also note that the result is a one-dimensional ndarray!)
## Equivalent to np.array([arr[i,j] for i,j in zip([1,2], [0,1])]). # array([4, 9])
arr[[1,2], [0,1]]
## To get the 2D ndarray for the region of rows 1, 2 and columns 0, 1, use np.ix_:
arr[np.ix_([1,2], [0,1])]
# array([[4, 5],
#        [8, 9]])
# Boolean values for column 1 (> 2)
arr[:,1] > 2 # array([False,  True,  True], dtype=bool)

# Extract the rows whose value in column 1 is > 2
arr[arr[:,1] > 2] # Same as arr[np.array([False,  True,  True])]. (I haven't used it much personally)
# Also the same as arr[arr[:, 1] > 2, :].


arr[1] > 5 # array([False, False,  True,  True], dtype=bool)
arr[:, arr[1] > 5] # Extract the columns whose value in row 1 is > 5
# arr[:, np.array([False, False, True, True])] # Same as the line above
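The number of dimensions produced by each of the index patterns shown so far can be checked directly. The following is a small sketch that evaluates a handful of representative combinations and prints np.ndim for each; the names rows_bool, rows_fancy and results are only illustrative.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)
rows_bool = np.array([True, False, True])   # boolean index for the rows
rows_fancy = [0, 2]                         # fancy index for the rows

results = {
    "arr[1]": arr[1],
    "arr[1, 2]": arr[1, 2],
    "arr[0:2]": arr[0:2],
    "arr[0:2, 1]": arr[0:2, 1],
    "arr[0:2, 1:3]": arr[0:2, 1:3],
    "arr[rows_bool]": arr[rows_bool],
    "arr[rows_bool, 1]": arr[rows_bool, 1],
    "arr[rows_fancy, 1:3]": arr[rows_fancy, 1:3],
    "arr[[1, 2], [0, 1]]": arr[[1, 2], [0, 1]],
    "arr[np.ix_([1, 2], [0, 1])]": arr[np.ix_([1, 2], [0, 1])],
}
# np.ndim also works for the 0-dimensional scalar case
for expr, value in results.items():
    print(f"{expr:30s} -> {np.ndim(value)}d")
```

Passing a fancy index for both arguments collapses to 1D, while np.ix_ keeps the rectangular 2D region.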

In summary, the behavior of the index reference type in ndarray (2D) looks like this [^ summary]:

[^ summary]: This is a bit forced. The "None" column for the second argument refers to `arr[・]`. Parentheses, as in (1d), mean the pattern is not used much. 1d stands for a 1D ndarray and 2d for a 2D ndarray.

| First argument \ Second argument | None | Scalar | Slicing | Boolean index | Fancy index |
|---|---|---|---|---|---|
| None | - | | | | |
| Scalar | 1d | 0d | 1d | 1d | 1d |
| Slicing | 2d | 1d | 2d | 2d | 2d |
| Boolean index | 2d | 1d | 2d | (1d) | (1d) |
| Fancy index | 2d | 1d | 2d | (1d) | (1d) |

# A trap that is easy to confuse with something else: I want to multiply each element of arr by 4
> arr = [0,1,2,3]
> arr*4
[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

> np.arange(4)*4
array([ 0,  4,  8, 12])
# If you want to do the same without converting to NumPy, use a comprehension.
> [i*4 for i in range(4)]
[0, 4, 8, 12]

Introduction to Pandas

In NumPy, 1D and 2D data were both handled as the same ndarray, but in Pandas they are split into 1D => Series and 2D => DataFrame. Even though the names differ, DataFrame and Series cannot be treated separately, because you constantly go back and forth between 2D <=> 1D.

For example, you can extract a one-dimensional Series by specifying a single row or column of a DataFrame. Conversely, you can create a DataFrame by passing a list or dict of Series (1D) as an argument to the DataFrame (2D) constructor.

So, knowing whether a variable is one-dimensional or two-dimensional remains important even though the names change to Series and DataFrame.
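As a minimal sketch of this round trip, the following takes a small made-up DataFrame (the columns A and B are just placeholders), pulls out a column and a row as Series, and rebuilds a DataFrame from a dict of Series.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

col = df['A']      # one column -> Series (1D)
row = df.iloc[0]   # one row    -> Series (1D)

# dict of Series -> DataFrame (2D) again
rebuilt = pd.DataFrame({'A': df['A'], 'B': df['B']})

print(type(col).__name__, type(row).__name__, type(rebuilt).__name__)
```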

About Series

Creating a Series

Basically, I usually pass a dict or a list to the constructor. With a dict, you get a Series whose index is the dict keys.

# Example of passing a dict
dic = {'word' : 470, 'camera' : 78}
Series(dic)
# A Series is often generated by combining zip and dict
# (words and frequency are assumed to be sequences of equal length):
Series(dict(zip(words, frequency)))

Index reference

Indexing a Series is an extension of indexing a one-dimensional ndarray. The difference is that index labels can also be used as the index argument.

ser = Series(np.random.randn(5), index = list('ABCDE'))
#A    1.700973
#B    1.061330
#C    0.695804
#D   -0.435989
#E   -0.332942
#dtype: float64

# Indexing and slicing
ser[1] # Extract row 1, i.e. 'B' (0-dimensional, type = numpy.float64); recent pandas versions prefer ser.iloc[1] for positional access
ser['A'] # Extract row 'A' (type = numpy.float64)
ser[1:3] # Extract rows 1 and 2 (Series, 1D)
ser[-1:] # Extract the last row
ser[:-1] # Extract all rows except the last one
ser[[1,2]] # Extract rows 1 and 2 (fancy index)
ser[['A', 'B']] # (Fancy) You can also give index labels as strings
ser > 0 # A Series (1D) in which each element is a boolean
ser[ser > 0] # Index by the boolean index (ser > 0)

# Indexing can be used for both reading and writing, so you can also assign the right-hand value only to the matching elements, as shown below.
# The technique of putting a condition on the left-hand side is often used with DataFrame.
ser[ser > 0] = 0
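When it matters whether an integer means a position or a label, the explicit accessors remove the ambiguity. The sketch below uses fixed made-up values instead of random ones so the results are reproducible; it is an illustration of the same operations, not a replacement for the examples above.

```python
import pandas as pd

ser = pd.Series([1.5, -0.5, 0.7, -0.4, 0.3], index=list('ABCDE'))

second = ser.iloc[1]     # by position: the element labelled 'B'
same = ser.loc['B']      # by label: the same element
positive = ser[ser > 0]  # boolean index keeps only matching elements

clipped = ser.copy()
clipped[clipped < 0] = 0 # condition on the left-hand side: overwrite matches only
```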

About DataFrame

Creating a DataFrame

# When both the outer and inner containers are dicts
pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
       'Ohio' : {2000 : 1.5, 2001 : 1.7}}
df = DataFrame(pop)
#      Nevada  Ohio
#2000     NaN   1.5
#2001     2.4   1.7
#2002     2.9   NaN

# When the outer container is a dict and the inner values are Series
# Here df1 and df2 are assumed to be existing DataFrames (so df1['name'] and df2['address'] are Series)
## Column names are ['typeA', 'typeB'], index labels are [0,1,2,3]
dfA = DataFrame({'typeA' : df1['name'], 'typeB' : df2['address']})
## Index labels are [0,1,2,3], column names are ['name', 'address'] (the T attribute transposes)
dfB = DataFrame([df1['name'], df2['address']]).T

We often combine list with the builtin zip function to create a DataFrame:

dict(zip([1,2,3], [4,5,6,7])) # {1: 4, 2: 5, 3: 6} => cannot be converted to a DataFrame
list(zip([1,2,3], [4,5,6,7])) # [(1, 4), (2, 5), (3, 6)] => can be converted to a DataFrame (outer: list, inner: tuples)
pd.DataFrame(list(zip([1,2,3], [4,5,6,7]))) # => OK!
df = DataFrame(np.arange(12).reshape(3,4), columns = list('ABCD'))
print(df)
#    A  B   C   D
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11
DataFrame(Series({'word' : 470, 'camera' : 78}), columns = ['frequency'])

Creating a DataFrame from a Series is discussed in detail in the data addition section of the beginner edition.

DataFrame index reference

In Pandas, a DataFrame can be indexed with df[・], df.loc[<row specification>], df.loc[<row specification>, <column specification>], df.iloc[<row specification>], or df.iloc[<row specification>, <column specification>]. Of these, df[・] behaves quite confusingly, as follows.

# Used often
#dfA[1] # Runtime error!! A column cannot be retrieved by integer position
dfA['typeA'] # Extracts the 'typeA' column as a Series (1D)
dfA[['typeB', 'typeA']] # Extracts the typeB and typeA columns, in that order, as a DataFrame (2D)
dfA['typeA'] > 3 # 1D Series (each element is a boolean)


# A little confusing (though I use it often personally)
dfA[dfA['typeA'] > 3] # Extracts the rows of dfA whose 'typeA' value is greater than 3
# dfA.loc[dfA['typeA'] > 3] # If in doubt, use this


# The following is quite confusing, so I don't use it much.
dfA[1:] # Extracts row 1 onward as a DataFrame (2D) (note that this selects rows)
# Rather than dfA[1:], I would write the following:
dfA.loc[1:] # Makes it explicit that rows are being specified. Or dfA.loc[1:, :]

df.loc is NumPy-style indexing in which you can also specify label names. So basically, you can write it in the same spirit as NumPy indexing.

Notes on df.loc

However, there are two things to keep in mind when using df.loc (they are quite important and easy to trip over).

One is that df.loc gives priority to label names, so even when you pass an integer as the index, it does not refer to the row number; the row with the matching label is extracted. For example, when you sort and then want to extract the first rows, it is quite easy to have an accident:

dic = list(zip([0,3,5,6], list('ADCB')))
dfA = DataFrame(dic, columns = ['typeA', 'typeB'])
#   typeA typeB
#0      0     A
#1      3     D
#2      5     C
#3      6     B
dfA = dfA.sort_values(by = 'typeB')
#   typeA typeB
#0      0     A
#3      6     B
#2      5     C
#1      3     D
dfA.loc[1] # I want the row at position 1 (i.e. the second row), but with loc the row whose index label is 1 is extracted:
#typeA    3
#typeB    D
#Name: 1, dtype: object


## To prevent such a tragedy, use df.iloc, which gives priority to the row number.
## (The row "3      6     B" is extracted)
dfA.iloc[1]

iloc is often used after extracting rows. (When the index is no longer consecutive, a row cannot be referenced by loc[number].)

df = df[df['A'] == name]
df.iloc[0]['B'] # It feels a little clumsy...

The other is a trap that is easy to fall into with an integer index: if you want to extract the last row, passing a negative value to df.loc fails. Because label names take priority, it complains that there is no label -1. Here too, use df.iloc to make it explicit that you are extracting by row number.

# dfA.loc[-1] : NG
dfA.iloc[-1] # OK (the last row is extracted as a Series (1D))
dfA.iloc[-1:] # OK (the last row is extracted as a DataFrame (2D))

Conversely, df.iloc accepts only numbers, so if you want to specify rows by row number and columns by label name, write it as follows.

df.iloc[i]['A'] # Writing it like this works
# iloc can only specify columns by number:
# Location based indexing can only have
# [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
# res *= df.iloc[i, 'A'] # error
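For reference, there are also single-indexer ways to combine a row position with a column label. The sketch below is one possible approach on a made-up DataFrame; it converts the label to a position with df.columns.get_loc, or the position to a label with df.index[i], so that one accessor handles both axes.

```python
import pandas as pd

# Made-up data with a non-consecutive integer index
df = pd.DataFrame({'A': [10, 20, 30], 'B': list('xyz')}, index=[7, 5, 9])
i = 1

v1 = df.iloc[i]['A']                       # the two-step form used above
v2 = df.iloc[i, df.columns.get_loc('A')]   # positions on both axes
v3 = df.loc[df.index[i], 'A']              # labels on both axes

print(v1, v2, v3)
```

The single-indexer forms are also the ones that can safely appear on the left-hand side of an assignment.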

To summarize DataFrame indexing: df.loc gives priority to label names, and df.iloc gives priority to row and column numbers.

If you keep just these two points in mind, you can extract data as comfortably as with NumPy indexing. It would be really easy if df.loc were the only one to remember, but integer indexes are so common in practice that you cannot avoid df.iloc.

Supplement

With loc and iloc, you cannot assign to the result (use it as an lvalue) when the indexer is applied twice, as shown below. (The value you want to change in the original DataFrame is not modified.)

# Warning: "A value is trying to be set on a copy of a slice from a DataFrame"
df.loc[5]['colA'] # Cannot be used as an lvalue

# No problem! (Because it refers to the original)
df.loc[k, 'non_view_rate'] *= mult
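A minimal runnable sketch of the safe form follows, on a made-up DataFrame whose index labels start at 5. Only the single-indexer assignment is executed; the chained form is left as a comment because it is the pattern to avoid.

```python
import pandas as pd

df = pd.DataFrame({'colA': [1, 2, 3], 'colB': [4, 5, 6]}, index=[5, 6, 7])

# Chained form to avoid: df.loc[5]['colA'] = 100 (may write to a temporary copy)

# Single indexer: row label and column label in one loc call
df.loc[5, 'colA'] = 100
df.loc[6, 'colB'] *= 10

print(df)
```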

Pandas Beginner Edition

So far we have looked at Pandas indexing. The hardest part is probably behind us, but there are still some confusing areas, such as adding to and modifying a DataFrame. For the basic usage of each function, see [Introduction to data analysis by Python --- Data processing using NumPy, pandas](https://www.amazon.co.jp/dp/4873116554); here I would like to summarize things in reverse-lookup style.

How to shape it using Pandas

Add (concatenate)

ser = Series([1,2,3], index = list('ABC'))
#A    1
#B    2
#C    3
#dtype: int64

A Series like this is written as Series(3*1) below. The index labels are all the same (['A','B','C']). Let's see how to concatenate various patterns of data.

(Series(3*1) <- Series(3*1)) -> DataFrame

# s1, s2, serA, serB are assumed to be Series like ser above
DataFrame([s1, s2]) # Using the constructor (each Series becomes a row)
df = DataFrame([s1, s2], index = list('AB')).T
pd.concat([s1, s2], axis = 1) # Side by side as columns; to stack downwards, use concat without axis = 1
# serA.append(serB) worked in older pandas; append was removed in pandas 2.0, so use concat:
pd.concat([serA, serB])
# If you want the index to be renumbered from 0:
pd.concat([s1, s2]).reset_index(drop = True) # Renumber the index
df1 = DataFrame(serA)
df2 = DataFrame(serB)
ndf = df1.join(df2, how = 'outer', lsuffix = 'A', rsuffix = 'B') # Something like this, I think
# merge can only combine two at a time.
ndf = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
# Example join result for one-row DataFrames with columns 0, 1, 2:
#    0A  1A  2A  0B  1B  2B
# 0   1   2   3   4   5   6
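The different ways of combining two Series can be compared by looking at the shape of each result. This is a small sketch with two made-up Series s1 and s2 that share the index ['A', 'B', 'C']; the result names are only illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list('ABC'))
s2 = pd.Series([4, 5, 6], index=list('ABC'))

side_by_side = pd.concat([s1, s2], axis=1)           # DataFrame(3*2): one column per Series
stacked = pd.concat([s1, s2])                        # Series of length 6, labels repeat
renumbered = pd.concat([s1, s2], ignore_index=True)  # Series of length 6, index 0..5
as_rows = pd.DataFrame([s1, s2])                     # DataFrame(2*3): one row per Series

print(side_by_side.shape, stacked.shape, as_rows.shape)
```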

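(DataFrame(n*3) <- Series(3*1)) => DataFrame

# Can only append a Series if ignore_index=True or if the Series has a name
# (DataFrame.append was removed in pandas 2.0; the pd.concat form below replaces it)
# df.append(serA, ignore_index = True)
pd.concat([df, serA.to_frame().T], ignore_index = True)

cols = ['colA', 'colB', 'colC']
res_df = DataFrame(columns = cols)
res_df = pd.concat([res_df, Series([1,2,3], cols).to_frame().T], ignore_index = True)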
...

I want to stack multiple Series

(Series(3*1) + Series(3*1) + Series(3*1) + Series(3*1)) -> DataFrame(4*3)

df = DataFrame([serA, serB, serC, serD])
# If you want DataFrame(3*4), just add .T
df = DataFrame([serA, serB, serC, serD]).T

I want to append a row (Series) at the bottom of an existing DataFrame (or append a DataFrame at the bottom)

# Add one row
df.loc['newrow'] = 0
pd.concat([df, serA.to_frame().T], ignore_index = True) # was df.append(serA, ignore_index = True) before pandas 2.0
# Add multiple rows
pd.concat([df1, df2]) # was df1.append(df2)
# Add multiple DataFrames (pass a list)
pd.concat([df1, df2, df3, df4]) # was df1.append([df2, df3, df4])

# Add one column
df['newcol'] = 0
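Here is a runnable sketch of these row and column additions using pd.concat, which works on both older and current pandas versions. The column names and values are made up for illustration.

```python
import pandas as pd

cols = ['colA', 'colB', 'colC']
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=cols)
new_row = pd.Series([7, 8, 9], index=cols)

# One Series -> one-row DataFrame, then stack it under df
with_row = pd.concat([df, new_row.to_frame().T], ignore_index=True)

# Stack several DataFrames in one call
stacked = pd.concat([df, df, df], ignore_index=True)

# Adding a column is a plain assignment
with_col = df.copy()
with_col['newcol'] = 0

print(with_row)
```

When many rows are added in a loop, collecting them in a list and calling pd.concat (or the DataFrame constructor) once at the end is usually cheaper than growing the DataFrame row by row.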

The indexes are almost the same, and I want to attach a df on the right

# Use an outer join (outer) so that index entries without a match are kept (filled with NaN).
df1.join(df2, how = 'outer')
df1.join([df2, df3], how = 'outer')
# merge allows finer control, but it is limited to merging two DataFrames:
df1.merge(df2, how = 'outer')
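The effect of the outer join is easiest to see with two small DataFrames whose indexes only partly overlap. The following sketch uses made-up columns x and y and shows that join and an index-based merge keep every label and fill the gaps with NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3]}, index=list('ABC'))
df2 = pd.DataFrame({'y': [10, 20, 30]}, index=list('BCD'))

joined = df1.join(df2, how='outer')
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')

print(joined)
```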

Other additions

For other details, I recommend the official page http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.DataFrame.merge.html or http://sinhrks.hatenablog.com/entry/2015/01/28/073327.

The latter covers the following, and is easy to understand because it is written with figures and examples:

+ Simple vertical concatenation: DataFrame.append
+ Flexible concatenation: pd.concat
+ Join by column value: pd.merge
+ Join by index: DataFrame.join (a simpler version of merge)


Modify the contents of the DataFrame

I want to rename all index and columns

# Rename the index
df.index = ['one', 'two', 'three']
# Renumber the index
df = df.reset_index(drop = True) # Renumber the index from 0 (reset_index is non-destructive, so assign the result)

# Rename the columns
## After creating and editing a table, the columns may not be in the expected order,
## so it is safer to specify the column order explicitly.
df = df[['old_a', 'old_b', 'old_c']]
df.columns = ['new_a', 'new_b', 'new_c']

# Or use df.rename
df = df[['old_a', 'old_b', 'old_c']] # Optional if you don't care about the order of the columns
# rename is not a destructive method, so the result must be assigned. Specify the columns parameter (note that it is not the axis parameter).
df = df.rename(columns = {'old_a' : 'new_a', 'old_b' : 'new_b', 'old_c' : 'new_c'})

Note 1) You can also use df.rename when you want to change only some of the index and column names. (Pass a dict mapping each old name to its new name as the index or columns parameter. Note that the axis parameter is not what is used here, and that the parameter is columns (with an s), not column.)

Note 2) reindex rearranges rows to match a given index; it does not rename the index. set_index creates a new object that uses one or more columns as the index, as in df.set_index(['c1','c0']). Note that this is not a method for renaming the index either. reset_index converts the (possibly hierarchical) index back into columns, so set_index and reset_index are inverses of each other.
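The differences between these methods can be sketched on a tiny made-up DataFrame: rename changes selected names, set_index moves a column into the index, and reset_index moves it back.

```python
import pandas as pd

df = pd.DataFrame({'old_a': [1, 2], 'old_b': [3, 4]})

# Partial rename of one column and one index label (returns a new object)
renamed = df.rename(columns={'old_a': 'new_a'}, index={0: 'first'})

indexed = df.set_index('old_a')    # column 'old_a' becomes the index
restored = indexed.reset_index()   # the index goes back to being a column

print(renamed)
print(indexed)
```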

Substitution

# Look at column 'A', select the rows where it is 'wrong', and change column 'B' of those rows to 'sth'
df.loc[df['A'] == 'wrong', 'B'] = 'sth'

Extraction

http://naotoogawa.hatenablog.jp/entry/2015/09/12/PandasのDataFrameの嵌りどころ

# Enclose each boolean index in parentheses
df = df[(df['A'] > 0) | (df['B'] > 0)]
# apply takes as its first argument a function (lambda expression) that receives a Series
# map takes as its first argument a function (lambda expression) that receives an element
df = df[df['A'].map(lambda d : d in listA)]
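The following sketch runs both filters on made-up data. It also shows Series.isin, which is offered here as an alternative spelling of the membership test with map; the names either_positive, via_map and via_isin are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, 0], 'B': [-1, -1, 5, 0]})
listA = [1, 3]

# Rows where A or B is positive (each condition in its own parentheses)
either_positive = df[(df['A'] > 0) | (df['B'] > 0)]

# Rows whose A value is contained in listA, written two ways
via_map = df[df['A'].map(lambda d: d in listA)]
via_isin = df[df['A'].isin(listA)]

print(either_positive)
```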

Delete

Delete rows and columns with df.drop (non-destructive). The axis argument selects whether rows or columns are deleted.

df = df.drop("A", axis=1)
# When the columns are 'A', 'B', ..., 'F' and you want to delete columns 'C' through 'F', it is more common to do the following:
df = df[['A', 'B']]

Get rid of NA

See http://nekoyukimmm.hatenablog.com/entry/2015/02/25/222414.

Boolean index reference

# Returns a DataFrame
df.apply(lambda ser: ser % 2 == 0)
df.applymap(lambda x: x % 2 == 0) # renamed to DataFrame.map in recent pandas versions
df.isin([1,2])
df.where(df % 3 == 0, -df)
df = df[~df.index.duplicated()] # Remove duplicated index labels (drops the second and later occurrences)
# Returns a Series
df['goal'] == 0
df.apply(lambda ser : (ser > 0).any())
df['A'].map(lambda x : x > -1)
serA > serB # Series
~bool_ser # Invert each element of a boolean Series (use ~, not unary minus)
# The second argument is used only for the elements where the first argument is False
df['A'].where(df['A'] > 0, -df['A']) # Series version of abs (where the condition fails, negate the value; a negative value thus becomes positive)
# Returns a scalar
(df['goal'] == 0).all() # True if every element satisfies the condition
(df['cdf(%)'] < 90).sum() # Count the elements that satisfy the condition

Patterns that may introduce NA values

It's quite common to get stuck on NA, so here is a note on where NA values can be generated.

dic = dict(zip(list('ABCD'), [3,4,6,2])) #Generate dict
ser = Series(dic, index = list('ABCDE'))
# Label 'E', which is not in dic, becomes NaN
#A    3.0
#B    4.0
#C    6.0
#D    2.0
#E    NaN
#dtype: float64

(Example)

pop = {'Nevada' : {2001 : 2.4, 2002 : 2.9},
       'Ohio' : {2000 : 1.5, 2001 : 1.7}}
df = DataFrame(pop)

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   NaN

Example omitted

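# With recent pandas, df.loc raises KeyError when the list contains labels that do not exist, so use reindex:
df.reindex(index = [2002, 2001, 1999], columns = ['Alaska', 'Nevada'])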

      Alaska  Nevada
2002     NaN     2.9
2001     NaN     2.4
1999     NaN     NaN

Note) df['non_exists'] and df.loc[:, 'non_exists'] (specifying a name that is not among the columns) raise an error.

How to get rid of NA value

# df has gaps in its index (index.name: B_idx, columns = ['A']) => I want a consecutive index (0-89), with missing values filled with 0
old_df = DataFrame(index = range(90), columns = ['A'])
new_df = old_df.combine_first(df).fillna(0) # index.name disappears
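As an alternative sketch of the same gap-filling idea, reindex with fill_value can produce the consecutive index in one step. The data below is made up (only labels 2 and 4 exist) and the range is shortened to 0-5 to keep the output small.

```python
import pandas as pd

# Hypothetical sparse data: only labels 2 and 4 are present
df = pd.DataFrame({'A': [5.0, 7.0]}, index=pd.Index([2, 4], name='B_idx'))

# Expand to the consecutive labels 0..5 and fill the missing rows with 0
filled = df.reindex(range(6), fill_value=0)

print(filled)
```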

Manipulating strings

String operations on Series are used quite often, even if they are not flashy. They can be applied not only to elements but also to index labels and column names!

See http://pandas.pydata.org/pandas-docs/stable/text.html (especially at the bottom) for more information. If you want to manipulate strings in a DataFrame, looking here usually solves it.

You can use it, for example, when you want to extract only the rows that match a certain regular expression.

# Update df by extracting only the rows whose 'A' column starts with a lowercase letter
r = '^[a-z]'
df = df[df['A'].str.match(r)] # df['A'].str.match(r) is a boolean index
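A self-contained version of this filter on made-up strings looks like the following; mask is the intermediate boolean Series, and the same .str accessor is also applied to the column names as a second illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['apple', 'Banana', 'cherry', '42'], 'B': [1, 2, 3, 4]})

r = '^[a-z]'
mask = df['A'].str.match(r)   # boolean Series: True where the value starts with a-z
lower_start = df[mask]

# The .str accessor also works on an Index, for example the column names
lower_cols = df.columns.str.lower()

print(lower_start)
```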

References

[1] [Introduction to data analysis using Python --- Data processing using NumPy and pandas](https://www.amazon.co.jp/dp/4873116554)

[2] http://sinhrks.hatenablog.com/entry/2015/01/28/073327

[3] Documentation http://pandas.pydata.org/pandas-docs/stable/api.html
