Pandas is an indispensable library for data analysis in Python. However, it presents a high hurdle for beginners ... I struggled a lot with it myself, so I will try to summarize it starting from the basics. I'm not an advanced Pandas user, so I'd appreciate your comments and advice. This article is planned as a three-part series. (It's only a plan, so it may well change ...)
Let's get started (* ^ ▽ ^ *)
Please note that the following imports are omitted from the code samples in the rest of this article.
import pandas as pd
import numpy as np
Pandas has two main classes: Series and DataFrame. Briefly, a Series handles one-dimensional data (a vector), while a DataFrame handles two-dimensional data (a matrix). Let's look at Series first. Official reference: pandas.Series
sr = pd.Series({'taro': 'kyoto', 'jiro': 'osaka', 'saburo': 'nara'})
print(sr.head())
In the above code, the dictionary keys become the row labels (index) of the Series, and the dictionary values become the values of the Series.
sr = pd.Series(['kyoto', 'osaka', 'nara'], index=['taro', 'jiro','saburo'] )
print(sr.head())
You can set the row labels (index) explicitly as shown above. If you do not set them, no error occurs; instead, the integers 0, 1, 2, ... are assigned automatically.
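As a quick check of the default index behavior, here is a minimal sketch. The variable names sr_default and sr_labeled are just illustrative names chosen for this example.

```python
import pandas as pd

# No index given: pandas assigns 0, 1, 2, ... automatically
sr_default = pd.Series(['kyoto', 'osaka', 'nara'])
print(sr_default.index.tolist())  # [0, 1, 2]

# Explicit index: the labels we pass are used as-is
sr_labeled = pd.Series(['kyoto', 'osaka', 'nara'], index=['taro', 'jiro', 'saburo'])
print(sr_labeled.index.tolist())  # ['taro', 'jiro', 'saburo']
print(sr_labeled['jiro'])         # osaka
```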
Series has many more features, but I will omit them here because most of them overlap with DataFrame. **However, please note that usage can differ between DataFrame and Series even when the attribute name is the same.** (I think this is one of the things that makes Pandas difficult ...)
DataFrame is the heart of Pandas. Let's take a look at the official reference for DataFrame: pandas.DataFrame
The following three arguments of the pandas.DataFrame constructor are important: data, index, and columns.
Let's take a look at the actual program.
First, let's create a DataFrame from column data. The dictionary keys are used as the column labels (columns), and each value (given as an array) supplies that column's entries, one per row. (* An error occurs if the values are not given as arrays.) ↑ Apparently, if you set the index explicitly, you do not get an error even when the values are not arrays. (It's complicated ...)
data = {
'name': ['taro', 'jiro', 'saburo'],
'address': ['kyoto', 'osaka', 'nara'],
'birth': ['2020-01-01T12:00:00', '2020-02-11T12:00:00', '2020-03-22T12:00:00']
}
df = pd.DataFrame(data = data)
print(df.head())
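The note about non-array values can be tried out with a small sketch like the following. The names scalar_data, raised, and df_one_row are made up for this example; the point is only that a dictionary of plain scalar values is rejected unless an index is passed.

```python
import pandas as pd

scalar_data = {'name': 'taro', 'address': 'kyoto'}

# All values are scalars and no index is given -> pandas raises ValueError
try:
    pd.DataFrame(data=scalar_data)
    raised = False
except ValueError:
    raised = True
print(raised)

# With an explicit index, the same dictionary is accepted as a one-row table
df_one_row = pd.DataFrame(data=scalar_data, index=['a'])
print(df_one_row)
```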
Next, create a DataFrame from row data. No error occurs if you omit columns, but in that case the columns are automatically labeled with numbers such as 0, 1.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, columns = ['name', 'address', 'birth'])
print(df.head())
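For comparison, here is a minimal sketch of what happens when columns is left out. The names rows and df_no_columns are illustrative only.

```python
import pandas as pd

rows = [
    ['taro', 'kyoto'],
    ['jiro', 'osaka']
]

# Neither index nor columns is given, so both are numbered from 0
df_no_columns = pd.DataFrame(data=rows)
print(df_no_columns.columns.tolist())  # [0, 1]
print(df_no_columns.index.tolist())    # [0, 1]
```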
By the way, if you want to set both the row labels (index) and the column labels (columns), pass both index and columns as follows.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.head())
You will often want to extract specific columns from the data. loc, which will be described later, is more flexible, but you can also narrow down the columns by indexing the DataFrame directly. **You don't need to memorize this method at first, because loc is more refined (and learning both at once is confusing).**
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, columns = ['name', 'address', 'birth'])
print(df['name'])
The above code extracts the "name" column. You will notice that the output no longer looks like a table. This is because the result of extracting a single column is a Series, not a DataFrame.
If you want to extract multiple columns, do as follows.
df[['name', 'address']]
If you extract multiple columns, the result is a DataFrame instead of a Series.
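The difference in return types can be confirmed with a short sketch. The variable names single, multiple, and one_col_frame are just for illustration. Note that a list containing only one column label still gives a DataFrame.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, columns=['name', 'address', 'birth'])

single = df['name']                 # one label -> Series
multiple = df[['name', 'address']]  # list of labels -> DataFrame
one_col_frame = df[['name']]        # list with one label -> still a DataFrame

print(type(single).__name__)         # Series
print(type(multiple).__name__)       # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```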
Rows can be extracted as well as columns, using slices. **You don't need to memorize this method at first, because loc, which will be described later, is more refined (and learning both at once is confusing).**
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df['a':'b'])
In the above, the "a" and "b" rows are extracted.
**Unlike columns, you cannot pick out several individual rows this way.**
(For example, you cannot extract just rows a and c.)
Also, if you want to select only row a, you still need to write it as a slice.
(For example, print(df['a':'a']) extracts only row a.)
You might wonder whether df['a':'a'] really returns row a, since an ordinary slice excludes its end.
In a DataFrame, slices of row labels / column labels include the right endpoint (closed interval), whereas slices of row numbers / column numbers exclude the right endpoint (half-open interval: closed on the left, open on the right), like normal Python slices.
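The closed versus half-open behavior is easy to compare side by side. The sketch below is my own illustration: it uses iloc for the row-number slice, which this article does not otherwise cover, and the names by_label, by_position, and only_a are made up.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# Label slice: the right endpoint 'b' is included
by_label = df['a':'b']
print(by_label.index.tolist())     # ['a', 'b']

# Row-number slice: the right endpoint 1 is excluded
by_position = df.iloc[0:1]
print(by_position.index.tolist())  # ['a']

# This is why 'a':'a' still returns row a
only_a = df['a':'a']
print(only_a.index.tolist())       # ['a']
```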
(I got tripped up by this myself ...)
This article covers the following DataFrame attributes: T, at, iat, loc, columns, index, shape, and values.
All of them are important, so let's look at them one by one.
T returns the transpose. Simply put, you get the data with rows and columns swapped.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.T)
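To make the swap easier to see, here is a small sketch with a non-square table (3 rows, 2 columns). The name df_t is illustrative; the data is a cut-down version of the sample above.

```python
import pandas as pd

data = [
    ['taro', 'kyoto'],
    ['jiro', 'osaka'],
    ['saburo', 'nara']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address'])
df_t = df.T

print(df.shape)                # (3, 2)
print(df_t.shape)              # (2, 3)
# Row labels and column labels trade places
print(df_t.index.tolist())     # ['name', 'address']
print(df_t.columns.tolist())   # ['a', 'b', 'c']
print(df_t.at['address', 'a']) # kyoto
```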
With at and iat, you can get the value at any single position in a DataFrame. **For a DataFrame, at and iat always require two arguments** (so that the position is narrowed down to a single cell). The difference between them is that at specifies the position by row label and column label, while iat specifies it by row number and column number.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.at['a','address']) # kyoto
In the case of the above code, since the "address" column of the "a" row is specified, kyoto can be obtained.
When using iat, it will be as follows.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
print(df.iat[1,2]) #2020-02-11T12:00:00
Since column 2 of row 1 is specified, 2020-02-11T12:00:00 is returned. (Row numbers and column numbers start at 0.)
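To see that at and iat point at the same cell, the sketch below looks up the row number and column number of a label with Index.get_loc, which is not used elsewhere in this article. The names row_pos and col_pos are illustrative.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# Translate labels into 0-based positions
row_pos = df.index.get_loc('b')        # 1
col_pos = df.columns.get_loc('birth')  # 2

# Label-based and position-based access reach the same value
print(df.at['b', 'birth'])             # 2020-02-11T12:00:00
print(df.iat[row_pos, col_pos])        # 2020-02-11T12:00:00
```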
Next is loc, which is used so often that it is no exaggeration to call it the most important part of DataFrame. Once you know how to use loc, you can handle most basic tasks. The basic syntax of loc is as follows. **When using slices, wrapping the slice in a list (square brackets) results in an error.**
DataFrame.loc[[<row label>, ...], [<column label>, ...]]
DataFrame.loc[<row label A>:<row label B>, <column label A>:<column label B>]
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
# Extract a single row
print(df.loc[['a']])
# Extract multiple rows
print(df.loc[['a', 'b']])
# Extract a single column
print(df.loc[:, ['name']])
# Extract multiple columns
print(df.loc[:, ['name', 'address']])
# Extract a combination of rows and columns
print(df.loc[['a', 'c'], ['name', 'birth']])
# A row slice can be combined with a list of columns
print(df.loc['a':'c', ['name', 'birth']])
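The second form of the syntax, with slices on both the row side and the column side, can be sketched as follows. The name sub is illustrative. The list-wrapped slice is shown only as a comment because Python rejects it as a syntax error before the code even runs.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# Slices on both axes; both right endpoints ('b' and 'address') are included
sub = df.loc['a':'b', 'name':'address']
print(sub.index.tolist())    # ['a', 'b']
print(sub.columns.tolist())  # ['name', 'address']

# Not valid Python: a slice cannot be written inside list brackets
# df.loc[['a':'b']]
```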
Even without the list brackets around the row label or column label, you can still extract a single row or a single column, but the result behaves differently from extracting it with a list. (Be careful, as this is another source of confusion ...)
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
# Single row, label given as a list
df.loc[['a']]
print(type(df.loc[['a']])) # pandas.core.frame.DataFrame
# Single row, label given without a list
df.loc['a']
print(type(df.loc['a'])) # pandas.core.series.Series
# Single column, label given as a list
df.loc[:, ['name']]
print(type(df.loc[:, ['name']])) # pandas.core.frame.DataFrame
# Single column, label given without a list
df.loc[:, 'name']
print(type(df.loc[:, 'name'])) # pandas.core.series.Series
As the above code shows, specifying labels as a list returns a DataFrame, while specifying a single label without a list returns a Series. (**Be careful about this in real code.**)
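One practical consequence is that the two results have different shapes and are read differently afterwards. Here is a minimal sketch; the names row_as_frame and row_as_series are my own.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

row_as_frame = df.loc[['a']]  # DataFrame with 1 row and 3 columns
row_as_series = df.loc['a']   # Series whose index is the column labels

print(row_as_frame.shape)              # (1, 3)
print(row_as_series.shape)             # (3,)
print(row_as_series.index.tolist())    # ['name', 'address', 'birth']

# The same cell is reached differently from each result
print(row_as_frame.at['a', 'address']) # kyoto
print(row_as_series['address'])        # kyoto
```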
There are many situations where you want to check which column labels and row labels are set on a DataFrame. You can get the column labels with columns and the row labels with index.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Show column label
print(df.columns) # Index(['name', 'address', 'birth'], dtype='object')
#Show row label
print(df.index) # Index(['a', 'b', 'c'], dtype='object')
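Both columns and index return Index objects rather than plain lists. If a plain list or a simple membership check is what you need, something like the following sketch should work (the names column_list and index_list are illustrative).

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# Convert the Index objects to ordinary Python lists
column_list = list(df.columns)
index_list = list(df.index)
print(column_list)  # ['name', 'address', 'birth']
print(index_list)   # ['a', 'b', 'c']

# Check whether a label exists before using it
print('address' in df.columns)  # True
print('z' in df.index)          # False
```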
You will often want to see how many rows and columns a DataFrame has. You can get both with shape. The first element of the tuple returned by shape is the number of rows, and the second is the number of columns.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Get the number of rows and columns
print(df.shape) # (3, 3)
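Since shape is an ordinary tuple, it can be unpacked directly. The sketch below uses a 3-row, 2-column table so the two numbers are easy to tell apart; n_rows and n_cols are names made up for this example.

```python
import pandas as pd

data = [
    ['taro', 'kyoto'],
    ['jiro', 'osaka'],
    ['saburo', 'nara']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address'])

# Unpack the tuple: rows first, then columns
n_rows, n_cols = df.shape
print(n_rows)   # 3
print(n_cols)   # 2

# len() on a DataFrame gives the number of rows
print(len(df))  # 3
```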
DataFrame comes with information such as row labels and column labels in addition to values, but if you don't need label information and only data, you can convert it to a numpy array.
data = [
['taro', 'kyoto', '2020-01-01T12:00:00'],
['jiro', 'osaka', '2020-02-11T12:00:00'],
['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data = data, index = ['a', 'b', 'c'], columns = ['name', 'address', 'birth'])
#Get a numpy array
print(df.values)
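As a side note, pandas also provides a to_numpy() method, which the pandas documentation recommends over values. It is not used in this article, so treat the following as a supplementary sketch; arr_values and arr_to_numpy are illustrative names.

```python
import numpy as np
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

arr_values = df.values        # attribute used in this article
arr_to_numpy = df.to_numpy()  # method form

print(type(arr_to_numpy).__name__)  # ndarray
print(arr_to_numpy.shape)           # (3, 3)
# Both hold the same data; the labels are gone in either case
print(np.array_equal(arr_values, arr_to_numpy))  # True
```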
Once converted to a numpy array, you can access the data by indexing, just as with an ordinary 2D array. **Please note that the way to extract data differs between DataFrame, Series, and numpy arrays.** (This is also a source of confusion ...)
# numpy arrays can be accessed just like regular 2D arrays
# Extract the element at row 0, column 1
print(df.values[0][1]) # kyoto
# Extract row 0
print(df.values[0])
# You can also use slices
# Extract column 1
print(df.values[:, 1]) # ['kyoto' 'osaka' 'nara']
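To put the three access styles next to each other, here is a sketch that reads the same cell (row a, column address) from a DataFrame, a Series, and a numpy array. The names from_df, from_sr, and from_arr are just for this example.

```python
import pandas as pd

data = [
    ['taro', 'kyoto', '2020-01-01T12:00:00'],
    ['jiro', 'osaka', '2020-02-11T12:00:00'],
    ['saburo', 'nara', '2020-03-22T12:00:00']
]
df = pd.DataFrame(data=data, index=['a', 'b', 'c'], columns=['name', 'address', 'birth'])

# DataFrame: row label and column label
from_df = df.at['a', 'address']

# Series: take the column first, then only a row label is needed
sr = df['address']
from_sr = sr['a']

# numpy array: labels are gone, so use row number and column number
arr = df.values
from_arr = arr[0, 1]

print(from_df, from_sr, from_arr)  # kyoto kyoto kyoto
```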
Pandas is certainly difficult, but the official reference is written very clearly, so I think the hurdle is not unreasonable if you read the reference carefully as you learn. (Maybe that's just me, but ...) In particular, "10 minutes to pandas" is compact and easy to understand, so I highly recommend it as a starting point.
I think I used the phrase "source of confusion" about four times in this article. Those parts are particularly tricky, so if you do not understand them well, you may run into trouble when the amount of data grows or when you combine data.
Part 2 will finally introduce Pandas methods. There are so many methods that they are hard to learn, but I will explain them as simply as I can, so please bear with me. (^^♪ See you then (^_-)-☆