I've had several opportunities to use pandas for in-house development tasks, but since I don't use it often, I keep having to look up how to do even basic things. Also, by the nature of the work I often handle several GB of data, so I will update this post from time to time with methods for handling such cases. The post is therefore organized in the form of "what you want to do ⇒ how to do it".
By the way, this is not about heavy machine learning, so please note that I won't be covering that area.
First of all, nothing can be done unless the CSV can be read, so let's start with how to read one. The basic form is as follows.
import pandas as pd

df = pd.read_csv("file name")
However, when the file size reaches the GB range, there is a good chance you will run out of memory. In such a case, add the chunksize option and load the file in pieces.
If you specify chunksize, the result is a TextFileReader instance instead of a DataFrame. Iterating over the TextFileReader yields DataFrames. In the following example, 50 rows are read at a time and printed.
data = pd.read_csv("test.csv", chunksize=50)
for chunk in data:
    print(chunk)  # each chunk is a DataFrame of up to 50 rows
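As a concrete use of this, here is a minimal sketch of filtering a large file chunk by chunk without ever loading it all at once. The "price" column and the threshold are made up for illustration:

```python
import pandas as pd

# Hypothetical example: keep only rows where the (assumed) column
# "price" exceeds a threshold, accumulating the filtered chunks.
filtered_parts = []
for chunk in pd.read_csv("test.csv", chunksize=50):
    filtered_parts.append(chunk[chunk["price"] > 1000])

# Concatenate the surviving rows into one DataFrame.
result = pd.concat(filtered_parts, ignore_index=True)
print(result.shape)
```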
Option | Meaning | Example
---|---|---
encoding | Character encoding | encoding='UTF-8'
skiprows | Lines to skip at the start | skiprows=2
chunksize | Number of rows to read at a time | chunksize=50
usecols | Read only the specified columns | usecols=[1, 3]
[Other options](https://own-search-and-study.xyz/2015/09/03/pandas%E3%81%AEread_csv%E3%81%AE%E5%85%A8%E5%BC%95%E6%95%B0%E3%82%92%E4%BD%BF%E3%81%84%E3%81%93%E3%81%AA%E3%81%99/)
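These options can of course be combined; a small sketch with a hypothetical file:

```python
import pandas as pd

# Read a UTF-8 file, skip the first 2 lines,
# and keep only columns 1 and 3 (file name is made up).
df = pd.read_csv("test.csv", encoding="UTF-8", skiprows=2, usecols=[1, 3])
```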
How to read and concatenate separate files
import numpy as np

# Read file 1
data1 = pd.read_csv(file1, dtype=np.float32)
# Read file 2
data2 = pd.read_csv(file2, dtype=np.float32)
# Concatenate the two DataFrames; ignore_index=True renumbers the rows
rawData = pd.concat([data1, data2], ignore_index=True)
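When there are many files, it is easier to build the list with glob than to read them one by one. A sketch, where the "data_*.csv" pattern is made up:

```python
import glob

import numpy as np
import pandas as pd

# Read every matching CSV and concatenate them in one go.
files = sorted(glob.glob("data_*.csv"))
rawData = pd.concat(
    (pd.read_csv(f, dtype=np.float32) for f in files),
    ignore_index=True,
)
```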
df['A']   # extract column A as a Series
df[1:3]   # extract rows 1 and 2 by position
Note that rows are numbered from 0, so in this case the first row (row 0) is not included. Also, `:3` means up to row 2, because the end of a slice is exclusive.
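A minimal illustration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]})

print(df['A'])   # the whole column A
print(df[1:3])   # rows 1 and 2 only; row 0 is excluded
```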
Specifying inplace=True changes the original DataFrame. In the following example, the City and Price columns disappear from df.
df.drop(columns=['City', 'Price'], inplace=True)
Columns can also be specified by position:
df.drop(columns=df.columns[[1, 2]], inplace=True)
df = df.drop(df.index[[1, 3, 5]])  # drop the rows at positions 1, 3 and 5
By default the row index is a serial number starting from 0, but note that it will no longer be sequential after sorting, filtering and so on.
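Putting the drop variants together, a sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'City': ['Tokyo', 'Osaka'],
                   'Price': [100, 200],
                   'Stock': [5, 3]})

df2 = df.drop(columns=['City'])          # by column name, returns a copy
df3 = df.drop(columns=df.columns[[1]])   # by position (drops 'Price')
df4 = df.drop(df.index[[0]])             # drop the first row
```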
df.replace({'column_name': {original_value: replacement_value}})
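A concrete example with a hypothetical 'city' column; note that replace is not in-place by default, so assign the result back:

```python
# Replace the value 'Tokyo' with 'TYO' in the 'city' column only.
df = df.replace({'city': {'Tokyo': 'TYO'}})
```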
Use groupby when the processing depends on a column's value. Since it returns a GroupBy object, you can iterate over it directly with a for statement.
for value, group_df in df.groupby('column_name'):
    # process each group here
Also, each group keeps its original, no longer sequential row index, so use reset_index() to renumber the index of each DataFrame.
for city, sdf in df.groupby('city'):
    sdf = sdf.reset_index(drop=True)  # renumber the index from 0
    if city == 'Tokyo':
        flags = 1
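For simple aggregations, you don't need the explicit loop at all; a sketch assuming hypothetical 'city' and 'price' columns:

```python
# Mean price per city, computed directly on the GroupBy object.
means_per_city = df.groupby('city')['price'].mean()
print(means_per_city)
```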
df.shape[0] returns the number of rows in the DataFrame. Without [0], shape returns a tuple of (number of rows, number of columns).
df.shape[0]
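A tiny demonstration with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
print(df.shape)     # (3, 2): 3 rows, 2 columns
print(df.shape[0])  # 3: number of rows
```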
To compute the column-wise mean and standard deviation:
means = df.mean(axis=0)  # mean of each column
std = df.std(axis=0)     # standard deviation of each column
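One common follow-up, sketched here as an assumption about intent, is standardizing each column with these values (this assumes all columns are numeric):

```python
# Standardize every column to zero mean and unit variance,
# reusing the means and std computed above.
normalized = (df - means) / std
```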
To save the processed DataFrame as a CSV:
df.to_csv("file name")
Option | Meaning | Example
---|---|---
columns | Write only the specified columns | columns=['age']
header | Whether to write the header row | header=False
index | Whether to write the row index | index=False
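Combining the options, with a made-up file name and the 'age' column from the table above:

```python
# Write only the 'age' column, without the row index.
df.to_csv("out.csv", columns=['age'], index=False)

# Appending to an existing file is also possible with mode='a';
# header=False avoids writing the header row a second time.
df.to_csv("out.csv", mode='a', header=False, index=False)
```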