[PYTHON] Data processing tips with Pandas

These are tips for data processing by Pandas, which doubles as a personal memorandum. I wrote down what I didn't get caught when I googled. We plan to add more and more. We would appreciate it if you could let us know if you have any mistakes or improvements.

The first thing to look at is the cheat sheet

Overwhelming thanks to those who translated it into Japanese. https://qiita.com/s_katagiri/items/4cd7dee37aae7a1e1fc0

Apply the function to multiple variables and save the return value in another variable.

Example: Put the number of "@" contained in x1 into cnt_x1, which is also done for x2, x3, .... x1→cnt_x1, ..., x13→cnt_x13

migs = {'cnt_x1': 'x1', 'cnt_x2': 'x2', ...,  'cnt_x13': 'x13'}

for vars, mig in migs.items():
    df1[vars] = df1[mig].str.count('＠')

--keys (): for loop processing for key key of each element --values (): for loop processing for the value value of each element --items (): for loop processing for key key and value value of each element

Use a dictionary. The correspondence between the key and value of the dictionary is as follows. {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}

Send a query to postgres to make a data frame (also get the header)

Write the query enclosed in'''in cur.execute (). Personally (in the case of Postgres) I wrote it after checking the movement with PgAdmin.

import psycopg2
import pandas as pd
conn = psycopg2.connect("host=hostname  user=username port=port dbname=dbname password=password")
# execute sql
cur = conn.cursor()
#Schema name.table name
cur.execute('''
select *
from hoge
;''')
results = cur.fetchall()
#I want to be df
df = pd.DataFrame(results, columns=[col.name for col in cur.description])
cur.close()
conn.close()

How to create an empty file and write the current number of samples in the file name to understand the situation

If you query the above postgres, combine it with a dataframe program, and make it run regularly in the Windows task scheduler, you can grasp the status of the sample in the database every day (weekly, hourly, etc.) can do.

allcnt = len(df)

with open(r"./date" + str(date) +  r"_Total_" + str(allcnt) + r"_Domestic_" + str(domestic) + r"_overseas_" + str(foreign) + r".txt","w"):pass

Recommended Posts

Data processing tips with Pandas

Processing datasets with pandas (1)

Processing datasets with pandas (2)

Data visualization with pandas

Data manipulation with Pandas!

Example of efficient data processing with PANDAS

Process csv data with python (count processing using pandas)

Versatile data plotting with pandas + matplotlib

[pandas] GroupBy Tips

Read pandas data

Tips for plotting multiple lines with pandas