[PYTHON] Data processing tips with Pandas

These are tips for data processing with Pandas, which double as a personal memo. I wrote down the things I could not find when I googled. I plan to add more over time. I would appreciate it if you let me know about any mistakes or possible improvements.

The first thing to look at is the cheat sheet

Many thanks to those who translated it into Japanese. https://qiita.com/s_katagiri/items/4cd7dee37aae7a1e1fc0

Apply a function to multiple columns and save each result in another column

Example: store the number of "@" characters contained in x1 in cnt_x1, and do the same for x2, x3, and so on: x1→cnt_x1, ..., x13→cnt_x13

migs = {'cnt_x1': 'x1', 'cnt_x2': 'x2', ...,  'cnt_x13': 'x13'}

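# new_col is the column to create, src_col is the column to count '@' in
# (the original name "vars" shadowed a Python built-in)
for new_col, src_col in migs.items():
    df1[new_col] = df1[src_col].str.count('@')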
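When the column names follow a regular pattern, the mapping can also be built programmatically instead of typed out. The following is a minimal sketch, assuming the source columns all start with "x" as in the example; the sample DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: only three of the x columns, for illustration.
df1 = pd.DataFrame({
    "x1": ["a@b.com", "no at sign", "@@"],
    "x2": ["@", "", "x@y@z"],
    "x3": ["abc", "d@e", "f"],
})

# Build {'cnt_x1': 'x1', ...} from the columns that actually exist.
migs = {f"cnt_{col}": col for col in df1.columns if col.startswith("x")}

for new_col, src_col in migs.items():
    # Series.str.count takes a regular expression; '@' is matched literally.
    df1[new_col] = df1[src_col].str.count("@")

print(df1[["cnt_x1", "cnt_x2", "cnt_x3"]])
```

The dictionary is built before the loop adds any columns, so the new cnt_ columns are never picked up as sources.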

- keys(): loop over the key of each element
- values(): loop over the value of each element
- items(): loop over both the key and the value of each element

The mapping is held in a dictionary. Keys and values correspond as follows: {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}. In migs, each key is the name of the new column and each value is the name of the column it is computed from.
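As a small illustration of the three iteration methods, the sketch below uses a shortened, hypothetical version of the mapping with only three entries.

```python
# Shortened mapping for illustration: new column name -> source column name.
migs = {"cnt_x1": "x1", "cnt_x2": "x2", "cnt_x3": "x3"}

keys = [k for k in migs.keys()]            # keys only
values = [v for v in migs.values()]        # values only
pairs = [(k, v) for k, v in migs.items()]  # (key, value) tuples

print(keys)
print(values)
print(pairs)
```

Dictionaries keep insertion order in Python 3.7 and later, so the three lists line up element by element.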

Send a query to Postgres and build a data frame (including the column headers)

Write the query inside cur.execute(), enclosed in triple quotes ('''). Personally (in the case of Postgres), I first check that the query behaves as expected in pgAdmin and then paste it in.

import psycopg2
import pandas as pd
conn = psycopg2.connect("host=hostname  user=username port=port dbname=dbname password=password")
# Execute the SQL
cur = conn.cursor()
# Refer to the table as schema_name.table_name if needed
cur.execute('''
select *
from hoge
;''')
results = cur.fetchall()
# Convert the result to a DataFrame, taking the column names from the cursor
df = pd.DataFrame(results, columns=[col.name for col in cur.description])
cur.close()
conn.close()
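The psycopg2 code above needs a running PostgreSQL server, so it is hard to try out as-is. As a rough sketch of the same pattern, the example below uses Python's built-in sqlite3 module with an in-memory database instead. This is a stand-in, not the Postgres setup described here: the table contents are made up, and in sqlite3 each entry of cur.description is a plain tuple whose first element is the column name, so col[0] is used in place of col.name.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the Postgres server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table and rows, for illustration only.
cur.execute("create table hoge (id integer, name text)")
cur.executemany("insert into hoge values (?, ?)", [(1, "a"), (2, "b")])

cur.execute('''
select *
from hoge
;''')
results = cur.fetchall()

# cur.description holds one entry per result column; the name comes first.
columns = [col[0] for col in cur.description]
df = pd.DataFrame(results, columns=columns)

cur.close()
conn.close()
print(df)
```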

Create an empty file whose name records the current number of samples, to keep track of the situation

If you combine the Postgres query above with the program that builds the data frame and run it regularly with the Windows Task Scheduler, you can keep track of the number of samples in the database every day (or every week, every hour, and so on).

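allcnt = len(df)

# date, domestic, and foreign are assumed to be defined earlier (not shown here)
filename = f"./date{date}_Total_{allcnt}_Domestic_{domestic}_overseas_{foreign}.txt"
with open(filename, "w"):
    pass  # only create the empty file; the information is in its name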
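The same idea can be wrapped in a small helper. The sketch below is self-contained and makes several assumptions: the function name write_status_file, the YYYYMMDD date format, and the sample counts are all hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import tempfile
from datetime import date
from pathlib import Path


def write_status_file(out_dir, today, total, domestic, foreign):
    """Create an empty file whose name encodes the date and the counts."""
    name = (
        f"date{today:%Y%m%d}_Total_{total}"
        f"_Domestic_{domestic}_overseas_{foreign}.txt"
    )
    path = Path(out_dir) / name
    path.touch()  # creates the file if missing; leaves existing content alone
    return path


with tempfile.TemporaryDirectory() as tmp:
    # Made-up counts for illustration.
    status = write_status_file(tmp, date(2020, 1, 31), total=120, domestic=100, foreign=20)
    created_name = status.name
    existed = status.exists()
    size = status.stat().st_size
    print(created_name)
```

Unlike open(..., "w"), Path.touch() does not truncate a file that already exists; for empty marker files the result is the same.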
