[PYTHON] Best practices for messing with data with pandas

In machine learning, once preprocessing is done and you have a hypothesis in mind, I think there is a phase where you play around with the data. At that point, being able to manipulate pandas freely becomes important.

I have a little programming experience and some database knowledge, but even so:

Pandas DataFrame [] indexing is too complicated!!

Narrowing down rows by a condition is especially hard to read.

train[train["company_id"] == 1088]["meter_reading"]

It is already confusing at this point, and if the DataFrame were named train_weather_df instead, it would become unreadable.

train_weather_df[train_weather_df["company_id"] == 1088]["meter_reading"]

Worse, when there are two conditions, things get even scarier...
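For reference, a two-condition filter written with plain boolean indexing might look like the sketch below. The sample values are made up for illustration; only the column names come from the examples above.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the article.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000, 1088],
    "meter_reading": [25000, 100, 30000, 21000],
})

# Each condition needs its own parentheses, and & is used instead of `and`.
mask = (train["company_id"] == 1088) & (train["meter_reading"] > 20000)
selected = train[mask]["meter_reading"]
print(selected.tolist())
```

The DataFrame name appears three times in one expression, which is what makes this style hard to read as names get longer.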

So for anything even slightly complicated, I think it is better to use .query(), passing the condition as a string (qry here).

train.query(qry)["meter_reading"]
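As a minimal sketch of the same filter with query, assuming made-up sample values and a condition string like the one used later in this article:

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the article.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000, 1088],
    "meter_reading": [25000, 100, 30000, 21000],
})

# The whole condition lives in one string, so the DataFrame name appears once.
qry = "company_id == 1088 & meter_reading > 20000"
result = train.query(qry)["meter_reading"]
print(result.tolist())
```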

After narrowing down, you may want to label the matching rows, for example as "group 1", for analysis. Note that you cannot assign directly to the result of the query method.

train.query(qry)["group"] = 1  # wrong: train itself is not modified

This does not work!!
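A small sketch of why this fails, using made-up sample values: query returns a new DataFrame, so the assignment lands on a temporary object and the original is left as it was. Depending on the pandas version, a warning about chained assignment may also be printed.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the article.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000, 1088],
    "meter_reading": [25000, 100, 30000, 21000],
})
qry = "company_id == 1088 & meter_reading > 20000"

# The new column is set on the temporary result of query(), not on train.
train.query(qry)["group"] = 1

has_group_column = "group" in train.columns
print(has_group_column)
```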

It is a slightly roundabout way, but I think the following is probably better.

# Build the condition, collect the matching row labels, then assign through .loc
qry = 'company_id == 1088 & meter_reading > 20000'
target_idx = train.query(qry).index
train.loc[target_idx, "group"] = 1

Readability isn't that bad, and above all, with

train.loc[target_idx]

you can check whether the rows were narrowed down correctly, which I think is nice.
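Put together, the index-then-assign approach could be sketched as follows. The sample values and the default group of 0 are assumptions for illustration, and .loc is used with square brackets so that the row labels and the column name are given in a single indexing step.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the article.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000, 1088],
    "meter_reading": [25000, 100, 30000, 21000],
})
train["group"] = 0  # assumption: rows that do not match stay in group 0

qry = "company_id == 1088 & meter_reading > 20000"
target_idx = train.query(qry).index   # row labels of the matching rows
train.loc[target_idx, "group"] = 1    # assign on the original DataFrame

checked = train.loc[target_idx]       # inspect exactly the rows that were labeled
print(checked)
```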

However, there seem to be restrictions on the characters that can be used in a query string, which may become a problem someday.
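One such restriction concerns column names that are not valid Python identifiers, for example names containing a space. In pandas 0.25 and later these can be wrapped in backticks inside the query string. The column name below is hypothetical and only serves to show the syntax.

```python
import pandas as pd

# Hypothetical column name containing a space.
df = pd.DataFrame({
    "company_id": [1088, 1088],
    "meter reading": [10, 30000],
})

# Backticks let query() refer to a column whose name is not a valid identifier.
hits = df.query("`meter reading` > 20000")
print(hits)
```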

By the way, the query examples on this site are helpful (engine='python' and so on):
https://ohke.hateblo.jp/entry/2019/01/12/230000

For example, variables can be referenced inside the query string with @:

tmp_q = "name_ns == @t_name & year == @t_year"
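A minimal sketch of how a query string with @ picks up Python variables. The DataFrame contents and the values of t_name and t_year are invented for illustration; only the names come from the line above.

```python
import pandas as pd

# Hypothetical data using the column names from the query string above.
df = pd.DataFrame({
    "name_ns": ["a", "b", "a"],
    "year": [2019, 2019, 2020],
})

t_name = "a"   # assumed example values
t_year = 2019

# @t_name and @t_year are looked up among the variables in the calling scope.
tmp_q = "name_ns == @t_name & year == @t_year"
matched = df.query(tmp_q)
print(matched)
```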

Referenced articles
https://qiita.com/kurumen-b/items/45b60299f0893a537f2a
https://qiita.com/mwmsnn/items/6a464865759231aa888d

Further notes
With recent pandas, it seems that chaining an indexer such as iloc after first narrowing down to a column is not recommended, so indexing rows and columns together in a single loc or iloc call seems to matter even more. If you want to go back and forth between row numbers, column numbers, row labels, and column names, see below:
https://note.nkmk.me/python-pandas-get-loc-row-column-num/
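As a rough sketch of converting between labels and positions, Index.get_loc can be called on both df.columns and df.index. The row labels and values here are made up for illustration.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"company_id": [1088, 2000], "meter_reading": [25000, 100]},
    index=["r1", "r2"],
)

col_pos = df.columns.get_loc("meter_reading")  # column name -> column number
row_pos = df.index.get_loc("r2")               # row label -> row number

value_by_position = df.iloc[row_pos, col_pos]      # positions in one iloc call
value_by_label = df.loc["r2", "meter_reading"]     # labels in one loc call
print(col_pos, row_pos, value_by_position, value_by_label)
```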
