[PYTHON] Best practices for manipulating data with pandas

In machine learning, once preprocessing is done, I think there is a phase where you play with the data while holding a hypothesis. At that point, being able to manipulate pandas freely becomes important.

I have some programming experience and database knowledge myself, but even so,

***the pandas DataFrame [] notation is too complicated!***

Narrowing rows down by a condition is especially hard to read.

train[train["company_id"] == 1088]["meter_reading"]

This is already confusing, and with a longer name such as train_weather_df it becomes unreadable.

train_weather_df[train_weather_df["company_id"] == 1088]["meter_reading"]

Worse, with two conditions it gets even scarier.
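To make the two-condition case concrete, here is a minimal sketch with bracket indexing. The sample rows and the threshold of 20000 are made up for illustration; only the column names come from this article.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the article.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000, 1088],
    "meter_reading": [25000, 100, 30000, 40000],
})

# Each condition needs its own parentheses, because & binds more
# tightly than == and >. The DataFrame name also appears three times.
mask = (train["company_id"] == 1088) & (train["meter_reading"] > 20000)
result = train[mask]["meter_reading"]
print(result.tolist())
```

Forgetting one pair of parentheses here changes how the expression is parsed, which is part of why this notation is easy to get wrong.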

So for anything a little complicated, I think it is better to use ***.query()***.

train.query(qry)["meter_reading"]
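As a rough sketch of what the line above does, the following compares query with bracket indexing on a small made-up DataFrame. The data and the contents of qry are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the article.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000],
    "meter_reading": [25000, 100, 30000],
})

# The condition is written as a string, so the DataFrame name
# does not have to be repeated inside the brackets.
qry = "company_id == 1088"
by_query = train.query(qry)["meter_reading"]

# The same selection with bracket indexing, for comparison.
by_brackets = train[train["company_id"] == 1088]["meter_reading"]

print(by_query.equals(by_brackets))
```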

After narrowing the rows down, you will probably want to label them, for example as "group 1", for analysis. Note that you cannot assign directly to what the query method returns.

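××××　train.query(qry)["group"] = 1

***This does not work!*** The assignment goes to the new DataFrame returned by query, so train itself is left unchanged.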
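The following sketch shows the problem on made-up data. It assumes a "group" column that starts at 0, which is not in the original article. Depending on the pandas version, this pattern may also emit a warning about assigning to a copy, which is silenced here.

```python
import warnings

import pandas as pd

# Hypothetical sample data; "group" starting at 0 is an assumption.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000],
    "meter_reading": [25000, 100, 30000],
    "group": [0, 0, 0],
})
qry = "company_id == 1088 & meter_reading > 20000"

with warnings.catch_warnings():
    # pandas may warn about assigning to a copy; ignore it for this demo.
    warnings.simplefilter("ignore")
    # The assignment lands on the temporary DataFrame returned by query.
    train.query(qry)["group"] = 1

# train still holds the original values.
print(train["group"].tolist())
```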

The following is a little roundabout, but I think it is probably the better way.

qry = 'company_id == 1088 & meter_reading > 20000'
target_idx = train.query(qry).index
train.loc[target_idx, "group"] = 1

Readability is not that bad, and above all, with

train.loc[target_idx]

you can check whether the rows were narrowed down correctly, which I think is a real advantage.
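Putting the steps together, a minimal runnable sketch might look like this. The sample rows and the default group value of 0 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the article.
train = pd.DataFrame({
    "company_id": [1088, 1088, 2000, 1088],
    "meter_reading": [25000, 100, 30000, 40000],
})
train["group"] = 0  # assumed default group for every row

# 1. Describe the condition as a string.
qry = "company_id == 1088 & meter_reading > 20000"

# 2. Keep only the index labels of the matching rows.
target_idx = train.query(qry).index

# 3. Assign through .loc on the original DataFrame (square brackets,
#    with rows and column given in a single call).
train.loc[target_idx, "group"] = 1

# 4. Inspect the selected rows to confirm the filter did what was intended.
print(train.loc[target_idx])
```

Note that this relies on the index labels being unique. If labels are duplicated, .loc assigns to every row that shares a selected label.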

However, there seem to be restrictions on the characters that can be used in a query, and that may become a problem someday.
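One restriction I am aware of concerns column names that are not valid Python identifiers, for example names containing a space. Assuming pandas 0.25 or later, such a name can be wrapped in backticks inside the query string. The column name below is made up for illustration, and this may not be the restriction the sentence above has in mind.

```python
import pandas as pd

# Hypothetical column name containing a space.
df = pd.DataFrame({"meter reading": [25000, 100, 30000]})

# Backticks let query refer to a column name that is not a valid identifier.
high = df.query("`meter reading` > 20000")
print(high)
```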

By the way, the query examples on this site are helpful (engine = python, etc.): https://ohke.hateblo.jp/entry/2019/01/12/230000

Addendum: there are cases where you want to use variables in a query. If you prefix a name with @, it is recognized as a variable! This is extremely important.

tmp_q = "name_ns == @t_name & year == @t_year"
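A small sketch of the @ prefix in use. The DataFrame contents and the values of t_name and t_year are made up; only the column and variable names come from the query string above.

```python
import pandas as pd

# Hypothetical sample data matching the column names in the query string.
df = pd.DataFrame({
    "name_ns": ["alice", "bob", "alice"],
    "year": [2019, 2019, 2020],
})

# Ordinary Python variables, referenced from the query with @.
t_name = "alice"
t_year = 2019

tmp_q = "name_ns == @t_name & year == @t_year"
hits = df.query(tmp_q)
print(hits)
```

Without the @, query would look for columns named t_name and t_year instead of the Python variables.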

Referenced articles
https://qiita.com/kurumen-b/items/45b60299f0893a537f2a
https://qiita.com/mwmsnn/items/6a464865759231aa888d

Further notes: with recent pandas, chaining an indexer such as iloc after first narrowing down the columns does not seem to be recommended. Specifying the rows and the columns together in a single loc or iloc call seems to be the notation that matters. If you need to go back and forth between row numbers, column numbers, row labels, and column names, see:
https://note.nkmk.me/python-pandas-get-loc-row-column-num/
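As a rough illustration of converting between labels and positions, the sketch below uses Index.get_loc. The string row labels and the data are assumptions for the example.

```python
import pandas as pd

# Hypothetical sample data with string row labels.
train = pd.DataFrame(
    {"company_id": [1088, 2000], "meter_reading": [25000, 30000]},
    index=["a", "b"],
)

# Label or name -> integer position.
row_pos = train.index.get_loc("b")
col_pos = train.columns.get_loc("meter_reading")

# Rows and columns given together in one call, by position and by label.
by_position = train.iloc[row_pos, col_pos]
by_label = train.loc["b", "meter_reading"]

# Integer position -> label or name.
row_label = train.index[row_pos]
col_name = train.columns[col_pos]
print(by_position, by_label, row_label, col_name)
```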