[PYTHON] Memorandum (pseudo Vlookup by pandas)

Introduction

https://qiita.com/wellwell3176/items/e6a17609c7398dd7d496

Since the program created above has been improved, I will summarize it.

** Don't enter it on Excel, just calculate everything on python and throw only the result into Excel, right? ** **

I noticed that it was too natural, so I feel that it has become quite compact. I used openpyxl so far, but this time I used pandas. The difference between the two is still not well understood.

9/30 postscript: In the comment section, @nkay presented a more sophisticated method than this method, so I will post it. The output is almost the same (only the "Category + Business" column is additionally generated).

Proposal method


import pandas as pd

df_master = pd.read_excel('/content/drive/My Drive/Colab Notebooks/table.xlsx')
df_test = pd.read_excel('/content/drive/My Drive/Colab Notebooks/data5.xlsx')

out = df_test.merge(df_master, on=['Classification', 'business'], how='left')
out.to_excel('/content/drive/My Drive/Colab Notebooks/data6.xlsx', index=False)

Outcome

image.png Figure: Output Excel (color coded later) It is good because it is possible to search for and output the items that correspond to setting 1 and setting 2 from the classification + business content. There are no unnecessary lines that are unnecessary at the time of final submission, and it is refreshing. https://qiita.com/wellwell3176/items/e6a17609c7398dd7d496 Compared to Fig. 2 in, you can clearly see that the output is at a usable level.

program


import pandas as pd
df_master=pd.read_excel('/content/drive/My Drive/Colab Notebooks/table.xlsx')
df_test=pd.read_excel('/content/drive/My Drive/Colab Notebooks/data5.xlsx')

row_count=len(df_test.index)  #Get table length to determine the number of loops in a for statement

for i in range(0,row_count): #.iterrows didn't work for some reason
  kubun = df_test.at[(i,"Classification")]
  gyomu = df_test.at[(i,"business")]
  kubun_gyomu=(df_master['Classification'] == kubun) & (df_master["business"] == gyomu) #Classificationとbusinessが一致する行をBoolで判定

  row_true=df_master[kubun_gyomu] #Extract only true rows from the data frame
  set1_val=row_true.iat[0,3]
  set2_val=row_true.iat[0,4] #Setting 1/Since the position of 2 is uniquely determined, specify the position numerically. I couldn't do something like the 0th row "setting 2" column.
#I feel that there are too many variables, but if I don't divide it this much, it seems that what I'm doing later will be refreshing, so I divided it.

  df_test.at[(i,"Setting 1")]=set1_val 
  df_test.at[(i,"Setting 2")]=set2_val #こっちの指定はi行目”Setting 2”列が可能。よく分からん。

df_test.to_excel('/content/drive/My Drive/Colab Notebooks/data6.xlsx',index=False)

Error messages and failures I saw during creation

ValueError: At based indexing on an integer index can only have integer indexers

python


for i in df_test.iterrows():
  kubun = df_test.at[(i,"Classification")]
  gyomu = df_test.at[(i,"business")]

-->ValueError: At based indexing on an integer index can only have integer indexers
 #i=[0]From[Number of last lines in the data frame]I thought that it would work by specifying, but it is useless
 #The reason is still unknown. It seems that i does not contain an integer value ...
 #Solved by specifying the length with the range function like the finished product

**. Key error when using at [0, "hogehoge"] **

Failed version


for i in range(0,45):
  kubun = df_test.at[(i,"Classification")] #テストファイルからClassificationの値を引っ張り出す
  kubun_correct=(df_master['Classification'] == kubun) #Search for master file and hook
  row_true=df_master[kubun_correct] #Pull out only the true line
  set1_val=row_true.at[0,"Setting 1"] #I was caught here trying to output the data of the first row "setting 1" column that issued true
  df_test.at[(i,"Setting 1")]=set1_val  #I wanted to substitute the data of the corresponding setting 1 into Excel

-->key error 0
#The cause is.I don't understand that at is a string reference. i=An error occurred at 10 (category = construction).
#Print at that time(kubun_correct)The result of
0     False
1     False
2      True
3      True
#print(row_true)Then it will be as follows
Category Business setting 1 Setting 2
2 Construction Weekday Regular B South
3 Construction Holiday Irregular B North
#At this time, row_true.at[0,"Setting 1"]は、”0”行の”Setting 1”列を読み込みに行くが、0行はあるものの"0"Cannot be referenced because there is no line
#Therefore, like a finished product.Resolved by referencing 0 rows and 3 columns with iat

References

How to get the cell value of Pandas DataFrame https://www.delftstack.com/ja/howto/python-pandas/how-to-get-a-value-from-a-cell-of-a-dataframe/ How to extract only the value from the data extracted by Pandas https://tipstour.net/python-pandas-get-value Extract (select) rows from AND, OR, NOT of multiple conditions with pandas https://note.nkmk.me/python-pandas-multiple-conditions/

Recommended Posts

Memorandum (pseudo Vlookup by pandas)
Pandas memorandum
pandas memorandum
Pandas operation memorandum
Sort by pandas
[For recording] Pandas memorandum
pandas Matplotlib Summary by usage
Standardize by group with pandas
Visualization memo by pandas, seaborn
Memorandum @ Python OR Seminar: Pandas
Manipulating strings with pandas group by
Feature generation with pandas group by