[PYTHON] Normalize the file that converted Excel to csv as it is.

Sendai Central Wholesale Market data is finally available in csv.

But

seika.csv


Sendai City Central Wholesale Market Daily Report (Fruit and Fruit Department),,,,,,,,,,,,,,,,,,,,,,,,,,,,
January 13, 3rd year of Reiwa (Wednesday) Weather(Sunny),,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,Sendai City Central Wholesale Market,,,,,,,,,,,,,,,,,
Sales volume,(1) Vegetables,310 t,,,,,,,,,TEL (022)232−8121〜2,,,,,,,,,,,,,,,,,
,(2) Fruit,94 t,,,,,,,,,FAX (022)232−8144,,,,,,,,,,,,,,,,,
Total sales volume,404 ,t,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Market price list by major item (1) Vegetables,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Product name,Origin,unit,,High price,Medium price,Low price,Product name,Origin,unit,,High price,Medium price,Low price,,,,,,,,,,,,,,,
Daikon L,Chiba,10,Kg,"1,728","1,512","1,296",Tomato M,Kumamoto,4,Kg,"1,404",−     ,"1,296",,,,,,,,,,,,,,,
Daikon L,Kanagawa,10,Kg,"1,728",−     ,"1,296",Cherry tomato M,Chiba,200,g,140,−     ,108,,,,,,,,,,,,,,,
Turnip 3L,Chiba,1,bundle,−     ,194,−     ,Cherry tomato M,Kumamoto,3,Kg,"1,512",−     ,"1,296",,,,,,,,,,,,,,,
Turnip LL,Chiba,1,bundle,216,173,86,Bell pepper M,Ibaraki,150,g,108,97,92,,,,,,,,,,,,,,,
Carrot M,Ibaraki,10,Kg,"1,944","1,836","1,728",Peaman M,Kochi,150,g,108,97,86,,,,,,,,,,,,,,,
Carrot M,Chiba,10,Kg,"1,944","1,728","1,512",Sayaendo A,Kagoshima,1,Kg,"1,944",−     ,"1,620",,,,,,,,,,,,,,,
Burdock M,Aomori,4,Kg,"1,620","1,404",756,Sweet potato L,Ibaraki,5,Kg,−     ,"1,512",−     ,,,,,,,,,,,,,,,
Chinese cabbage,Ibaraki,15,Kg,756,432,432,Sweet potato L,Chiba,5,Kg,−     ,"1,512",−     ,,,,,,,,,,,,,,,
Mizuna,Miyagi,150,g,−     ,86,−     ,Bareisho L,Hokkaido,10,Kg,"1,728",−     ,"1,620",,,,,,,,,,,,,,,
Mizuna,Ibaraki,200,g,−     ,86,−     ,Make-in L,Hokkaido,10,Kg,−     ,"2,376",−     ,,,,,,,,,,,,,,,
Komatsuna,Miyagi,250,g,108,86,86,Taro LL,Niigata,5,Kg,"3,240","1,620",540,,,,,,,,,,,,,,,
Cabbage L,Chiba,10,Kg,−     ,"1,296",−     ,Long LL,Aomori,10,Kg,"3,240","2,700","2,160",,,,,,,,,,,,,,,
Cabbage L,Aichi,10,Kg,"1,404","1,296","1,080",Onion L,Hokkaido,20,Kg,"1,728",−     ,"1,620",,,,,,,,,,,,,,,
spinach,Miyagi,200,g,162,119,97,Raw shiitake,Miyagi,100,g,162,130,108,,,,,,,,,,,,,,,
Small onion,Miyagi,100,g,130,108,97,Nameko S,Miyagi,100,g,54,−     ,43,,,,,,,,,,,,,,,
Long onion L,Miyagi,5,Kg,"2,700","2,700","2,484",Nameko S,Yamagata,100,g,−     ,65,−     ,,,,,,,,,,,,,,,
Long onion L,Saitama,5,Kg,"4,320","3,780","3,456",Enoki,Miyagi,100,g,−     ,38,−     ,,,,,,,,,,,,,,,
Bent green onions,Miyagi,4,Kg,"1,620",−     ,"1,404",Enoki,Miyagi,200,g,77,77,65,,,,,,,,,,,,,,,
Asatsuki,Fukushima,4,Kg,−     ,"2,160",−     ,Enoki,Niigata,200,g,70,−     ,65,,,,,,,,,,,,,,,
Thread mitsuba,Miyagi,60,g,−     ,130,−     ,Bunashimeji,Miyagi,100,g,97,−     ,86,,,,,,,,,,,,,,,
Thread mitsuba,Miyagi,100,g,108,−     ,76,Bunashimeji,Niigata,170,g,130,−     ,108,,,,,,,,,,,,,,,
Crowndaisy M,Miyagi,150,g,162,162,140,Chicken egg AM,Iwate,10,Kg,"1,728",−     ,"1,620",,,,,,,,,,,,,,,
Seri M,Miyagi,100,g,216,162,162,,,,,,,,,,,,,,,,,,,,,,
Chinese chive L,Fukushima,100,g,108,108,86,,,,,,,,,,,,,,,,,,,,,,
Chinese chive M,Kochi,100,g,−     ,130,−     ,,,,,,,,,,,,,,,,,,,,,,
Cerli LL,Aichi,10,Kg,"2,376",−     ,"2,160",,,,,,,,,,,,,,,,,,,,,,
Lettuce L,Kagawa,10,Kg,"3,348",−     ,"3,240",,,,,,,,,,,,,,,,,,,,,,
Bok choy,Miyagi,250,g,108,65,65,,,,,,,,,,,,,,,,,,,,,,
Yukina,Miyagi,200,g,−     ,76,−     ,,,,,,,,,,,,,,,,,,,,,,
Yukina,Miyagi,250,g,97,−     ,76,,,,,,,,,,,,,,,,,,,,,,
Cucumber M,Kochi,5,Kg,"2,160","2,160","1,944",,,,,,,,,,,,,,,,,,,,,,
Cucumber S,Miyazaki,5,Kg,"2,376",−     ,"2,160",,,,,,,,,,,,,,,,,,,,,,
Pumpkin 7 balls,Mexico,10,Kg,"2,160",−     ,"1,728",,,,,,,,,,,,,,,,,,,,,,
Eggplant M,Kochi,5,Kg,"2,916","2,160","1,944",,,,,,,,,,,,,,,,,,,,,,
Tomato M,Miyagi,4,Kg,"1,296",−     ,"1,080",,,,,,,,,,,,,,,,,,,,,,
(2) Fruit,,,,,,,market conditions,,,,,,,,,,,,,,,,,,,,,
Product name,Origin,unit,,High price,Medium price,Low price,Vegetables,,,,,,,,,,,,,,,,,,,,,
140 lemons,America,17,Kg,"7,020",−     ,"6,480",Total vegetable sales volume decreased by 201t,,,,,,,,,,,,,,,,,,,,,
Gure-36 balls,America,17,Kg,"3,780",−     ,"3,240",Seri M,,,Small cheap,,,,,,,,,,,,,,,,,,
Decopon 18 balls,Wakayama,5,Kg,"3,564","3,240","3,024",Sayaendo A,,,10 to 20% cheaper,,,,,,,,,,,,,,,,,,
Another one,Miyagi,270,g,432,378,356,,,,,,,,,,,,,,,,,,,,,,
Tochiotome,Miyagi,270,g,410,410,378,,,,,,,,,,,,,,,,,,,,,,
Earl melon 6 balls,Kochi,9,Kg,−     ,"9,720",−     ,,,,,,,,,,,,,,,,,,,,,,
banana,Philippines,13,Kg,"3,024",−     ,"2,700",,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,Others in general,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,fruit,,,,,,,,,,,,,,,,,,,,,
,,,,,,,Total fruit sales volume decreased by 62t,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,Overall,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,

And thisExcel pdf.png A terrible thing that seems to have output the Excel file to CSV as it is ...

I can do anything, so I tried to normalize it for the time being.

Normalization

processing_csv.py


import re, datetime
import pandas as pd
# from IPython.display import display

url='http://kei008220.webcrow.jp/seika.csv'

r=pd.read_csv(url,encoding='Shift-JIS', header=None)
#pd.set_option('display.max_rows', 100)

#Verification
r=r.dropna(how="all", axis=1)
#display(r)



#I will put it in by appearance
hyoudai=r.iloc[0,0] #title
nichiji=r.iloc[1,0]   #Date and time: Wednesday, January 13, 3rd year of Reiwa Weather(Sunny)Separated later
sijou=r.iloc[2,11]   #Market name
hanbairyo_vegetable=r.iloc[3,2] #Vegetable sales volume
hanbairyo_fruit=r.iloc[3,2] #Fruit sales volume, unit(t)Want to separate later
hanbairyo_full=r.iloc[5,1]+r.iloc[5,2] #Sales volume, unit(t)For the time being, stick them together

hinshu_kugiri=r[r[0].str.contains('(\d)', na=False)].index #Rows (1) and (2) of the market price table by major item[8,45]
vegetable=r.iloc[hinshu_kugiri[0]+1:hinshu_kugiri[1],:]
fruit=r.iloc[hinshu_kugiri[1]+1:,:].dropna(how="all")

#Create a table of vegetables
veg1=vegetable.iloc[:,:7]
veg2=vegetable.iloc[:,7:]
veg2.columns=[0,1,2,3,4,5,6]
veg=pd.concat([veg1,veg2[1:]],ignore_index=True).dropna(how="all") #The two rows were combined vertically. At that time, the index is changed.
veg[2]=veg[2].str.cat(veg[3], na_rep='')  #Since the unit was in two rows, stick them together
veg=veg.drop(columns=3,axis=1)  #Delete unnecessary columns after sticking
veg.columns=veg.iloc[0,:]  #Specify column name
veg=veg.iloc[1:,:] #Delete the row used for the column name
veg['type']='Vegetables'  #Vegetablesだけじゃないので列を追加 
#display(veg)

#Create a table of fruits
fru=fruit.iloc[:,:7]
fru=fru.dropna(how="all", axis=0)
fru.iloc[:,2]=fru.iloc[:,2].str.cat(fru.iloc[:,3], na_rep='')
fru=fru.drop(columns=3,axis=1)
fru.columns=fru.iloc[0,:]
fru=fru.iloc[1:,:]
fru['type']='fruit'

#display(fru)

#Join
seika=pd.concat([veg,fru], ignore_index=True)
for i in ['High price','Medium price','Low price']:
  seika[i]=seika[i].str.replace(',','')

seika['date']=nichiji.split()[0]
seika['weather']=re.search(r'(?<=weather.)[^\)]+',nichiji).group()

#Reiwa → AD

wareki=re.search(r'(\d+)\D+(\d+)\D+(\d+)',nichiji.split()[0]).groups()
year=str(2000+int(wareki[0])+18)
month=wareki[1]
day=wareki[2]
#print(f'{year}/{month}/{day}')
#Date conversion
seika['date']=datetime.datetime.strptime(year+'/'+month+'/'+day,'%Y/%m/%d')

#csv output
filename="seika_"+year+month+day+".csv"
header=['product_name','area','unit','high_price','middle_price','low_price','category','wareki','weather','date']
seika.to_csv(filename,header=header,index=None,encoding='UTF-8')

result

index product_name area unit high_price middle_price low_price category wareki weather date
0 Daikon L Chiba 10Kg 1728 1512 1296 Vegetables January 13, 3rd year of Reiwa Sunny 2021-01-13
1 Daikon L Kanagawa 10Kg 1728 1296 Vegetables January 13, 3rd year of Reiwa Sunny 2021-01-13
2 Turnip 3L Chiba 1 bundle 194 Vegetables January 13, 3rd year of Reiwa Sunny 2021-01-13
3 Turnip LL Chiba 1 bundle 216 173 86 Vegetables January 13, 3rd year of Reiwa Sunny 2021-01-13
4 Carrot M Ibaraki 10Kg 1944 1836 1728 Vegetables January 13, 3rd year of Reiwa Sunny 2021-01-13
... ... ... ... ... ... ... ... ... ... ...
59 Decopon 18 balls Wakayama 5Kg 3564 3240 3024 fruit January 13, 3rd year of Reiwa Sunny 2021-01-13
60 Another one Miyagi 270g 432 378 356 fruit January 13, 3rd year of Reiwa Sunny 2021-01-13
61 Tochiotome Miyagi 270g 410 410 378 fruit January 13, 3rd year of Reiwa Sunny 2021-01-13
62 Earl melon 6 balls Kochi 9Kg 9720 fruit January 13, 3rd year of Reiwa Sunny 2021-01-13
63 banana Philippines 13Kg 3024 2700 fruit January 13, 3rd year of Reiwa Sunny 2021-01-13

Commentary

--Basically brute force ――The place where the cells are merged will be the leftmost column, so once read the column name by number --Specify the data frame with iloc and read each item --Delete all lines with dropna () as NaN --concat () when combining data frames vertically --Since it is better to match the column names, I changed it with columns = []. --Because I am doing it with colaboratory, I could not use japanera to convert the Japanese calendar, so I get the numbers withre.search ()and convert it to the Christian era etc. --From Reiwa, the Christian era is _ + 18_. --When outputting to csv, I changed the header to English in consideration of importing it into Splunk. --This should make INDEXED_EXTRACTIONS = csv work ...

Summary

Originally there was only pdf, so I think I made progress just by giving CSV: sweat:

It's not listed in Sendai City Open Data Portal, and I wonder if I don't care.

I think that visualization is all right once it has been processed so far.

Recommended Posts

Normalize the file that converted Excel to csv as it is.
Read CSV file with Python and convert it to DataFrame as it is
The one that divides the csv file, reads it, and processes it in parallel
Handle CSV that contains the element you want to parse in the file name
From Excel file to exe and release of tool that spits out CSV
Hackathon's experience that it is most important to understand the feelings of the organizer
I tried to touch the CSV file with Python
Read the csv file and display it in the browser
How to convert Json file to CSV format or EXCEL format
I realized that it is nonsense to use the module without thinking because it is convenient.
Output the output result of sklearn.metrics.classification_report as a CSV file
It is easy to execute SQL with Python and output the result in Excel
Python> __init__.py> Required to handle the specified directory as a package (empty file is acceptable)
[VLC] How to deal with the problem that it is not in the foreground during playback
The story that the private key is set to 600 with chmod
Temporary solution to the problem that the exe file created by PyInstaller is mistaken as a Trojan horse virus by AVAST or AVG
Since the Excel date read by pandas.read_excel was a serial number, I converted it to datetime.datetime
[Python] How to scrape a local html file and output it as CSV using Beautiful Soup
I want to specify a file that is not a character string for logrotate, but is it impossible?
Relive the dream you missed! (However, it is limited to Pepper)
How to paste a CSV file into an Excel file using Pandas
Memo that refers to dictionary values as attributes (reinventing the wheel)
To make sure that the specified key is in the specified bucket in Boto 3
How easy is it to synthesize a drug on the market?
After calling the Shell file on Python, convert CSV to Parquet.
[Python] How to store a csv file as one-dimensional array data
Ruby on Windows Clear file management that is difficult to understand
The story that the guard was confined when the laboratory was converted to IoT
Get the formula in an excel file as a string in Python
How to input a character string in Python and output it as it is or in the opposite direction.
[Python] Write to csv file with Python
When you want to use it as it is when using it with lambda memo
How to put a line number at the beginning of a CSV file
Convert the spreadsheet to CSV and upload it to Cloud Storage with Cloud Functions
I want to identify the alert email. --Is that x a wildcard? ---
What is a C language library? What is the information that is open to the public?
Python script that outputs all records of Oracle table to CSV file
Find out the name of the method that called it from the method that is python
Is it possible to extract the person's profile information from the chat log?
The story that it turns blue when the data read by Pillow is converted so that it can be handled by OpenCV
I thought it was the same as python, and I was addicted to the problem that the ruby interpreter did not start.
I tried to get the information of the .aspx site that is paging using Selenium IDE as non-programming as possible.