How should you handle creating / updating an Excel sheet (A) and acquiring / processing the data (CSV, etc.) to be added to that sheet (B)?
There are the following processing patterns.
- ① Add a sheet / data to a new workbook object, then save under a new name = create new
- ② Read an existing workbook, add data, then save under a new name = use as a template
- ③ Read an existing workbook, add data, then save under the same name = update the existing sheet
Generally, the openpyxl library is sufficient. (There are other, faster libraries, but they come with restrictions, such as being read-only: they cannot write to / update the workbook object.)
-① "After adding data to the object, save it as a name = create new" Library import
from openpyxl import Workbook
wb = Workbook()
-② "Read an existing book, add data, and give a new name = use template" -③ "After reading the existing book and adding the data, give it the same name = update the existing sheet"
import openpyxl as px

wb = px.load_workbook(self.template_filepath)
ws = wb.worksheets[sheetnum]   # access by index
# or: ws = wb[self.sheetname]  # access by sheet name
- 2-1. Data acquisition
- 2-2. Data processing
- 2-3. Assign the data sequentially to the worksheet objects (wb / ws / cell); later, give the object a file name and actually save it
Save the workbook object, with the data added to its worksheets, out to an actual file:
- If you specify an existing file path, this is pattern ③
- If you specify a new file path, this is pattern ① or ②
wb.save(self.output_filepath)
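Putting these steps together, a minimal end-to-end sketch of the three patterns (the file paths here are hypothetical):

from openpyxl import Workbook, load_workbook

# ① Create new: build a fresh workbook object, add data, save under a new name
wb = Workbook()
ws = wb.active
ws.append(['id', 'name'])  # add a row of data
wb.save('new_book.xlsx')   # hypothetical path

# ② / ③ Template or update: load an existing workbook, add data, save
wb = load_workbook('template.xlsx')  # hypothetical path
ws = wb.worksheets[0]
ws.append([1, 'foo'])
wb.save('output.xlsx')  # a new path gives ② (template); the same path gives ③ (update)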
The implementation patterns include the following:

- Retrieve the data with the standard method, format it, then add it to the worksheet objects sequentially
- Acquire the data with a specific library and batch-export it with that same library
- Acquire the data with a specific library, format it, then add it to the worksheet objects sequentially
There are the following patterns for acquiring the data:

- ① Use a normal with open
- ② Use a library specialized for large-scale data processing
Use this for preprocessing, when the data contains extra records or requires aggregation. With such a library, data is read from CSV etc. into a structure called a data frame, and data held as a data frame is easy to aggregate and calculate over. If you only want to process the acquired data and export it, rather than update a worksheet or use an existing file as a template, this alone is enough.
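As a minimal sketch of that process-and-export flow with pandas (the file and column names are hypothetical):

import pandas as pd

# Read the CSV into a data frame, aggregate, and export, without touching any Excel objects
df = pd.read_csv('sales.csv')                     # hypothetical file
summary = df.groupby('category')['amount'].sum()  # hypothetical columns
summary.to_csv('summary.csv')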
Note that memory often becomes a problem, so see the following: https://www.sejuku.net/blog/74447 Related: https://qiita.com/gangun/items/f97c40f5540f8011c252
with open(self.source_filepath, 'r', encoding='utf-8') as file:
    for line in file:
        row = line.rstrip('\n').split(',')  # naive split; the csv module handles quoting properly
        ws.append(row)  # appending to the worksheet overlaps with the next topic
- How to skip the first row (the header row)
https://teratail.com/questions/107027
with open(self.source_filepath, 'r', encoding='utf-8') as file:
    next(file)  # skip the header row
    for line in file:
        ...
With this method, the data is first obtained as a type called a data frame. I usually use pandas. Reading with dask is faster, and parallel processing with multiprocessing seems fastest of all, but that is also the order in which they become less familiar and easier to stumble over. Details such as aggregation methods are diverse and mostly omitted here. Also, errors occur if you are not careful about data types (a dtype example follows the reading code below).
import pandas as pd
import dask.dataframe as dd
You can get data as a data frame from the data source with code like the following:
df = pd.read_csv(self.source_filepath, encoding=self.source_file_encoding)
df = dd.read_csv(self.source_filepath, encoding=self.source_file_encoding)  # dask version (dask has no read_excel; use pd.read_excel for Excel sources)
- Note: how to skip the first row (the column labels)
We can use the header and skiprows option to tell Pandas not to use the column labels already in our spreadsheet. Note that both header=None and skiprows=1 must be set in this case, as the first row of actual data will be skipped if the header option is missing. https://wellsr.com/python/python-pandas-read_excel-to-import-excel-file-into-dataframe/
df = pd.read_csv(self.source_filepath, encoding=self.source_file_encoding, header=None, skiprows=1)
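Regarding the earlier caveat about data types, read_csv also accepts a dtype option, so one way to avoid type errors is to fix the types at read time (the column names below are hypothetical):

df = pd.read_csv(self.source_filepath,
                 encoding=self.source_file_encoding,
                 dtype={'code': str, 'amount': float})  # hypothetical columns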
- When acquiring sequentially (to reduce memory usage): the pandas chunksize option
- Specifying chunksize=*** lets you read the data in chunks of the specified number of rows
While I was at it, I also experimented with different pandas chunk sizes. In conclusion, for 10 million rows, processing finished fastest at around chunksize=30,000. https://qiita.com/gangun/items/17155a8b59079e37b075
chunks = pd.read_csv(self.source_filepath, encoding=self.source_file_encoding, chunksize=30000)
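The returned object is an iterator, and each element it yields is an ordinary DataFrame, so a minimal usage sketch looks like this (appending chunks to a worksheet is shown further below):

total_rows = 0
for chunk in chunks:          # each chunk is a plain DataFrame of up to 30,000 rows
    total_rows += len(chunk)  # stand-in for real per-chunk processing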
- When acquiring with parallel / distributed processing (to improve memory efficiency): dask, or multiprocessing (installation can be quirky)
reference: https://qiita.com/hoto17296/items/586dc01aee69cd4915cc https://qiita.com/simonritchie/items/e174f243bc03fb25462e https://qiita.com/simonritchie/items/1ce3914eb5444d2157ac
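A minimal dask sketch along the lines of those references (the file glob and column names are hypothetical):

import dask.dataframe as dd

# dask reads lazily across partitions in parallel; compute() materializes the result
df = dd.read_csv('data_*.csv')                       # hypothetical glob of CSV files
result = df.groupby('key')['value'].sum().compute()  # hypothetical columns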
- Case: getting data from a workbook object
reference: https://soudegesu.com/post/python/cell-excel-with-openpyxl/
・Reading by row / column
for col in ws.iter_cols(min_row=2):
    ...
for row in ws.iter_rows(min_row=2):
    ...
・Reading individual cells (a minimal sketch below)
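value = ws['A1'].value                  # by cell coordinate
value = ws.cell(row=1, column=1).value  # by 1-based row / column index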
This is where it intersects with "A. Creating / updating the sheet", because the acquired / processed data must be added to the sheet / cell objects created by openpyxl.
The data here includes not only the actual cell values but also data types and formatting (layout); I won't go into detail on those here. When you search around, many explanations process row by row and then cell by cell, and even when a column is specified, processing ultimately happens per cell. As for layout, when an existing workbook (template) is loaded with openpyxl, the original formatting can sometimes be preserved as a whole if conditional formatting is set on the sheet. This makes it possible, for example, to color only the subtotal rows.
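Such a conditional-formatting rule can also be set from openpyxl itself. A minimal sketch that colors rows whose column A says "subtotal" (the range, formula text, and color are hypothetical):

from openpyxl.formatting.rule import FormulaRule
from openpyxl.styles import PatternFill

# Color every row in the range whose column A equals "subtotal"
fill = PatternFill(start_color='FFFFCC00', end_color='FFFFCC00', fill_type='solid')
ws.conditional_formatting.add(
    'A1:D100',  # hypothetical target range
    FormulaRule(formula=['$A1="subtotal"'], fill=fill),
)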
- Case: data acquired as chunks, added row by row
for chunk in chunks:
    rows = chunk.values
    for row_data in rows:
        ws.append(row_data.tolist())
- Case: data acquired as a data frame, added row by row
from openpyxl.utils.dataframe import dataframe_to_rows

for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
- Case: adding cell by cell
Example 1: https://gist.github.com/bisco/a65e71c8ba45337f91174e6ae3c139f9
Example 2: line breaks within a cell https://www.relief.jp/docs/openpyxl-line-feed.html
import openpyxl

ws['A1'].value = 'AAA\nBBB'
ws['A1'].alignment = openpyxl.styles.Alignment(wrapText=True)
Example 3: setting a number format per cell
rows_data = df.values.tolist()
for i, row_data in enumerate(rows_data):
    ws.append(row_data)  # append first so the cells exist (assumes an initially empty sheet)
    for j, cell_data in enumerate(row_data):
        if isinstance(cell_data, int):
            ws.cell(row=i+1, column=j+1).number_format = '#'  # '#' is a placeholder format string
- Case: data held as a matrix [[...], [...]], added row by row
for row in matrix:
ws.append(row)
- Case: adding to a specific column
reference: https://medium.com/ns%E9%9B%91%E8%A8%98%E5%B8%B3/6-%E5%88%97%E3%82%92%E8%BF%BD%E5%8A%A0%E3%81%97%E3%81%A6%E6%95%B0%E5%BC%8F%E3%82%92fill%E3%81%99%E3%82%8B-9ce61f29baa8
ws['B2'] = '=vlookup(A2,C:C,1,false)'
Processing the cells in a specific column:
for row, cellObj in enumerate(list(ws.columns)[col_num]):  # col_num: column index
    n = '=vlookup(A%d,C:C,1,false)' % (row + 1)
    cellObj.value = n
- Case: outputting the data frame as is (create new or overwrite)
df.reset_index().to_csv(self.output_filepath, encoding=self.output_file_encoding, index=False)
df.reset_index().to_excel(self.output_filepath, encoding=self.output_file_encoding, index=False)  # note: newer pandas versions removed the encoding argument from to_excel
# reset_index() is optional
There are more articles that were helpful, so I plan to add them when I have spare time.