How should you handle creating / updating an Excel sheet (A) and acquiring / processing the data (CSV, etc.) to be added to that sheet (B)?
There are the following processing patterns.
- ① Add a sheet / data to a new workbook object, then save under a new name = create new
- ② Read an existing workbook, add data, then save under a new name = use as a template
- ③ Read an existing workbook, add data, then save under the same name = update the existing sheet
Generally, the openpyxl library is sufficient. (There are other, faster libraries, but they come with restrictions, such as being read-only: they cannot write to / update the workbook object.)
-① "After adding data to the object, save it as a name = create new" Library import
from openpyxl import Workbook
wb = Workbook()
-② "Read an existing book, add data, and give a new name = use template" -③ "After reading the existing book and adding the data, give it the same name = update the existing sheet"
import openpyxl as px

wb = px.load_workbook(self.template_filepath)
ws = wb.worksheets[sheetnum]   # access by index
# or: ws = wb[self.sheetname]  # access by sheet name
- 2-1. Data acquisition
- 2-2. Data processing
- 2-3. Assign the data sequentially to the worksheet objects (wb / ws / cell); later, give the object a file name and actually save it
Save the workbook object, with the data added to its worksheets, out to an actual file:
- If you specify an existing file path, this is pattern ③
- If you specify a new file path, this is pattern ① or ②
wb.save(self.output_filepath)
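Putting these steps together, a minimal end-to-end sketch of the three patterns (the file paths here are hypothetical):

from openpyxl import Workbook, load_workbook

# ① Create new: build a fresh workbook object, add data, save under a new name
wb = Workbook()
ws = wb.active
ws.append(['id', 'name'])  # add a row of data
wb.save('new_book.xlsx')   # hypothetical path

# ② / ③ Template or update: load an existing workbook, add data, save
wb = load_workbook('template.xlsx')  # hypothetical path
ws = wb.worksheets[0]
ws.append([1, 'foo'])
wb.save('output.xlsx')  # a new path gives ② (template); the same path gives ③ (update)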
The implementation patterns include the following:

- Retrieve the data with the standard method, format it, then add it to the worksheet objects sequentially
- Acquire the data with a specific library and batch-export it with that same library
- Acquire the data with a specific library, format it, then add it to the worksheet objects sequentially
There are the following patterns for acquiring the data:

- ① Use a normal with open
- ② Use a library specialized for large-scale data processing
Use this for preprocessing, when the data contains extra records or requires aggregation. With such a library, data is read from CSV etc. into a structure called a data frame, and data held as a data frame is easy to aggregate and calculate over. If you only want to process the acquired data and export it, rather than update a worksheet or use an existing file as a template, this alone is enough.
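As a minimal sketch of that process-and-export flow with pandas (the file and column names are hypothetical):

import pandas as pd

# Read the CSV into a data frame, aggregate, and export, without touching any Excel objects
df = pd.read_csv('sales.csv')                     # hypothetical file
summary = df.groupby('category')['amount'].sum()  # hypothetical columns
summary.to_csv('summary.csv')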
Note that memory often becomes a problem, so see the following: https://www.sejuku.net/blog/74447 Related: https://qiita.com/gangun/items/f97c40f5540f8011c252
with open(self.source_filepath, 'r', encoding='utf-8') as file:
    for line in file:
        row = line.rstrip('\n').split(',')  # naive split; the csv module handles quoting properly
        ws.append(row)  # appending to the worksheet overlaps with the next topic
- How to skip the first row (the header row)
https://teratail.com/questions/107027
with open(self.source_filepath, 'r', encoding='utf-8') as file:
    next(file)  # skip the header row
    for line in file:
        ...
With this method, the data is first obtained as a type called a data frame. I usually use pandas. Reading with dask is faster, and parallel processing with multiprocessing seems fastest of all, but that is also the order in which they become less familiar and easier to stumble over. Details such as aggregation methods are diverse and mostly omitted here. Also, errors occur if you are not careful about data types (a dtype example follows the reading code below).
import pandas as pd
import dask.dataframe as dd
You can get data as a data frame from the data source with code like the following:
df = pd.read_csv(self.source_filepath, encoding=self.source_file_encoding)
df = dd.read_csv(self.source_filepath, encoding=self.source_file_encoding)  # dask version (dask has no read_excel; use pd.read_excel for Excel sources)
- Note: how to skip the first row (the column labels)
We can use the header and skiprows option to tell Pandas not to use the column labels already in our spreadsheet. Note that both header=None and skiprows=1 must be set in this case, as the first row of actual data will be skipped if the header option is missing. https://wellsr.com/python/python-pandas-read_excel-to-import-excel-file-into-dataframe/
df = pd.read_csv(self.source_filepath, encoding=self.source_file_encoding, header=None, skiprows=1)
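Regarding the earlier caveat about data types, read_csv also accepts a dtype option, so one way to avoid type errors is to fix the types at read time (the column names below are hypothetical):

df = pd.read_csv(self.source_filepath,
                 encoding=self.source_file_encoding,
                 dtype={'code': str, 'amount': float})  # hypothetical columns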
- When acquiring sequentially (to reduce memory usage): the pandas chunksize option
- Specifying chunksize=*** lets you read the data in chunks of the specified number of rows
While I was at it, I also experimented with different pandas chunk sizes. In conclusion, for 10 million rows, processing finished fastest at around chunksize=30,000. https://qiita.com/gangun/items/17155a8b59079e37b075
chunks = pd.read_csv(self.source_filepath, encoding=self.source_file_encoding, chunksize=30000)
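The returned object is an iterator, and each element it yields is an ordinary DataFrame, so a minimal usage sketch looks like this (appending chunks to a worksheet is shown further below):

total_rows = 0
for chunk in chunks:          # each chunk is a plain DataFrame of up to 30,000 rows
    total_rows += len(chunk)  # stand-in for real per-chunk processing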
- When acquiring with parallel / distributed processing (to improve memory efficiency): dask, or multiprocessing (installation can be quirky)
reference: https://qiita.com/hoto17296/items/586dc01aee69cd4915cc https://qiita.com/simonritchie/items/e174f243bc03fb25462e https://qiita.com/simonritchie/items/1ce3914eb5444d2157ac
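A minimal dask sketch along the lines of those references (the file glob and column names are hypothetical):

import dask.dataframe as dd

# dask reads lazily across partitions in parallel; compute() materializes the result
df = dd.read_csv('data_*.csv')                       # hypothetical glob of CSV files
result = df.groupby('key')['value'].sum().compute()  # hypothetical columns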
- Case: getting data from a workbook object
reference: https://soudegesu.com/post/python/cell-excel-with-openpyxl/
・Reading by row / column
for col in ws.iter_cols(min_row=2):
    ...
for row in ws.iter_rows(min_row=2):
    ...
・Reading individual cells (a minimal sketch below)
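value = ws['A1'].value                  # by cell coordinate
value = ws.cell(row=1, column=1).value  # by 1-based row / column index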
This is where it intersects with "A. Creating / updating the sheet", because the acquired / processed data must be added to the sheet / cell objects created by openpyxl.
The data here includes not only the actual cell values but also data types and formatting (layout); I won't go into detail on those here. When you search around, many explanations process row by row and then cell by cell, and even when a column is specified, processing ultimately happens per cell. As for layout, when an existing workbook (template) is loaded with openpyxl, the original formatting can sometimes be preserved as a whole if conditional formatting is set on the sheet. This makes it possible, for example, to color only the subtotal rows.
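Such a conditional-formatting rule can also be set from openpyxl itself. A minimal sketch that colors rows whose column A says "subtotal" (the range, formula text, and color are hypothetical):

from openpyxl.formatting.rule import FormulaRule
from openpyxl.styles import PatternFill

# Color every row in the range whose column A equals "subtotal"
fill = PatternFill(start_color='FFFFCC00', end_color='FFFFCC00', fill_type='solid')
ws.conditional_formatting.add(
    'A1:D100',  # hypothetical target range
    FormulaRule(formula=['$A1="subtotal"'], fill=fill),
)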
- Case: data acquired as chunks, added row by row
for chunk in chunks:
    rows = chunk.values
    for row_data in rows:
        ws.append(row_data.tolist())
- Case: data acquired as a data frame, added row by row
from openpyxl.utils.dataframe import dataframe_to_rows

for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)
- Case: adding cell by cell
Example 1: https://gist.github.com/bisco/a65e71c8ba45337f91174e6ae3c139f9
Example 2: line breaks within a cell https://www.relief.jp/docs/openpyxl-line-feed.html
import openpyxl

ws['A1'].value = 'AAA\nBBB'
ws['A1'].alignment = openpyxl.styles.Alignment(wrapText=True)
Example 3: setting a number format per cell
rows_data = df.values.tolist()
for i, row_data in enumerate(rows_data):
    ws.append(row_data)  # append first so the cells exist (assumes an initially empty sheet)
    for j, cell_data in enumerate(row_data):
        if isinstance(cell_data, int):
            ws.cell(row=i+1, column=j+1).number_format = '#'  # '#' is a placeholder format string
- Case: data held as a matrix [[...], [...]], added row by row
for row in matrix:
ws.append(row)
- Case: adding to a specific column
reference: https://medium.com/ns%E9%9B%91%E8%A8%98%E5%B8%B3/6-%E5%88%97%E3%82%92%E8%BF%BD%E5%8A%A0%E3%81%97%E3%81%A6%E6%95%B0%E5%BC%8F%E3%82%92fill%E3%81%99%E3%82%8B-9ce61f29baa8
ws['B2'] = '=vlookup(A2,C:C,1,false)'
Processing the cells in a specific column:
for row, cellObj in enumerate(list(ws.columns)[col_num]):  # col_num: column index
    n = '=vlookup(A%d,C:C,1,false)' % (row + 1)
    cellObj.value = n
- Case: outputting the data frame as is (create new or overwrite)
df.reset_index().to_csv(self.output_filepath, encoding=self.output_file_encoding, index=False)
df.reset_index().to_excel(self.output_filepath, encoding=self.output_file_encoding, index=False)  # note: newer pandas versions removed the encoding argument from to_excel
# reset_index() is optional
There are more articles that were helpful, so I plan to add them when I have spare time.