[PYTHON] [Pandas recommendation] An example of dividing an Excel sheet with hundreds of thousands of lines into n CSVs and assigning serial numbers to each file from 1.

Yeah, let's use Pandas (python)

It's 2020, so I'm going to stop VBA and use Pandas & python to manage Excel in the company. The following is for those who have read the introductory Python article, so detailed explanations are omitted. There is a lot of information about Pandas and python on the web, so if you're a good boy who wants to try Pandas & python, if you have a problem, check it out.

(Additional note for students?) While it was said that compliance would be strengthened ...

In a so-called user company where many people know the name in TV commercials, the query in Heisei's ancient Acesss issues a query to some database and then files containing personal information. Output to Excel ... After that, there is office work such as cutting and pasting locally and using it for some other purpose. In the case of listed companies, it is often said that such work is compliance-wise (an example of compliance-like request is to clarify who has seen personal information). In such a case, if you hurry up and run Access on Windows on a server such as AWS and complete the processing on the server side of the CSV output by scheduled execution, compliance is okay for the time being. In AWS, I run a script in Systems Manager and put a file in S3 ... etc. However, in a company with old Access, there is a VB script attached to it, VBA, which is painful, so it is said that it should be replaced with Pandas & python.

[1] Preparatory work: Excel specific sheet Manual preparation

1-1. Install the library so that you can read Excel with pandas

$ pip install xlwt
$ pip install openpyxl

If you are wondering what pnp is, please google various things from around nkmk here, the iron plate that you do not remove in the Japanese explanation of Pandas & python:

https://note.nkmk.me/python-pandas-to-excel/

1-2. Read the nth sheet of Excel with pandas.

It would be convenient if you could read any one of the Excel sheets quickly. With Pandas, you can quickly convert an Excel sheet to something called DataFrame (df).

n=0 #In pandas, 0 from the right of the Excel sheet,1,2...Specify.
#Example of reading the nth sheet with the file name target
df= pd.read_excel(target, sheet_name=n)

, The Excel sheet is read as a DataFrame (df) where you can extract only a part of it, or delete or add a specific column.

[2] Cut out a part of the Dataframe and renumber the serial numbers from 1.

2-1. Example of serial numbering function

Like this. Note that for serial numbers, Excel is a culture that starts at 1 and pandas is a culture that starts at 0. Please specify the column name you want to assign serial numbers as appropriate.

# new_index_col column(The default value is"Serial number")に1からSerial numberを付与する関数 
def getPartDf(df,start, step=5, debug= True, new_index_col="Serial number"):
    dfx = df[start:start+step].reset_index()
    dfx[new_index_col] = dfx.index +1 #The serial number is given again. If you set this to something other than 1,
    dfy = dfx.set_index(new_index_col)
    return dfy if debug else dfy.drop("index", axis=1)

If you use a DataFrame read from a ridiculous excel file (like me), you'll want to make sure you've read it, so we also have a debug mode that keeps the original index (Note).

2-2. Example of calling a serial numbering function in a for loop

Use the range function to rotate the loop at regular steps (step_num).

core_name = """File after cutting out"""
step_num= 10000
debug = False

# step_Processing to cut out Excel lines in separate files by num.
for i in range(0, len(df), step_num):
    print(f"{i}that's all{i+step_num}Less than")
    savedDf = getPartDf(df,i, step_num, debug)
    if (debug):
        print(savedDf.head())
    fn = f"{out}/{core_name}{i}_{i+step_num -1}.csv"
    savedDf.to_csv(fn, encoding="cp932")

Fractions are handled nicely by pandas.

Execution example: When processing an Excel sheet with hundreds of thousands of rows

If debug = False

Screen output:

0 or more and less than 10000
10000 or more and less than 20000
20000 or more and less than 30000
30,000 or more and less than 40,000
40,000 or more and less than 50,000
50,000 or more and less than 60,000
60,000 or more and less than 70,000
(Abbreviation)
len 836543

Part of the output excel file (what if the extension is not displayed ...): 切り出し後.PNG

By the way, it takes a lot of time to process with a big Excel partner. This article is also said to have been written during that waiting time. ..

Recommended Posts

[Pandas recommendation] An example of dividing an Excel sheet with hundreds of thousands of lines into n CSVs and assigning serial numbers to each file from 1.
Use Pandas to write only the specified lines of the data frame to an excel file
How to paste a CSV file into an Excel file using Pandas