[PYTHON] Batch processing forms with Azure Form Recognizer

There is a service called Azure Form Recognizer. https://azure.microsoft.com/ja-jp/services/cognitive-services/form-recognizer/

It reads forms well and extracts the target data from them. Because it also offers an API, I wrote a Python script that processes multiple forms in one run: https://github.com/yosukearaiMS13/formrecognizerbatch/blob/master/fy.py

The contents of the script and how to use it are explained below.

The contents of the script

The script extends the sample in the documentation: https://docs.microsoft.com/ja-jp/azure/cognitive-services/form-recognizer/quickstarts/python-labeled-data?tabs=v2-0

The script consists of four sections: https://github.com/yosukearaiMS13/formrecognizerbatch/blob/master/fy.py

fr.py



# Configurations: various setting parameters

# Post analysis target PDF section
## Post all the forms to be analyzed to Form Recognizer in one pass

# Get analyze results section
## Get the analysis results (including the extracted data) for the forms posted earlier

# CSV output section for the extraction results
## Output the extraction results, removing extra whitespace and replacing unreliable values
## (if the confidence is below the threshold, the extracted value is discarded and the confidence score is output in [] instead)
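The two-phase flow in this outline (post everything first, then collect the results) might be sketched as follows. This is a minimal sketch, not the code in fr.py: `run_batch`, `post_form`, `get_result`, `wait_sec`, and `max_tries` are hypothetical names, and the sketch assumes that the result JSON carries a `status` field that eventually becomes `"succeeded"` or `"failed"`.

```python
import time
from typing import Callable, Dict, Iterable


def run_batch(
    paths: Iterable[str],
    post_form: Callable[[str], str],     # hypothetical: posts one file, returns a URL to poll
    get_result: Callable[[str], dict],   # hypothetical: fetches the result JSON from that URL
    wait_sec: float = 1.0,
    max_tries: int = 10,
) -> Dict[str, dict]:
    """Post every form first, then collect each analysis result."""
    # Phase 1: submit all files; remember where each result will appear.
    pending = {path: post_form(path) for path in paths}

    # Phase 2: poll each URL until the service reports a final status.
    results = {}
    for path, result_url in pending.items():
        for _ in range(max_tries):
            body = get_result(result_url)
            if body.get("status") in ("succeeded", "failed"):  # assumed status values
                results[path] = body
                break
            time.sleep(wait_sec)
        else:
            raise TimeoutError(f"No final status for {path}")
    return results
```

Passing the two network operations in as callables keeps the flow easy to try out with stub functions before pointing it at the real service.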

The "Get analyze results" and "CSV output" sections parse the JSON returned by Form Recognizer. For the JSON format, see: https://github.com/Azure-Samples/cognitive-services-REST-api-samples/blob/master/curl/form-recognizer/Invoice_1.pdf.ocr.json
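As an illustration of that parsing step, here is a minimal sketch of applying the confidence threshold to one result. It assumes the labeled fields sit under `analyzeResult.documentResults[0].fields`, each with `text` and `confidence` keys; check this against the JSON your model actually returns. `extract_fields` is a hypothetical helper, and collapsing runs of whitespace is only one possible reading of "remove extra whitespace".

```python
import re


def extract_fields(result: dict, threshold: float) -> dict:
    """Map each label to its cleaned text, or to "[confidence]" when unreliable."""
    # Assumed layout of the v2.0 result for a model trained with labels.
    fields = result["analyzeResult"]["documentResults"][0]["fields"]
    row = {}
    for label, field in fields.items():
        if field is None:  # label exists in the model but was not found on this form
            row[label] = ""
            continue
        confidence = field.get("confidence", 0.0)
        if confidence < threshold:
            # Below the threshold: output the confidence score instead of the value.
            row[label] = f"[{confidence}]"
        else:
            # Collapse runs of whitespace and trim both ends.
            row[label] = re.sub(r"\s+", " ", field.get("text") or "").strip()
    return row
```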

The output CSV has the following format.
--First column: file name of the analyzed form
--Second and subsequent columns: every label (tag) defined in the analysis model, with the corresponding extracted value
(Screenshot: csv.png)
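A CSV with this column layout could be produced with the standard `csv` module, roughly as below. This is a sketch under assumptions: `write_results` is a hypothetical helper, and the header row (with `file` as the first column name) is my own choice, since the article does not say whether the real output has one.

```python
import csv
import io


def write_results(rows: dict, out) -> None:
    """rows maps a file name to {label: value}; out is a writable text stream."""
    # Collect labels in first-seen order so the columns stay stable.
    labels = []
    for fields in rows.values():
        for label in fields:
            if label not in labels:
                labels.append(label)

    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file"] + labels)  # header row: an assumption, not from fr.py
    for name, fields in rows.items():
        # A label missing from one form becomes an empty cell.
        writer.writerow([name] + [fields.get(label, "") for label in labels])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_results({"invoice_6.pdf": {"Total": "100", "Date": "[0.5]"}}, buffer)
    print(buffer.getvalue())
```

When writing to a real file on Windows, open it with `newline=""` so that the `csv` module controls the line endings.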

The API used in each section is as follows.
--Post Analysis target pdf: Analyze Form
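For reference, a call to Analyze Form for a custom model in v2.0 might be built as in the sketch below. It uses only the standard library; the URL path and the `Ocp-Apim-Subscription-Key` and `Operation-Location` headers follow the v2.0 quickstart as I understand it, so verify them against the documentation for your version. `build_analyze_request` and `post_form` are hypothetical helper names, not functions from fr.py.

```python
import urllib.request


def build_analyze_request(
    endpoint: str, apim_key: str, model_id: str, pdf_bytes: bytes
) -> urllib.request.Request:
    """Build (but do not send) the POST request for one PDF."""
    # Assumed v2.0 path for analyzing a form with a custom model.
    url = f"{endpoint.rstrip('/')}/formrecognizer/v2.0/custom/models/{model_id}/analyze"
    headers = {
        "Content-Type": "application/pdf",
        "Ocp-Apim-Subscription-Key": apim_key,
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="POST")


def post_form(request: urllib.request.Request) -> str:
    """Send the request and return the URL to poll for the analysis result."""
    with urllib.request.urlopen(request) as response:
        # The service is expected to answer 202 with the result URL in this header.
        return response.headers["Operation-Location"]
```

Separating request construction from sending makes it possible to inspect the URL and headers without touching the network.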

How to use the script

1. Environment

Windows 10 Enterprise, Python 3.8.5; any IDE will do

2. Data extraction preparation

(* For the steps from the prerequisite work through data extraction preparation 1, this Qiita article is also helpful)

--Prerequisite work: do the following first
  --[Create Form Recognizer Resource](https://docs.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/label-tool?tabs=v2-0#create-a-form-recognizer-resource)
  --Create Azure blob storage (create a storage account -> create a container)
  --[Azure blob settings (create a Shared Access Signature)](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-storage-explorer#work-with-shared-access-signatures) (the Storage Explorer menu in the Azure Portal is convenient for this)

--Data extraction preparation 1 (needed only the first time)
  --Store the training data for model creation in Azure blob storage: place at least 5 files (invoice_1.pdf to invoice_5.pdf in this case) as follows (xx.json files are created later, so ignore them here)

  --Label (tagging) tool settings:

fr.py


## Configurations
endpoint = r"https://xxxxx.cognitiveservices.azure.com/"
apim_key = "xxxxx"
model_id = "xxxxx"
sourceDir = r"C:\xxxxx\*"
confidence_setting = 0.9  # 0 to 1. Extracted values with confidence below this are not adopted

--endpoint: Form Recognizer endpoint
--apim_key: Form Recognizer key 1 or 2
![Image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/327770/21e50af5-135e-a847-6040-233601fdfbbf.png)
--sourceDir: full path to the location of the form files to be analyzed
--confidence_setting: a value from 0 to 1 (* by design, when the confidence of an extracted value is below this setting, the script discards the value and outputs the confidence score in [] instead)
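Because sourceDir ends in a `*` wildcard, one plausible way to turn it into a list of files is `glob`. The sketch below shows that, plus a range check for confidence_setting. Both `list_forms` and `check_confidence` are hypothetical helpers, and whether fr.py expands the pattern this way is an assumption.

```python
import glob
import os


def list_forms(source_dir_pattern: str) -> list:
    """Return the files matching the pattern in sorted order, skipping directories."""
    return sorted(p for p in glob.glob(source_dir_pattern) if os.path.isfile(p))


def check_confidence(value: float) -> float:
    """Reject thresholds outside the documented 0 to 1 range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence_setting must be between 0 and 1, got {value}")
    return value
```

For example, `list_forms(sourceDir)` with the setting above would return every file directly under the configured folder.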

--Data extraction preparation 2 (needed every time a label is added or modified)
  --Label the loaded training data (forms) with the labeling (tagging) tool (https://fott.azurewebsites.net/). When you are done, train to generate a model

fr.py


## Configurations
endpoint = r"https://xxxxx.cognitiveservices.azure.com/"
apim_key = "xxxxx"
model_id = "xxxxx"
sourceDir = r"C:\xxxxx\*"
confidence_setting = 0.9  # 0 to 1. Extracted values with confidence below this are not adopted

--model_id: set the Model ID obtained above

3. Data extraction

--Place the form files to be analyzed in sourceDir
--Run fr.py
--The data extraction result CSV is output to the same folder as the script

4. Constraints, etc.

--For the file format of the training and analysis target forms, I have only tried PDF
--The script is based on the current version of Form Recognizer, v2.0. To use it with other versions, the API URL will probably need to be changed and the script adapted to any changes in the JSON format that Form Recognizer returns
