[PYTHON] Automatic scraping of reCAPTCHA site every day (3/7: xls file processing)

  1. Requirement definition ~ Python environment construction
  2. Create a site scraping mechanism
  3. **Process the downloaded file (xls) to create the final product (csv)**
  4. Download files from S3 / upload files to S3
  5. Implement 2captcha
  6. Allow it to be launched in a Docker container
  7. Register with AWS Batch

File operations

Up to the previous article, the file was downloaded with Selenium. This article describes how to read that file, process it, and save the result as a csv file.

Get file list

When you want every file in a specific folder that matches a specific pattern, glob is convenient.

# Get the file list with a wildcard pattern (glob patterns are not regular expressions)
import glob
file_list = glob.glob(dl_dir + '/*')
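
Since the download folder may hold more than the xls files of interest, it can help to narrow the pattern. The following is a minimal sketch, not the author's code: the function name `list_downloaded_files` and the default pattern `*.xls` are assumptions made for illustration.

```python
import glob
import os


def list_downloaded_files(dl_dir, pattern="*.xls"):
    """Return the sorted paths in dl_dir whose names match a glob pattern."""
    # glob patterns are shell-style wildcards (*, ?, [abc]), not regular expressions.
    # os.path.join builds the search path without hand-written separators.
    return sorted(glob.glob(os.path.join(dl_dir, pattern)))
```

Sorting makes the processing order predictable, because glob itself does not promise any particular order. A loop such as `for file_name in list_downloaded_files(dl_dir):` would then visit each downloaded file in turn.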

Working with Excel files

There are several Python libraries for working with Excel files, but it is convenient to settle on one. I use xlrd.

# Working with Excel files
import xlrd

wb = xlrd.open_workbook(file_name)  # open xls
sheet_names = wb.sheet_names()  # Get a list of sheet names
sheet = wb.sheet_by_name(sheet_names[1])  # second sheet (index 1)
# start_rowx=1 skips the header row, so pop(0) is not needed
values2 = sheet.col_values(2, start_rowx=1)
values5 = sheet.col_values(5, start_rowx=1)
result = []
# word, someFunction2 and someFunction5 are not shown in this article
for value2, value5 in zip(values2, values5):
    obj = [
        word,
        someFunction2(value2),
        someFunction5(value5)
    ]
    result.append(obj)
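
The article does not show what `someFunction2` and `someFunction5` do. Purely as an illustration of the kind of per-cell conversion that is often needed, here is a hypothetical helper named `clean_cell`. It relies on the fact that xlrd returns numeric cells as Python floats, so a cell holding 12 is read as 12.0; whether the real functions do anything similar is an assumption.

```python
def clean_cell(value):
    """Normalize one cell value read from a sheet (hypothetical helper)."""
    # Numeric cells arrive as float, so 12.0 becomes 12 when nothing is lost.
    if isinstance(value, float) and value.is_integer():
        return int(value)
    # Text cells often carry stray whitespace around the value.
    if isinstance(value, str):
        return value.strip()
    # Anything else (for example a non-whole float) is returned unchanged.
    return value
```

A helper like this could stand in for either conversion function while the real rules are being worked out, since it leaves values it does not recognize untouched.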

Save to csv file

import csv

# newline='' keeps the csv module from writing blank rows on Windows
with open(up_dir + '/result-{}.csv'.format(file_name), 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(result)
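
One thing to watch: if `file_name` came from glob, it may contain the download folder as well as the `.xls` extension, and both would end up inside the output name. The sketch below shows one way to handle that, assuming the output should be named after the source file alone. The function name `save_rows` and the UTF-8 encoding are illustrative choices, not part of the original script.

```python
import csv
import os


def save_rows(up_dir, source_path, rows):
    """Write rows to up_dir/result-<source name>.csv and return that path."""
    # Keep only the file name without directory or extension, e.g. "report".
    base_name = os.path.splitext(os.path.basename(source_path))[0]
    out_path = os.path.join(up_dir, "result-{}.csv".format(base_name))
    # newline="" lets the csv module control line endings itself.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return out_path
```

Returning the output path makes it easy to hand the finished file to a later step, such as an upload.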

Complete

So far, I was able to do the following:

- When executed, it scrapes the site and downloads the file.
- It puts the processed product in a specific folder.

Next, I will write about "sending the processed product to S3" and "obtaining the original INPUT (words) from S3".
