[R] [Python] Memo to read multiple csv files in multiple zip files

Introduction

I needed to read multiple csv files stored in multiple zip files under a single folder in one go, so I wrote down how I did it.

Folder configuration example

Given a folder structure like the one below, I want to read the csv files inside each zip file in one pass and store them in a list.

input/
    ┣ zip_files/
    ┃         ┣ test1.zip/
    ┃         ┃         ┣  test1_1.csv
    ┃         ┃         ┣  test1_2.csv
    ┃         ┃            ...
    ┃         ┣ test2.zip/
    ┃         ┃         ┣  test2_1.csv
    ┃         ┃         ┣  test2_2.csv
    ┃                      ...
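
To try the code in this memo without real data, a layout like the one above can be generated with a few lines of Python. This is only a sketch: the function name make_sample_zips, the id/value columns, and the dummy values are my own assumptions for illustration, not part of the original data.

```python
import zipfile
from pathlib import Path


def make_sample_zips(root, n_zips=2, n_csvs=2):
    """Create root/zip_files/test{k}.zip, each holding test{k}_{m}.csv.

    Hypothetical helper: the file contents are dummy data.
    """
    zip_dir = Path(root) / "zip_files"
    zip_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for k in range(1, n_zips + 1):
        zip_path = zip_dir / f"test{k}.zip"
        with zipfile.ZipFile(zip_path, "w") as zf:
            for m in range(1, n_csvs + 1):
                # Write a tiny csv (header plus two rows) directly into the zip.
                zf.writestr(f"test{k}_{m}.csv", f"id,value\n1,{k}\n2,{m}\n")
        created.append(zip_path)
    return created
```

Calling make_sample_zips("input") creates input/zip_files/test1.zip and input/zip_files/test2.zip, each with two csv files.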

R code

With the unzip and unz functions, you can read the files inside a zip file without extracting it to disk. I wanted to append each result to a list the way Python's append does, but I wasn't sure how to do that in R, so I settled on the nested list below.

library(tidyverse)
library(data.table)

zip_list <- list.files("zip_files", pattern = "\\.zip$")

# Read every csv file inside every zip file
get_csv <- function(zip_list){

  zip_lists <- list()

  # Loop through the zip files
  for(j in seq_along(zip_list)) {

    zip_path <- file.path("zip_files", zip_list[j])
    # List the contents of the zip (a data frame with Name, Length, Date)
    files <- unzip(zip_path, list = TRUE)
    # Start a fresh list for each zip so results do not leak between zips
    csv_list <- list()

    # Loop over the rows (one per file), not over the columns
    for(i in seq_len(nrow(files))){
      # If the entry is a csv file, read it straight from the zip
      if(grepl("\\.csv$", files$Name[i], ignore.case = TRUE)) {

        print(paste0('reading following file...', files$Name[i]))
        csv_files <- read_csv(unz(zip_path, files$Name[i]),
                              col_names=TRUE)
        ########################
        # Add Some process.
        ########################
        csv_list[[files$Name[i]]] <- csv_files
      }
    }
    # Store the csv list of this zip under the zip file name
    zip_lists[[zip_list[j]]] <- csv_list
  }
  return(zip_lists)
}

system.time(csvs <- get_csv(zip_list))

Python code

The zipfile module lets you read the contents of a zip file without extracting it. enumerate returns both the index and the element on each pass of the for loop, which is handy when your processing needs the index. (The example below does not use the indices, so a plain for loop would behave the same.)

import glob
import zipfile
import pandas as pd
import time

# Paths of the zip files to read (zip_files/ relative to the working directory)
zip_list = sorted(glob.glob('zip_files/*.zip'))

df_list = list()
start = time.time()

for i, zips in enumerate(zip_list):
    # The with block closes the zip file once reading is done
    with zipfile.ZipFile(zips) as zip_f:
        file_list = zip_f.namelist() # names of all files in the zip

        for j, files in enumerate(file_list):
            # Skip anything that is not a csv file
            if not files.lower().endswith('.csv'):
                continue
            print('reading following file...' + zips + '/' + files)

            df = pd.read_csv(zip_f.open(files))
            ########################
            # Add some processing here.
            # Use i and j too if needed.
            ########################
            df_list.append(df)

elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")
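
The flat df_list above loses track of which zip file and csv file each DataFrame came from, whereas the R version keeps one list per zip. A minimal sketch of one way to keep that information in Python is shown below. The function names read_csvs_by_zip and combine_frames and the column names source_zip and source_csv are my own assumptions for illustration, and combining only makes sense if the csv files share the same columns.

```python
import glob
import os
import zipfile

import pandas as pd


def read_csvs_by_zip(zip_dir):
    """Return {zip file name: {csv member name: DataFrame}} (hypothetical helper)."""
    result = {}
    for zip_path in sorted(glob.glob(os.path.join(zip_dir, "*.zip"))):
        frames = {}
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                # Skip members that are not csv files.
                if not name.lower().endswith(".csv"):
                    continue
                # Read the member straight from the zip, without extracting it.
                with zf.open(name) as fh:
                    frames[name] = pd.read_csv(fh)
        result[os.path.basename(zip_path)] = frames
    return result


def combine_frames(result):
    """Stack all DataFrames into one, tagging each row with its origin."""
    tagged = [
        df.assign(source_zip=zip_name, source_csv=csv_name)
        for zip_name, frames in result.items()
        for csv_name, df in frames.items()
    ]
    return pd.concat(tagged, ignore_index=True) if tagged else pd.DataFrame()
```

With the folder layout from the beginning of this memo, nested = read_csvs_by_zip("zip_files") gives access to a single file as nested["test1.zip"]["test1_1.csv"], and combine_frames(nested) returns one DataFrame with two extra columns recording the source.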
