Call a shell script from Python, then convert CSV to Parquet

Overall goal

I needed to convert a large number of CSV files to Parquet. Because the CSV files had no header line with column names in the first place, I had to create a tool that performs these two steps:

- Add a header to the CSV file
- Convert the CSV to Parquet



Assumptions

The column names added to the CSV header become the column names in the Parquet file. If the header line does not exist and the file starts directly with data, the values in the first data line become the column names of the output Parquet file.
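To illustrate the assumption above, here is a minimal sketch using pandas on an in-memory CSV. The sample data is made up, and header1 and header2 are placeholder names. By default read_csv treats the first line as the header. Passing header=None together with names supplies the column names at read time instead, which is an alternative to editing the files and not the approach taken in this article.

```python
import io

import pandas as pd

# Hypothetical headerless CSV content, used only for illustration.
raw = "a,1\nb,2\n"

# Default behaviour: the first data line is consumed as the column names.
df_default = pd.read_csv(io.StringIO(raw))

# Alternative: declare that there is no header line and supply the names directly.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=["header1", "header2"])

print(list(df_default.columns), len(df_default))
print(list(df_named.columns), len(df_named))
```

In the default case one data line is lost to the header, which is why the header has to be present (or declared) before conversion.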

Call a shell script from Python

The step that adds the CSV header line could have been written in Python, but it was relatively easy to do in a shell script, so I wrote it as a shell script and called that file from Python.
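For comparison, a pure-Python version of the header step might look like the sketch below. The function names add_header and add_header_to_dir are hypothetical, UTF-8 encoding is an assumption, and each file is read fully into memory, so this suits files of moderate size.

```python
from pathlib import Path


def add_header(csv_path, header_line):
    """Prepend header_line to the file at csv_path (reads the whole file into memory)."""
    path = Path(csv_path)
    body = path.read_text(encoding="utf-8")
    path.write_text(header_line + "\n" + body, encoding="utf-8")


def add_header_to_dir(directory, header_line):
    """Apply add_header to every regular file directly under directory."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            add_header(path, header_line)
```

Calling add_header_to_dir("from_dir", "header1,header2") would cover the same files as the shell loop in this article, without requiring gsed.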

qiita.py


import subprocess

# Run the shell script that adds the header line to each CSV file
cmd = './add_header.sh'
subprocess.call(cmd, shell=True)

Passing shell=True to subprocess.call runs the command through the shell, so you can call an external shell script.
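Note that subprocess.call only returns the exit code, so a failure inside the script goes unnoticed unless that code is checked. As a hedged alternative, the sketch below uses subprocess.run with check=True and an argument list instead of a shell string. The helper name run_command is hypothetical, and a stand-in Python command is used so the example runs anywhere.

```python
import subprocess
import sys


def run_command(args):
    """Run a command without a shell; raise CalledProcessError on a non-zero exit."""
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


# Stand-in command for illustration. In this article's setup the call
# would be run_command(["./add_header.sh"]).
output = run_command([sys.executable, "-c", "print('header added')"])
print(output.strip())
```

With check=True, the Parquet conversion that follows would not run on files whose header step failed.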

add_header.sh


#!/usr/bin/env bash
# Insert the header line at the top of every regular file directly under from_dir.
for file in `\find from_dir -maxdepth 1 -type f`; do
    gsed -i '1iheader1,header2' "$file"
done

The "1i" in the gsed call is required: it tells gsed to insert the text that follows before line 1 of the file.

gsed is GNU sed. If it is not available, please install gnu-sed (for example with Homebrew on macOS).

■ Execution result

CSV file header: header1,header2

Convert CSV to Parquet

I had to convert a large number of CSV files stored on S3 to Parquet. The script below assumes that all of the files have already been downloaded locally.

qiita2.py


import glob
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from_dir = './from_dir/'
to_dir = './to_dir/'

# Collect every file directly under from_dir
files = glob.glob(from_dir + "*")

# Convert one file at a time and store the result in to_dir
for file in files:
    # Use the file name itself rather than a fixed position in the split path
    file_name = os.path.basename(file)
    df = pd.read_csv(file)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, to_dir + file_name + '.pq')

Each CSV file is read into a pandas DataFrame and then written out as Parquet. The conversion is easy with pyarrow.
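The output file name deserves some care when input paths are nested or already carry a .csv extension. As a small sketch, the helper below derives the output path with pathlib. The name parquet_path_for, the sample file name sales_2020.csv, and the .parquet suffix are assumptions for illustration, not part of the original tool.

```python
from pathlib import Path


def parquet_path_for(csv_file, to_dir, suffix=".parquet"):
    """Return the output path in to_dir for csv_file, swapping its extension for suffix."""
    return Path(to_dir) / (Path(csv_file).stem + suffix)


# Hypothetical input path, used only to show the mapping.
target = parquet_path_for("./from_dir/sales_2020.csv", "./to_dir/")
print(target.as_posix())
```

The result can be handed to pq.write_table as str(target), and it does not depend on how many directory levels the input path contains.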
