Windows → Linux: Tips for bringing in data


People often receive files created with Excel or Word on a Windows machine and need to process them on Linux. This note summarizes the basic data conversion steps needed in that situation.

Bringing files with Japanese names from Windows to Linux

Here, a "Japanese-name file" means a file whose name contains so-called double-byte characters.

Extracting a zip file created on Windows under Linux

unzip -O cp932 archive.zip   # archive.zip: the archive containing Japanese file names

cp932 is Microsoft's extension of the Shift JIS character encoding.

reference: Actually not scary CP932
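
To illustrate why the commands in this note name cp932 rather than plain Shift JIS, the following sketch decodes a single character that exists only in the Microsoft extension. It assumes an iconv that knows both the cp932 and shift_jis encoding names. try_decode is a helper name made up for this illustration, and whether the stricter shift_jis decoder rejects these bytes depends on the iconv implementation.

```shell
# The circled digit one is stored as bytes 0x87 0x40 in CP932.
# It belongs to a vendor extension area, not to the JIS X 0208 core set.
# try_decode is a hypothetical helper: $1 is the source encoding name.
try_decode() {
  printf '\207\100' | iconv -f "$1" -t utf-8 2>/dev/null
}

for enc in cp932 shift_jis; do
  if try_decode "$enc" >/dev/null; then
    echo "$enc: decoded"
  else
    echo "$enc: could not decode (or encoding name unknown to this iconv)"
  fi
done
```

If the second line reports a failure, text from Windows that contains such characters needs cp932 as the source encoding.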

Converting file names written in Shift JIS to UTF-8

convmv -f cp932 -t utf-8 * --notest

On Ubuntu, the convmv command is not installed by default, so run `apt install convmv` beforehand.
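
Because convmv renames files in place, it can be worth previewing the result first. Without --notest, convmv only prints the renames it would perform. The following is a minimal sketch: preview_then_rename is a hypothetical wrapper name, and the confirmation prompt is an addition of this example rather than part of convmv.

```shell
# preview_then_rename FILE...
# Hypothetical wrapper: dry run first, rename only after confirmation.
preview_then_rename() {
  if ! command -v convmv >/dev/null 2>&1; then
    echo "convmv not found (on Ubuntu: apt install convmv)" >&2
    return 127
  fi
  # Dry run: without --notest, convmv changes nothing on disk.
  convmv -f cp932 -t utf-8 "$@" || return 1
  printf 'Apply these renames? [y/N] ' >&2
  read -r answer
  if [ "$answer" = y ]; then
    # --notest tells convmv to really rename the files.
    convmv -f cp932 -t utf-8 --notest "$@"
  else
    echo "No files renamed." >&2
  fi
}

# Usage (in the directory holding the extracted files):
#   preview_then_rename *
```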

If a zip containing Japanese file names is extracted on Linux with the "standard" procedure (that is, without -O cp932), the names seem to be converted to UTF-8 as garbled characters. Running convmv afterwards does not help: it reports that the names have already been processed and refuses to rename them. There is no choice but to give up and re-extract from the original zip file using the procedure above.

Converting character encoding and line endings (Shift JIS → UTF-8 and CRLF → LF)

Convert Shift JIS text to UTF-8, convert the line endings from CRLF (the Windows standard) to LF, and write the result to a new file.

iconv -f cp932 -t utf-8 INPUT_FILE | sed 's/\r//g' > OUTPUT_FILE
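
The one-liner above can be wrapped in a small function so that a failed conversion does not leave a half-written output file. This is only a sketch: sjis_to_utf8 is a hypothetical name, the temporary file is an addition of this example, and tr -d '\r' is used in place of the sed expression to the same effect.

```shell
# sjis_to_utf8 INPUT OUTPUT
# Hypothetical helper: CP932 + CRLF text in, UTF-8 + LF text out.
sjis_to_utf8() {
  local src=$1 dst=$2 tmp
  tmp=$(mktemp) || return 1
  # iconv exits non-zero when the input is not valid CP932.
  if iconv -f cp932 -t utf-8 "$src" > "$tmp"; then
    # Remove every carriage return, then write the final file.
    tr -d '\r' < "$tmp" > "$dst"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    echo "sjis_to_utf8: could not convert $src" >&2
    return 1
  fi
}

# Usage:
#   sjis_to_utf8 report.csv report.utf8.csv
```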

I also worked out a way to process all the files in the current directory at once. The converted contents are written to files with the same names in the subdirectory `utf8`, using a bash loop.

[ -d utf8 ] || mkdir utf8
for a in *; do
  [ -f "$a" ] || continue   # skip the utf8 directory itself
  iconv -f cp932 -t utf-8 "$a" | sed 's/\r//g' > "utf8/$a"
done
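
After a batch conversion like this, a quick sanity check of the output directory may be useful. The sketch below is an addition to the original procedure: check_converted is a hypothetical name, and it only verifies that each file decodes as UTF-8 and contains no carriage returns.

```shell
# check_converted [DIR]
# Hypothetical helper: report files in DIR (default: utf8) that are
# not valid UTF-8 or that still contain a carriage return.
check_converted() {
  local dir=${1:-utf8} status=0 cr f
  cr=$(printf '\r')
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # Re-encoding UTF-8 to UTF-8 fails on invalid byte sequences.
    if ! iconv -f utf-8 -t utf-8 "$f" >/dev/null 2>&1; then
      echo "not valid UTF-8: $f"
      status=1
    fi
    if grep -q "$cr" "$f"; then
      echo "still contains CR: $f"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_converted utf8 && echo "all files look fine"
```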
