I came across a group of files that looked troublesome at first glance, but decompressing them all, converting them to UTF-8, and combining them into one file turned out to be easier than I expected.
The files looked like this:

- Many ZIP files
- Each ZIP file contains multiple CSV files
- The file names after decompression are in Shift_JIS
- The file contents are also in Shift_JIS, and the line endings are CRLF
- Each individual CSV file is not very large, though
- The records have fields such as a registration number (UID), a registration date and time, and a registrant ID; records are unique within a single file, but there are duplicates across files

> It's a common story when you download manually from a system: you end up downloading overlapping data.
Make a backup so that you don't accidentally delete or overwrite the original ZIP files. Even if you think it will be fine, it may not be, so be sure to do this.
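For example, a minimal backup sketch (the `backup` directory name and the dummy file names are just for this demo):

```shell
cd "$(mktemp -d)"
touch a.zip b.zip            # stand-ins for the real ZIP files (demo only)

# keep an untouched copy of the originals before doing anything else
mkdir -p backup
cp -a -- *.zip backup/
ls backup
```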
First, unzip the many ZIP files in one go. This is easy with the `find` command. Adding `-j` ignores the directory structure inside each archive, and the `-B` option avoids overwriting existing files when decompressed file names collide (it keeps backup copies instead).
It seems you can convert the file names by specifying `-O sjis` with `unzip`, but I have run into cases where it doesn't work many times, so I don't use it here.
Unzip ZIP file
mkdir work #Creating a working directory
cd work
find ../ -name '*.zip' -exec unzip -j -B {} \;
You can also use `ls` and `xargs` instead of `find`.
Unzip ZIP file (using ls and xargs)
mkdir work #Creating a working directory
cd work
ls ../*.zip | xargs -I{} unzip -j -B {}
When I unzipped a ZIP file, there was another ZIP file inside it. That's also a common story; in that case, decompress again as follows. If yet another ZIP file appears after unzipping, run it again, repeating until nothing is left. Each run produces more files and more duplicates, but since we plan to remove duplicate records later, it doesn't matter how many times you run it at this point.
I don't restrict the pattern to `*.zip` here for two reasons: files that aren't in ZIP format simply fail to decompress with an error, which is harmless; and in the past I've had Shift_JIS file names so garbled that they no longer matched `*.zip`.
Unzip the ZIP file (if the ZIP file is nested)
find ./ -type f -exec unzip -j -B {} \;
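The "run it again until no ZIP files come out" step can also be sketched as a loop. This is an assumption-laden sketch: it relies on `file` recognizing every nested archive, and reuses the same `unzip -j -B` invocation as above.

```shell
cd "$(mktemp -d)"
printf 'plain text\n' > note.txt   # demo content; no real ZIPs in this demo

# repeat extraction while any file in the current directory is still a ZIP archive
while file ./* 2>/dev/null | grep -q 'Zip archive'; do
  file ./* | grep 'Zip archive' | sed 's/: *Zip archive.*//' |
    xargs -I{} unzip -j -B {}
done
echo 'no ZIP files left'
```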
There are other ways to write this with `ls` and `xargs`, but I think it's better to choose the method that is easiest for you to understand rather than the shortest one.
Unzip ZIP file (using ls and xargs)
ls * | xargs -I{} unzip -j -B {}
Once everything is unzipped, leftover ZIP files will get in the way, so delete them. If the extension is reliable you can just use `rm`, but this time, assuming it isn't, I'll find the files whose contents are in ZIP format and delete those. This isn't too difficult either; it's surprisingly easy with `file` and `grep`. Check the target files before deleting, just in case. If there are many files this takes a while, so when you can judge by the file name, deleting with `rm *.zip` is the better choice.
Confirmation of the target ZIP file
file * | grep 'Zip archive'
Perform ZIP file deletion
file * | grep 'Zip archive' | sed 's/: *Zip archive.*//' | xargs -I{} rm {}
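Note that the pipeline above breaks on file names containing spaces. A sketch of a space-safe variant using `find` together with `file -b` (the `-b` option prints only the type, without the file name); the demo file names are made up:

```shell
cd "$(mktemp -d)"
# a 22-byte end-of-central-directory record is a minimal valid empty ZIP
{ printf 'PK\005\006'; head -c 18 /dev/zero; } > 'empty archive.zip'
printf 'hello\n' > 'keep me.txt'

# delete files whose *contents* are ZIP, even if the names contain spaces
find . -maxdepth 1 -type f -exec sh -c \
  'file -b "$1" | grep -q "^Zip archive" && rm -- "$1"' sh {} \;
ls
```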
If the file names after decompression are in Shift_JIS, they are probably garbled, so convert all the file names to UTF-8. This is easy with the `convmv` command. If you don't have `convmv`, install it with `sudo apt install convmv`.
By the way, in this case the files will be combined into one later, so the file names can be anything. Even if some names can't be converted cleanly, there's no need to obsess over it; a few garbled names are mostly fine.
Convert file names to UTF-8
convmv -f sjis -t utf8 --notest *
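If `convmv` is not available, a rough alternative is to rename through `iconv`. This is a sketch under the assumption that every file name is valid Shift_JIS; the demo file name below (the Shift_JIS bytes 0x82 0xA0 for "あ") is made up:

```shell
cd "$(mktemp -d)"
touch "$(printf '\202\240').txt"   # file whose name is Shift_JIS bytes (demo)

# convert each Shift_JIS file name to UTF-8 (alternative when convmv is missing)
for f in *; do
  g=$(printf '%s' "$f" | iconv -f SJIS -t UTF-8) || continue
  [ "$f" != "$g" ] && mv -- "$f" "$g"
done
ls
```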
For the file contents, I use the familiar `nkf`. If it isn't installed, install it with `sudo apt install nkf`. `nkf` is convenient because it can convert the character encoding and the line endings at the same time. With many files you can hit "Too many open files", so I use `find`, but with few files you can simply pass them all at once, as in `nkf -Lu -w --overwrite *`.
Convert file contents to UTF-8 (Part 1: using find)
find ./ -type f -exec nkf -Lu -w --overwrite {} \;
Convert file contents to UTF-8 (Part 2: if there are few files)
nkf -Lu -w --overwrite *
Convert file contents to UTF-8 (Part 3: using ls and xargs)
ls * | xargs -I{} nkf -Lu -w --overwrite {}
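If `nkf` is not installed, the same content conversion can be approximated with `iconv` plus `tr`. A sketch; the sample bytes below are the Shift_JIS encoding of "あ" followed by CRLF:

```shell
cd "$(mktemp -d)"
printf '\202\240\r\n' > sjis.csv   # Shift_JIS "あ" + CRLF (demo data)

# UTF-8 conversion plus CRLF -> LF, equivalent in spirit to `nkf -Lu -w`
iconv -f SJIS -t UTF-8 sjis.csv | tr -d '\r' > utf8.csv
cat utf8.csv
```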
The only trick here is knowing that every CSV file has the column names on its first line, so you can't simply concatenate them with `cat`. The idea is: take the header from the first line of one file, then append the data from every file with its header line removed.
Gzip compression by sticking all files together and removing duplicates
(cat * | head -1; ls * | xargs -I{} sed '1d' {} | sort | uniq) | gzip > all.csv.gz
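To see the header handling concretely, here is the same one-liner run on two tiny hand-made CSV files (file names and contents are made up for the demo):

```shell
cd "$(mktemp -d)"
printf 'id,name\n1,foo\n2,bar\n' > a.csv   # header + 2 rows
printf 'id,name\n2,bar\n3,baz\n' > b.csv   # header + 2 rows, one row duplicated

# header once, then deduplicated data rows from every file
(cat *.csv | head -1; ls *.csv | xargs -I{} sed '1d' {} | sort | uniq) | gzip > all.csv.gz
gzip -dc all.csv.gz
```

The output keeps a single `id,name` header, and the duplicated `2,bar` row appears only once.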
By the way, if you can use the "M commands" of "NYSOL", you can write it as follows. NYSOL's M commands can process even large CSV files with a small amount of memory, so they are quite convenient.
Gzip compression by sticking all files together with M command to remove duplicates
mcat i=* | muniq k='*' | mfldname -q | gzip > all.csv.gz
The result is gzip-compressed; when you want to check the size after decompression, it looks like the following.
Check size after unfolding
zcat all.csv.gz | wc -l -c
Confirmation of size after expansion (execution result)
$ zcat all.csv.gz | wc -l -c
748654 229449752
Decompressing the gzip file shows that there are about 750,000 records and the file size is about 230 MB.
Smaller files are just easier to handle, aren't they? It depends on the contents, but gzip brings the size down to roughly 1/8, and with R's data.table you can read the compressed file directly, as in `data.table::fread("zcat all.csv.gz")`, so a gzip-compressed file is not hard to work with at all.
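If you just want the compression ratio, `gzip -l` reports the compressed size, the uncompressed size, and the ratio directly. A sketch with made-up, highly repetitive sample data (which compresses far better than real CSVs):

```shell
cd "$(mktemp -d)"
yes 'id,name,value' | head -n 10000 > sample.csv   # demo data only
gzip sample.csv

# compressed size, uncompressed size, and ratio in one command
gzip -l sample.csv.gz
```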