[Merge sort]: https://ja.wikipedia.org/wiki/%E3%83%9E%E3%83%BC%E3%82%B8%E3%82%BD%E3%83%BC%E3% 83% 88 [io.IOBase.readlines]:http://docs.python.jp/3/library/io.html#io.IOBase.readlines
Python scripts running in a Windows environment now need to sort text files.
The requirements are as follows.
--Run in Windows environment --The target file is a CSV file of several GB --The sort key has multiple columns, and there is also a numeric column.
The Windows command prompt has a [sort command] windows-sort that supports sorting huge files. However, unlike the [Linux sort command] linux-sort, you cannot specify a delimiter (-t) or sort as a number (-n).
If the size is small, you can use a Python script to load all the data into memory and call the sorted function, but if the size is large, you may run out of memory.
Since there is no help for it, I decided to implement a sort process corresponding to a huge file in Python.
To sort a huge amount of data, use the concept of [Merge sort] as follows.
I have created a Processing Procedure Slide so that you can see how it works.
Use [io.IOBase.readlines] to split and read the file.
Using readlines with no arguments loads all lines into memory, so using readlines when the file size is large is known as an anti-pattern.
However, readlines can limit the number of bytes / characters to be read by the argument, which can be used to divide and read even a huge text file.
>>> from io import StringIO
>>> f = StringIO("11\n22\n3\n4\n5\n666666\n")
>>> f.readlines(5)
['11\n', '22\n']
>>> f.readlines(5)
['3\n', '4\n', '5\n']
>>> f.readlines(5)
['666666\n']
>>> f.readlines(5)
[]
It seems to be the number of bytes for files opened in binary mode and the number of characters for text mode.
Use heapq.merge for the merge. heapq.merge returns the result of merging multiple sorted iterables.
>>> import heapq
>>> list(heapq.merge([1, 3, 4, 7], [2, 5], [6]))
[1, 2, 3, 4, 5, 6, 7]
It is listed in gist. The Python version is 3.6.
The following is an excerpt of the sort part.
import heapq
import os
from contextlib import contextmanager
from operator import itemgetter
from tempfile import TemporaryDirectory, mktemp
from typing import IO, Callable, List
def large_sort(input_file: IO, output_file: IO, key: Callable=None, reverse: bool=False, limit_chars: int=1024*1024*64):
with TemporaryDirectory() as tmp_dir:
for lines in _read_parts(input_file, limit_chars):
lines = sorted(lines, key=key, reverse=reverse)
_write_part(lines, tmp_dir)
with _open_tmp_files(tmp_dir) as tmp_files:
for row in heapq.merge(*tmp_files, key=key, reverse=reverse):
output_file.write(row)
def _read_parts(input_file, limit_chars):
lines = input_file.readlines(limit_chars)
while lines:
yield lines
lines = input_file.readlines(limit_chars)
def _write_part(lines, tmp_dir):
tmp_filename = mktemp(dir=tmp_dir)
with open(tmp_filename, "w") as tmp_file:
tmp_file.writelines(lines)
return tmp_filename
@contextmanager
def _open_tmp_files(tmp_dir):
filenames = os.listdir(tmp_dir)
files = [open(os.path.join(tmp_dir, filename), "r") for filename in filenames]
try:
yield files
finally:
for file in files:
file.close()
In this requirement, I wanted a function to call from another Python script, so I don't need it, but in the gist source, I made it available from the command line by referring to the sort command of linux.
I made it myself this time, but I will write it because there are other solutions.
--Use a system environment that can use Linux commands such as Cygwin --This time, the execution environment was a non-engineer's PC, so it's hard to say "install Cygwin", so I rejected it. --It seems possible to extract and bundle only sort.exe and necessary dlls. --Use a free sort tool ――I searched lightly, but I couldn't find it that could be used from the command line and met the requirements. ――Maybe you can find it if you search properly --Powershell sort command --It seems to be more sophisticated than the sort command at the command prompt, but it has not been confirmed.
Recommended Posts