I have been working through the 100 Language Processing Knocks at a study session made up mainly of in-house members. This is a summary of the answer code and the tricks I found useful along the way. Most of the content was investigated and verified by myself, but it also includes information shared by other study group members.
This time I will cover the basics of UNIX commands. Since the earlier articles by @moriwo and [@segavvy](https://qiita.com/segavvy/items/fb50ba8097d59475f760) already give fairly detailed explanations, I will keep the explanations in this article brief. If anything is unclear after reading this, I recommend taking a look at both of their articles.
- UNIX commands learned in Chapter 2 of 100 Language Processing Knocks (this article)
- Regular expressions learned in Chapter 3 of 100 Language Processing Knocks
- Morphological analysis learned in Chapter 4 of 100 Language Processing Knocks
Python
def count_lines():
    with open('hightemp.txt') as file:
        return len(file.readlines())
count_lines()
Result (Python)
24
UNIX
!wc -l hightemp.txt
Result (UNIX)
24 hightemp.txt
The UNIX command is overwhelmingly more concise. By the way, the `!` in front of `wc` is used when executing UNIX commands from JupyterLab or Jupyter Notebook (in some environments it also works without the `!`).
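As a side note, if you want to call the same UNIX command from a plain Python script rather than a notebook, the standard `subprocess` module can be used. This is only a minimal sketch, assuming the `wc` command is available on the system (e.g. Linux or macOS):

Python
import subprocess

# Run `wc -l hightemp.txt` and print its output
# (a sketch; assumes wc is on PATH).
result = subprocess.run(['wc', '-l', 'hightemp.txt'],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())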
Python
def replace_tabs():
    with open('hightemp.txt') as file:
        return file.read().replace('\t', ' ')
print(replace_tabs())
UNIX
!cat hightemp.txt | sed $'s/\t/ /g'
Result (common to Python and UNIX)
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
...
Regarding the UNIX `sed` command, one point worth noting is that `\t` is not recognized as a tab character unless the `$` prefix (the `$'...'` quoting above) is added.
Python
import pandas as pd
def separate_columns():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    df.iloc[:, 0].to_csv('col1.txt', header=False, index=False)
    df.iloc[:, 1].to_csv('col2.txt', header=False, index=False)
separate_columns()
UNIX
!cut -f 1 hightemp.txt > col1_unix.txt
!cut -f 2 hightemp.txt > col2_unix.txt
If you check the result with `!head col1.txt col2.txt`, the output is as follows. The same applies when `!head col1_unix.txt col2_unix.txt` is used.
Result (Python)
==> col1.txt <==
Kochi Prefecture
Saitama
...
==> col2.txt <==
Ekawasaki
Kumagaya
...
Python
def merge_columns():
    with open('col1.txt') as col1_file, open('col2.txt') as col2_file, \
            open('merge.txt', mode='w') as new_file:
        for col1_line, col2_line in zip(col1_file, col2_file):
            new_file.write(f'{col1_line.rstrip()}\t{col2_line.rstrip()}\n')
merge_columns()
UNIX
!paste col[1-2].txt > merge_unix.txt
Result (common to Python and UNIX)
Kochi Prefecture Ekawasaki
Saitama Prefecture Kumagaya
Gifu Prefecture Tajimi
Yamagata Prefecture Yamagata
...
To check the result, use `!head merge.txt` or `!head merge_unix.txt`, and you should get the output above.
Python
def show_head():
    n = int(input())
    with open('hightemp.txt') as file:
        for line in file.readlines()[:n]:
            print(line.rstrip())
show_head()
UNIX
!head -3 hightemp.txt
Result (common to Python and UNIX)
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
For Python, if you want to return a list from a function, you might write something like this:
Python
def show_head():
    n = int(input())
    with open('hightemp.txt') as file:
        return [line for line in file.readlines()[:n]]
print(*show_head())
On the other hand, in UNIX it is hard to receive an integer interactively as in Python, but if you write an integer after `-`, you can specify how many lines to display. As an applied usage, for example, you can append this to the end of the answer of 12:
UNIX
!cat hightemp.txt | sed $'s/\t/ /g' | head -5
This displays only the first 5 lines.
Python
def show_tail():
    n = int(input())
    with open('hightemp.txt') as file:
        return [line for line in file.readlines()[-n:]]
print(*show_tail())
UNIX
!tail -3 hightemp.txt
Result (common to Python and UNIX)
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
Almost the same as 14.
Python
import math
def split_file():
    n = int(input())
    with open('hightemp.txt') as file:
        lines = file.readlines()
        num = math.ceil(len(lines) / n)
        for i in range(n):
            with open('split{}.txt'.format(i + 1), mode='w') as new_file:
                text = ''.join(lines[i * num:(i + 1) * num])
                new_file.write(text)
split_file()
UNIX
!split -n 5 -d hightemp.txt split_unix
Result (Python)
==> split1.txt <==
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
==> split5.txt <==
Osaka Toyonaka 39.9 1994-08-08
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
The above is the Python output, confirmed with `!head split1.txt split5.txt`.
On the other hand, the above UNIX command does not work in my environment, so I tried it on Colab (Google Colaboratory). It seems that the `-n` option is generally available on Linux (@IT), but it does not appear to work with the default macOS `split`. When I ran the above in Colab, five files from `split_unix00` to `split_unix04` were created, but adding the `.txt` extension to them afterwards felt a little troublesome. [@moriwo's article](https://qiita.com/moriwo/items/9d2a73a75f543e2ea6af#16-%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%82%92n%E5%88%86%E5%89%B2%E3%81%99%E3%82%8B) introduces an implementation example using `awk` and the like, but I wondered if the code would be easier to read in Python.
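For reference, renaming the Colab output files can also be done in a few lines of Python. This is only a sketch under the assumption that the files `split_unix00` to `split_unix04` produced by the `split` command above exist in the current directory:

Python
from pathlib import Path

# Append a .txt extension to each split_unix00 ... split_unix04 file
# (a sketch; assumes these files were created by the split command above).
for path in sorted(Path('.').glob('split_unix0[0-4]')):
    path.rename(path.with_suffix('.txt'))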
Python
import pandas as pd
def get_chars_set():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    return set(df.iloc[:, 0])
print(get_chars_set())
Result (Python)
{'Chiba', 'Saitama', 'Yamagata Prefecture', 'Wakayama Prefecture', 'Shizuoka Prefecture', 'Kochi Prefecture', 'Osaka', 'Gifu Prefecture', 'Gunma Prefecture', 'Ehime Prefecture', 'Yamanashi Prefecture', 'Aichi prefecture'}
UNIX
!sort -u col1_unix.txt
Result (UNIX)
Chiba
Saitama
Osaka
Yamagata Prefecture
...
With UNIX commands, you can also pipe the sorting and deduplication steps together and write `!sort col1_unix.txt | uniq`.
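If you want the Python side to produce the same sorted, de-duplicated output as `sort -u`, one option is to sort a set built from the first-column file. This is just a small sketch, assuming the `col1.txt` created earlier exists:

Python
# Sorted, de-duplicated first-column values, analogous to `sort -u col1_unix.txt`
# (a sketch; assumes col1.txt from the earlier answer exists).
def sorted_unique_prefectures():
    with open('col1.txt') as file:
        return sorted({line.rstrip() for line in file})

print('\n'.join(sorted_unique_prefectures()))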
Python
def sort_rows():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    df.rename(columns={0: 'Prefect', 1: 'City', 2: 'Temp', 3: 'Date'}, inplace=True)
    df.sort_values(by='Temp', inplace=True)
    return df
sort_rows()
Result (Python)
Prefect City Temp Date
23 Aichi Prefecture Nagoya 39.9 1942-08-02
21 Yamanashi Prefecture Otsuki 39.9 1990-07-19
20 Osaka Toyonaka 39.9 1994-08-08
...
UNIX
!sort hightemp.txt -k 3
Result (UNIX)
Aichi Prefecture Nagoya 39.9 1942-08-02
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Osaka Toyonaka 39.9 1994-08-08
...
I was not sure whether "reverse order" here means the reverse of the original order or descending order, but I solved it with the latter interpretation. This problem in particular made me feel how concise UNIX commands can be.
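Incidentally, if you want to make the descending-order reading explicit, pandas accepts an `ascending=False` argument (and `sort` has a `-r` option). The following is only a sketch, reusing the same column names as in the answer above:

Python
import pandas as pd

# Sort by the Temp column in descending order
# (a sketch; column names follow the answer above).
def sort_rows_desc():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None,
                     names=['Prefect', 'City', 'Temp', 'Date'])
    return df.sort_values(by='Temp', ascending=False)

print(sort_rows_desc().head())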
Python
def count_freq():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    return df[0].value_counts()
count_freq()
Result (Python)
Gunma Prefecture 3
Yamanashi 3
Yamagata Prefecture 3
Saitama Prefecture 3
UNIX
!cut -f 1 hightemp.txt | sort | uniq -c | sort -r
Result (UNIX)
3 Gunma Prefecture
3 Yamanashi Prefecture
3 Yamagata Prefecture
3 Saitama Prefecture
The UNIX command is a bit longer, but I wrote it so that the flow is easy to follow: first cut out the first column (`cut -f 1`), then count the frequency of each value (`sort | uniq -c`), and finally arrange the lines in reverse order of frequency (`sort -r`).
However, considering how convenient `value_counts()` in pandas is, I think the Python version is easier to understand here.
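As another option, the same counting can be done without pandas by using `collections.Counter`; this is a minimal sketch of that idea:

Python
from collections import Counter

# Count how often each value appears in the first column and
# list the results in descending order of frequency.
def count_freq_counter():
    with open('hightemp.txt') as file:
        prefectures = [line.split('\t')[0] for line in file]
    return Counter(prefectures).most_common()

print(count_freq_counter())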
That's all for this chapter. If you find any mistakes, please let me know in the comments.