I have been working through the 100 Language Processing Knocks at a study session made up mainly of in-house members. This is a summary of the answer code and the tricks I found useful along the way. Most of the content was investigated and verified by myself, but it also includes information shared by other study group members.
This time I will cover the basics of UNIX commands. Since the earlier articles by @moriwo and [@segavvy](https://qiita.com/segavvy/items/fb50ba8097d59475f760) already give fairly detailed explanations, I will keep the explanations in this article brief. If anything is unclear after reading this, I recommend taking a look at both of their articles.
- UNIX commands learned in Chapter 2 of 100 Language Processing Knocks (this article)
- Regular expressions learned in Chapter 3 of 100 Language Processing Knocks
- Morphological analysis learned in Chapter 4 of 100 Language Processing Knocks
Python
def count_lines():
    with open('hightemp.txt') as file:
        return len(file.readlines())
count_lines()
Result (Python)
24
UNIX
!wc -l hightemp.txt
Result (UNIX)
24 hightemp.txt
The UNIX command is overwhelmingly more concise. By the way, the `!` in front of `wc` is used when executing UNIX commands from JupyterLab or Jupyter Notebook (in some environments it also works without the `!`).
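As a side note, if you want to call the same UNIX command from a plain Python script rather than a notebook, the standard `subprocess` module can be used. This is only a minimal sketch, assuming the `wc` command is available on the system (e.g. Linux or macOS):

Python
import subprocess

# Run `wc -l hightemp.txt` and print its output
# (a sketch; assumes wc is on PATH).
result = subprocess.run(['wc', '-l', 'hightemp.txt'],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())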
Python
def replace_tabs():
    with open('hightemp.txt') as file:
        return file.read().replace('\t', ' ')
print(replace_tabs())
UNIX
!cat hightemp.txt | sed $'s/\t/ /g'
Result (common to Python and UNIX)
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
...
Regarding the UNIX `sed` command, one point worth noting is that `\t` is not recognized as a tab character unless the `$` prefix (the `$'...'` quoting above) is added.
Python
import pandas as pd
def separate_columns():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    df.iloc[:, 0].to_csv('col1.txt', header=False, index=False)
    df.iloc[:, 1].to_csv('col2.txt', header=False, index=False)
separate_columns()
UNIX
!cut -f 1 hightemp.txt > col1_unix.txt
!cut -f 2 hightemp.txt > col2_unix.txt
If you check the result with `!head col1.txt col2.txt`, the output is as follows. The same applies when `!head col1_unix.txt col2_unix.txt` is used.
Result (Python)
==> col1.txt <==
Kochi Prefecture
Saitama
...
==> col2.txt <==
Ekawasaki
Kumagaya
...
Python
def merge_columns():
    with open('col1.txt') as col1_file, open('col2.txt') as col2_file, \
            open('merge.txt', mode='w') as new_file:
        for col1_line, col2_line in zip(col1_file, col2_file):
            new_file.write(f'{col1_line.rstrip()}\t{col2_line.rstrip()}\n')
merge_columns()
UNIX
!paste col[1-2].txt > merge_unix.txt
Result (common to Python and UNIX)
Kochi Prefecture Ekawasaki
Saitama Prefecture Kumagaya
Gifu Prefecture Tajimi
Yamagata Prefecture Yamagata
...
To check the result, use `!head merge.txt` or `!head merge_unix.txt`, and you should get the output above.
Python
def show_head():
    n = int(input())
    with open('hightemp.txt') as file:
        for line in file.readlines()[:n]:
            print(line.rstrip())
show_head()
UNIX
!head -3 hightemp.txt
Result (common to Python and UNIX)
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
For Python, if you want to return a list from a function, you might write something like this:
Python
def show_head():
    n = int(input())
    with open('hightemp.txt') as file:
        return [line for line in file.readlines()[:n]]
print(*show_head())
On the other hand, in UNIX it is hard to receive an integer interactively as in Python, but if you write an integer after `-`, you can specify how many lines to display. As an applied usage, for example, you can append this to the end of the answer of 12:
UNIX
!cat hightemp.txt | sed $'s/\t/ /g' | head -5
This displays only the first 5 lines.
Python
def show_tail():
    n = int(input())
    with open('hightemp.txt') as file:
        return [line for line in file.readlines()[-n:]]
print(*show_tail())
UNIX
!tail -3 hightemp.txt
Result (common to Python and UNIX)
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
Almost the same as 14.
Python
import math
def split_file():
    n = int(input())
    with open('hightemp.txt') as file:
        lines = file.readlines()
        num = math.ceil(len(lines) / n)
        for i in range(n):
            with open('split{}.txt'.format(i + 1), mode='w') as new_file:
                text = ''.join(lines[i * num:(i + 1) * num])
                new_file.write(text)
split_file()
UNIX
!split -n 5 -d hightemp.txt split_unix
Result (Python)
==> split1.txt <==
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
==> split5.txt <==
Osaka Toyonaka 39.9 1994-08-08
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
The above is the Python output, confirmed with `!head split1.txt split5.txt`.
On the other hand, the above UNIX command does not work in my environment, so I tried it on Colab (Google Colaboratory). It seems that the `-n` option is generally available on Linux (@IT), but it does not appear to work with the default macOS `split`. When I ran the above in Colab, five files from `split_unix00` to `split_unix04` were created, but adding the `.txt` extension to them afterwards felt a little troublesome. [@moriwo's article](https://qiita.com/moriwo/items/9d2a73a75f543e2ea6af#16-%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%82%92n%E5%88%86%E5%89%B2%E3%81%99%E3%82%8B) introduces an implementation example using `awk` and the like, but I wondered if the code would be easier to read in Python.
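For reference, renaming the Colab output files can also be done in a few lines of Python. This is only a sketch under the assumption that the files `split_unix00` to `split_unix04` produced by the `split` command above exist in the current directory:

Python
from pathlib import Path

# Append a .txt extension to each split_unix00 ... split_unix04 file
# (a sketch; assumes these files were created by the split command above).
for path in sorted(Path('.').glob('split_unix0[0-4]')):
    path.rename(path.with_suffix('.txt'))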
Python
import pandas as pd
def get_chars_set():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    return set(df.iloc[:, 0])
print(get_chars_set())
Result (Python)
{'Chiba', 'Saitama', 'Yamagata Prefecture', 'Wakayama Prefecture', 'Shizuoka Prefecture', 'Kochi Prefecture', 'Osaka', 'Gifu Prefecture', 'Gunma Prefecture', 'Ehime Prefecture', 'Yamanashi Prefecture', 'Aichi prefecture'}
UNIX
!sort -u col1_unix.txt
Result (UNIX)
Chiba
Saitama
Osaka
Yamagata Prefecture
...
With UNIX commands, you can also pipe the sorting and deduplication steps together and write `!sort col1_unix.txt | uniq`.
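If you want the Python side to produce the same sorted, de-duplicated output as `sort -u`, one option is to sort a set built from the first-column file. This is just a small sketch, assuming the `col1.txt` created earlier exists:

Python
# Sorted, de-duplicated first-column values, analogous to `sort -u col1_unix.txt`
# (a sketch; assumes col1.txt from the earlier answer exists).
def sorted_unique_prefectures():
    with open('col1.txt') as file:
        return sorted({line.rstrip() for line in file})

print('\n'.join(sorted_unique_prefectures()))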
Python
def sort_rows():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    df.rename(columns={0: 'Prefect', 1: 'City', 2: 'Temp', 3: 'Date'}, inplace=True)
    df.sort_values(by='Temp', inplace=True)
    return df
sort_rows()
Result (Python)
Prefect City Temp Date
23 Aichi Prefecture Nagoya 39.9 1942-08-02
21 Yamanashi Prefecture Otsuki 39.9 1990-07-19
20 Osaka Toyonaka 39.9 1994-08-08
...
UNIX
!sort hightemp.txt -k 3
Result (UNIX)
Aichi Prefecture Nagoya 39.9 1942-08-02
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Osaka Toyonaka 39.9 1994-08-08
...
I was not sure whether "reverse order" here means the reverse of the original order or descending order, but I solved it with the latter interpretation. This problem in particular made me feel how concise UNIX commands can be.
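Incidentally, if you want to make the descending-order reading explicit, pandas accepts an `ascending=False` argument (and `sort` has a `-r` option). The following is only a sketch, reusing the same column names as in the answer above:

Python
import pandas as pd

# Sort by the Temp column in descending order
# (a sketch; column names follow the answer above).
def sort_rows_desc():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None,
                     names=['Prefect', 'City', 'Temp', 'Date'])
    return df.sort_values(by='Temp', ascending=False)

print(sort_rows_desc().head())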
Python
def count_freq():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    return df[0].value_counts()
count_freq()
Result (Python)
Gunma Prefecture 3
Yamanashi 3
Yamagata Prefecture 3
Saitama Prefecture 3
UNIX
!cut -f 1 hightemp.txt | sort | uniq -c | sort -r
Result (UNIX)
3 Gunma Prefecture
3 Yamanashi Prefecture
3 Yamagata Prefecture
3 Saitama Prefecture
The UNIX command is a bit longer, but I wrote it so that the flow is easy to follow: first cut out the first column (`cut -f 1`), then count the frequency of each value (`sort | uniq -c`), and finally arrange the lines in reverse order of frequency (`sort -r`).
However, considering how convenient `value_counts()` in pandas is, I think the Python version is easier to understand here.
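As another option, the same counting can be done without pandas by using `collections.Counter`; this is a minimal sketch of that idea:

Python
from collections import Counter

# Count how often each value appears in the first column and
# list the results in descending order of frequency.
def count_freq_counter():
    with open('hightemp.txt') as file:
        prefectures = [line.split('\t')[0] for line in file]
    return Counter(prefectures).most_common()

print(count_freq_counter())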
That's all for this chapter. If you find any mistakes, please let me know in the comments.