[PYTHON] 100 Natural Language Processing Knock Chapter 2 UNIX Command Basics (Second Half)

This is a record of my solutions to the problems in the second half of Chapter 2. The results of running the corresponding UNIX commands are shown as well.

15. Output the last N lines

Receive a natural number N, for example as a command-line argument, and display only the last N lines of the input. Use the tail command to check the result.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import sys

if len(sys.argv) == 3:
    N = int(sys.argv[1])
    f = open(sys.argv[2])
    lines = f.readlines()[::-1][0:N][::-1]
    for line in lines:  # lines may hold fewer than N entries for a short file
        print line.strip()
    f.close()
else:
    print "please input \'N\' and \'FILENAME\'"

# (python problem15.py 5 hightemp.txt)
#=>Saitama Prefecture	Hatoyama	39.9	1997-07-05
#=>Osaka	Toyonaka	39.9	1994-08-08
#=>Yamanashi Prefecture	Otsuki	39.9	1990-07-19
#=>Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
#=>Aichi Prefecture	Nagoya	39.9	1942-08-02

Reverse the list of lines read from the file and take the first N entries, then reverse that result again so the lines are output in their original order. (I ignored readability and wrote this in one line.)
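The double reversal gives the same result as a single negative slice. Here is a minimal sketch in Python 3 (the script above is Python 2). The function name `last_n_lines` and the sample data are my own and are only for illustration.

```python
from collections import deque


def last_n_lines(lines, n):
    """Return the last n items of an iterable of lines, in original order."""
    if n <= 0:
        return []
    # A deque with maxlen keeps only the most recent n items, so the whole
    # input never has to be held in memory at once.
    return list(deque(lines, maxlen=n))


sample = ["a\n", "b\n", "c\n", "d\n"]

# Same result as the double reversal, and as a plain negative slice.
assert last_n_lines(sample, 2) == sample[::-1][0:2][::-1] == sample[-2:]

# Asking for more lines than exist simply returns everything.
print(last_n_lines(sample, 10))
```

Because the function accepts any iterable, an open file object could be passed directly instead of the result of readlines().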

tail -n 5 hightemp.txt

#=> (Output is the same as above)

16. Divide the file into N

Receive a natural number N, for example as a command-line argument, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import sys

if len(sys.argv) == 3:
    N = int(sys.argv[1])
    f = open(sys.argv[2])
    rows = f.readlines()
    n, mod = divmod(len(rows), N)
    if mod != 0:
        n += 1
    idx = 0
    for i in xrange(N):
        filename = "split_%s.txt" % (i + 1)
        g = open(filename, "w")
        for j in xrange(n):
            try:
                g.write(rows[idx + j])
            except IndexError:  # ran past the last row of the input
                break
        idx += n
        g.close()
    f.close()

else:
    print "please input \'N\' and \'FILENAME\'"

# python problem16.py 5 hightemp.txt
#=> (Output to split_1.txt through split_5.txt)

To split the original file into N parts, first compute n, the number of lines per output file (the total line count divided by N, rounded up). Then write that many lines to each output file in turn.
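The same calculation can be sketched with list slicing, which avoids the try/except. This is a Python 3 sketch under my own naming (`split_into_n` is hypothetical), and the 24 dummy rows stand in for the 24 rows of hightemp.txt.

```python
def split_into_n(rows, n):
    """Split rows into n consecutive chunks of at most ceil(len(rows)/n) rows."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    size, remainder = divmod(len(rows), n)
    if remainder:
        size += 1  # round up so that n chunks cover every row
    # Slicing past the end of a list yields a shorter (or empty) list,
    # so no exception handling is needed for the last chunks.
    return [rows[i * size:(i + 1) * size] for i in range(n)]


rows = ["line %d\n" % i for i in range(24)]  # dummy stand-in for 24 rows
chunks = split_into_n(rows, 5)
print([len(c) for c in chunks])  # [5, 5, 5, 5, 4]
```

Each chunk could then be written to its own file with writelines(). When N is larger than the number of rows, the trailing chunks are simply empty.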

split -l 5 hightemp.txt out.

#=> (Output to files out.aa, out.ab, ..., out.ae)

Note that the argument means different things in the two cases: the Python script takes the number of output files N, while split -l takes the number of lines per file. I am not sure whether this counts as a correct answer...
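One way to bridge the two meanings is to compute the split -l value from N. The following is a rough Python 3 sketch. The helper name `split_command_for` and its default arguments are assumptions of mine, and it only builds the command without running it.

```python
import math


def split_command_for(total_lines, n, filename="hightemp.txt", prefix="out."):
    """Build a `split -l` argument list that yields at most n output files."""
    lines_per_file = math.ceil(total_lines / n)
    return ["split", "-l", str(lines_per_file), filename, prefix]


# With 24 lines and N = 5, the per-file line count also happens to be 5.
print(" ".join(split_command_for(24, 5)))  # split -l 5 hightemp.txt out.
# With N = 4 the two numbers differ: 4 files of 6 lines each.
print(" ".join(split_command_for(24, 4)))  # split -l 6 hightemp.txt out.
```

So for this particular file, N = 5 and "5 lines per file" coincide, which is probably why the two results look the same here.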

17. Distinct strings in the first column

Find the distinct strings in the first column (the set of unique values). Use the sort and uniq commands for confirmation.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import sys

if len(sys.argv) == 2:
    f = open(sys.argv[1])
    str_set = set()
    for line in f.readlines():
        str_set.add(line.split()[0])
    f.close()
    for s in str_set:
        print s
else:
    print "please input \'FILENAME\'"

# (python problem17.py hightemp.txt)
#=>Aichi Prefecture
#=>Yamagata Prefecture
#=>Gifu Prefecture
#=>Chiba Prefecture
#=>Saitama Prefecture
#=>Kochi Prefecture
#=>Gunma Prefecture
#=>Yamanashi Prefecture
#=>Wakayama Prefecture
#=>Ehime Prefecture
#=>Osaka
#=>Shizuoka Prefecture

cat col1.txt | sort | uniq

#=> (Output is the same as above)
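A set has no guaranteed iteration order, so the script's output may come out in a different order from sort | uniq. Sorting the set makes the result repeatable. Below is a minimal Python 3 sketch with a hypothetical function name and made-up sample rows. Python sorts by code point, which may still differ from the locale-dependent order that sort uses.

```python
def distinct_first_column(lines):
    """Return the sorted distinct values of the first whitespace-separated column."""
    # A set comprehension removes duplicates; sorted() fixes the output order.
    return sorted({line.split()[0] for line in lines if line.strip()})


# Made-up rows in the same four-column layout as hightemp.txt.
sample = [
    "Kochi\tEkawasaki\t41\t2013-08-12\n",
    "Saitama\tKumagaya\t40.9\t2007-08-16\n",
    "Saitama\tKoshigaya\t40.4\t2007-08-16\n",
]
print(distinct_first_column(sample))  # ['Kochi', 'Saitama']
```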

18. Sort each row in descending order of the numbers in the third column

Sort the rows in descending order of the numeric value in the third column (note: reorder the rows without changing the contents of each row). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import sys

if len(sys.argv) == 2:
    f = open(sys.argv[1])
    lines = f.readlines()
    sorted_lines = sorted(lines, key=(lambda x:float(x.split()[2])), reverse=True)
    g = open("sorted_hightemp.txt", "w")
    for line in sorted_lines:
        g.write(line.strip() + "\n")
    g.close()
    f.close()
else:
    print "please input \'FILENAME\'"

# (python problem18.py hightemp.txt)
#=>Kochi Prefecture	Ekawasaki	41	2013-08-12
#=>Saitama Prefecture	Kumagaya	40.9	2007-08-16
#=>Gifu Prefecture	Tajimi	40.9	2007-08-16
#=>Yamagata Prefecture	Yamagata	40.8	1933-07-25
#=>Yamanashi Prefecture	Kofu	40.7	2013-08-10
#=>Wakayama Prefecture	Katsuragi	40.6	1994-08-08
#=>Shizuoka Prefecture	Tenryu	40.6	1994-08-04
#=>Yamanashi Prefecture	Katsunuma	40.5	2013-08-10
#=>Saitama Prefecture	Koshigaya	40.4	2007-08-16
#=>Gunma Prefecture	Tatebayashi	40.3	2007-08-16
#=>Gunma Prefecture	Kamisatomi	40.3	1998-07-04
#=>Aichi Prefecture	Aisai	40.3	1994-08-05
#=>Chiba Prefecture	Ushiku	40.2	2004-07-20
#=>Shizuoka Prefecture	Sakuma	40.2	2001-07-24
#=>Ehime Prefecture	Uwajima	40.2	1927-07-22
#=>Yamagata Prefecture	Sakata	40.1	1978-08-03
#=>Gifu Prefecture	Mino	40	2007-08-16
#=>Gunma Prefecture	Maebashi	40	2001-07-24
#=>Chiba Prefecture	Mobara	39.9	2013-08-11
#=>Saitama Prefecture	Hatoyama	39.9	1997-07-05
#=>Osaka	Toyonaka	39.9	1994-08-08
#=>Yamanashi Prefecture	Otsuki	39.9	1990-07-19
#=>Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
#=>Aichi Prefecture	Nagoya	39.9	1942-08-02

I wrote the program to sort in descending order because that is what the problem title says, but the original file is already in descending order of the third column... Am I misreading the point of the problem? If you want the opposite (ascending) order, set reverse=False.

sort -n -r -k 3,3 hightemp.txt

#=> (Output omitted)

If you want to sort in ascending order, drop the -r option.
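The float() conversion in the sort key matters: if the third column is compared as text, values with a different number of digits end up in the wrong order. The sort command makes the same distinction with its -n option. Here is a small Python 3 sketch. The function name and the sample rows are made up for illustration and do not come from hightemp.txt.

```python
def sort_by_third_column(lines, descending=True):
    """Sort rows by the numeric value of their third whitespace-separated column."""
    return sorted(lines, key=lambda row: float(row.split()[2]), reverse=descending)


# Made-up rows, chosen so that text order and numeric order disagree.
sample = ["A x 9.5 d1", "B y 10.2 d2", "C z 10 d3"]

print(sort_by_third_column(sample))
# ['B y 10.2 d2', 'C z 10 d3', 'A x 9.5 d1']

# Comparing the column as text puts "9.5" first, because "9" > "1".
print(sorted(sample, key=lambda row: row.split()[2], reverse=True))
# ['A x 9.5 d1', 'B y 10.2 d2', 'C z 10 d3']
```

In hightemp.txt every temperature has two integer digits, so both comparisons happen to give nearly the same order there.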

19. Find the frequency of each string in the first column and arrange them in descending order of frequency

Count how often each string appears in the first column of each line, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import sys

if len(sys.argv) == 2:
    f = open(sys.argv[1])
    lines = f.readlines()
    f.close()
    count = {}  # first-column value -> number of occurrences
    for line in lines:
        l = line.split()[0]
        if l in count:  # "in" works where dict.has_key() is deprecated
            count[l] += 1
        else:
            count[l] = 1
    for k, v in sorted(count.items(), key=(lambda x:x[1]), reverse=True):
        print k
else:
    print "please input \'FILENAME\'"

# (python problem19.py hightemp.txt)
#=>Yamagata Prefecture
#=>Saitama Prefecture
#=>Gunma Prefecture
#=>Yamanashi Prefecture
#=>Aichi Prefecture
#=>Gifu Prefecture
#=>Chiba Prefecture
#=>Shizuoka Prefecture
#=>Kochi Prefecture
#=>Wakayama Prefecture
#=>Ehime Prefecture
#=>Osaka

Build a dict whose keys are the first-column values and whose values are their occurrence counts, then output the keys in descending order of count.

cut -f 1 hightemp.txt | sort  | uniq -c | sort -n -r | less

#=> (Output omitted)
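The counting dict can also be replaced by collections.Counter from the standard library. This is a minimal Python 3 sketch, not the script I actually ran. The function name and sample rows are hypothetical. It prints the count next to each value, similar to what uniq -c shows.

```python
from collections import Counter


def first_column_frequencies(lines):
    """Return (value, count) pairs for the first column, most frequent first."""
    counts = Counter(line.split()[0] for line in lines if line.strip())
    # most_common() sorts by count, descending.
    return counts.most_common()


sample = ["a 1\n", "b 2\n", "a 3\n", "c 4\n", "a 5\n", "b 6\n"]
for value, count in first_column_frequencies(sample):
    print(count, value)
# 3 a
# 2 b
# 1 c
```

Values with the same count may appear in a different order than in the output of the shell pipeline, since the two break ties differently.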
