Introduction

To train myself in natural language processing and Python, I am taking on the 100 Language Processing Knock (nlp100) published on the web page of the Inui-Okazaki Laboratory at Tohoku University. I plan to make notes on the code I implement and the techniques worth mastering. The code will also be published on GitHub.

This is a continuation of Chapter 1 and Chapter 2, Part 1.

For this chapter only, I think it is worth adding a short explanation of the UNIX commands as well as the Python. For the detailed options of each UNIX command, check the man command or ITpro's website and you will be able to study them properly!

Chapter 2: UNIX Command Basics (Repost)

hightemp.txt is a file that stores records of the highest temperatures in Japan in tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create a program that performs the following processing and run it with hightemp.txt as the input file. Furthermore, run the same processing with UNIX commands and check the program's results against them.

16. Divide the file into N

Receive a natural number N by means such as a command line argument, and split the input file into N parts on line boundaries. Achieve the same processing with the split command.

Answer in Python

`16.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 16.py

import sys


def split_file(filename, number_of_parts):
    # Read the whole file so that the number of lines is known up front
    with open(filename) as f:
        lines = f.readlines()

    # Refuse to split when the lines cannot be shared out evenly
    if len(lines) % number_of_parts != 0:
        raise Exception("Undividable by N=%d" % number_of_parts)

    # Integer division keeps the slice indices below as ints
    number_of_lines = len(lines) // number_of_parts

    for i in range(number_of_parts):
        with open("split%d.txt" % i, "w") as w:
            w.writelines(lines[number_of_lines * i: number_of_lines * (i + 1)])

if __name__ == '__main__':
    try:
        split_file(sys.argv[1], int(sys.argv[2]))
    except Exception as err:
        print("Error: %s" % err)

Comments on Python Answers

Exception handling

The tricky part of splitting into N is when the original number of lines is not divisible by N. There are various ways to handle that case, but this time I raise an error.

Normal exception handling can be written in the following form:

try:
    # Processing that is likely to cause an error
    a = 10 / 0  # division by zero raises ZeroDivisionError
except ZeroDivisionError as err:
    # What to do when the error occurs; "as err" stores the error in err
    print("Unable to divide by 0: %s" % err)
finally:
    # Processing that must be performed whether or not an error occurred
    print("Finished")

It works even if you do not specify the type of error in the except clause, but it is recommended to specify it whenever possible. An except without a type catches every error and lets the program carry on regardless, which makes it a hotbed of bugs.

Furthermore, this time I raise the exception myself. With some effort I could implement my own exception, but I do not yet understand Python classes well, so I decided to use the existing `Exception` this time.

To raise an error, write raise Exception("Error message"). The error is then forcibly raised at that point. This time, an error is raised when the number of lines in the file is not divisible by the natural number N.
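As a rough sketch of the "own exception" route mentioned above: a custom exception is just a class that inherits from an existing one. The names UndividableError and lines_per_part are hypothetical and do not appear in 16.py.

```python
class UndividableError(ValueError):
    """Hypothetical exception: the line count is not divisible by N."""


def lines_per_part(total_lines, number_of_parts):
    # Raise our own exception type instead of the generic Exception
    if total_lines % number_of_parts != 0:
        raise UndividableError(
            "%d lines are undividable by N=%d" % (total_lines, number_of_parts))
    return total_lines // number_of_parts


try:
    lines_per_part(24, 5)
except UndividableError as err:
    # Only this specific error is caught; anything else still surfaces
    print("Error: %s" % err)
```

Because the except clause names the specific type, an unrelated mistake such as a typo in a variable name would not be swallowed along with it.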

if __name__ == '__main__':

When I first saw the line `if __name__ == '__main__':`, I was puzzled too. With this line in place, the body of the `if` statement is executed when the program is run directly. Conversely, when the program is loaded indirectly from another program, for example with `import`, the body of the `if` statement is not executed. I have generally omitted this line in my Qiita articles, but because of the exception handling I include it this time.
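A minimal sketch of the behaviour described above. The helper describe is a hypothetical name used only for illustration; the point is that Python sets `__name__` to "__main__" only in the file that was launched directly.

```python
def describe(module_name):
    # __name__ is "__main__" only in the file that was launched directly
    if module_name == "__main__":
        return "run directly"
    return "imported as %s" % module_name


if __name__ == "__main__":
    # Skipped when this file is loaded with import
    print(describe(__name__))
```

Saving this under some file name and importing it from another script should print nothing, while running the file itself prints the message.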

UNIX answer

// When dividing into three, 24 ÷ 3 = 8, so a file is generated every 8 lines
$ split -l 8 hightemp.txt 
$ ls
xaa          xab          xac          hightemp.txt

$ cat xaa
Kochi Prefecture	Ekawasaki	41	2013-08-12
(Omitted...)
Yamanashi Prefecture	Katsunuma	40.5	2013-08-10

$ cat xab
Saitama Prefecture	Koshigaya	40.4	2007-08-16
(Omitted...)
Yamagata Prefecture	Sakata	40.1	1978-08-03

$ cat xac
Gifu Prefecture	Mino	40	2007-08-16
(Omitted...)
Aichi Prefecture	Nagoya	39.9	1942-08-02

Comments on UNIX Answers

split takes the number of lines per output file, so it seems difficult to specify an N-way split directly. I could not think of a way to do the division on the command line, so I leave it to the person running the command to calculate 24 ÷ N and pass the result as the option.
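The 24 ÷ N calculation can be sketched in Python as follows. This is only an illustration under the assumption that rounding up is acceptable when N does not divide the line count; lines_per_file and chunk are hypothetical helpers, and the 24 lines are stand-ins for hightemp.txt.

```python
def lines_per_file(total_lines, number_of_parts):
    # Ceiling division without floats: the last file may end up shorter
    return -(-total_lines // number_of_parts)


def chunk(lines, size):
    # Cut the list into consecutive pieces of at most `size` lines
    return [lines[i:i + size] for i in range(0, len(lines), size)]


lines = ["line %d\n" % n for n in range(24)]  # stand-in for the 24 lines
size = lines_per_file(len(lines), 5)
print(size, [len(part) for part in chunk(lines, size)])
```

Note that rounding up can produce fewer than N pieces: with 24 lines and N = 7 the size is 4, which gives only 6 pieces.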

17. Distinct strings in the first column

Find the distinct strings in the first column (the set of different strings). Use the sort and uniq commands for confirmation.

Answer in Python

`17.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 17.py

import sys

filename = sys.argv[1]
prefectures = set()

with open(filename) as f:
    # Iterate over the file line by line and collect the first column
    for line in f:
        prefectures.add(line.split()[0])

for pref in prefectures:
    print(pref)

Comments on Python Answers

A set is a collection that does not allow duplicates, so it is a perfect fit here.
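A small sketch of that property, using made-up sample values rather than the real file:

```python
first_column = ["Saitama", "Gifu", "Yamagata", "Saitama", "Gifu"]

prefectures = set()
for name in first_column:
    prefectures.add(name)  # adding an existing element changes nothing

# A set has no defined order, so sort it for stable display
print(sorted(prefectures))
```

Since a set is unordered, the output order of 17.py is not guaranteed to match that of the sort command.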

UNIX answer

$ cut -f 1 hightemp.txt | sort | uniq
Chiba Prefecture
Saitama Prefecture
Osaka Prefecture
(Omitted...)
Shizuoka Prefecture
Kochi Prefecture
Wakayama Prefecture

Comments on UNIX Answers

The problem says to use sort and `uniq`, but I also used cut because the other columns get in the way. Intuitively it seems that `uniq` could be used alone, but `uniq` assumes its input is sorted (it only collapses adjacent duplicate lines), so `sort` is essential.
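The same effect can be imitated in Python with itertools.groupby, which also groups only consecutive equal items. The function name uniq and the sample values here are assumptions for illustration.

```python
from itertools import groupby


def uniq(lines):
    # Like uniq: collapse only runs of identical adjacent lines
    return [key for key, _group in groupby(lines)]


names = ["Saitama", "Gifu", "Saitama", "Gifu"]
print(uniq(names))          # duplicates survive: none are adjacent
print(uniq(sorted(names)))  # sorting first makes duplicates adjacent
```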

18. Sort each row in descending order of the numbers in the third column

Arrange the rows in descending order of the number in the third column (note: reorder the rows without changing their contents). Use the sort command for confirmation (for this problem the output does not have to match the result of the command).

Answer in Python

`18.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 18.py

import sys

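with open(sys.argv[1]) as f:
    lines = f.readlines()

# Compare the third column as a number, not as a string
for line in sorted(lines, key=lambda x: float(x.split()[2]), reverse=True):
    # Each line already ends with a newline, so write it as-is
    sys.stdout.write(line)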

Comments on Python Answers

The sorting in the final for statement already appeared in Chapter 1. I used it without explanation then, so I will explain it here.

Anonymous function `lambda`

Sorting a plain list of numbers is easy, but this time we have to sort by a number that sits in the middle of each string. The sorting criterion is therefore specified with the anonymous function lambda. "Anonymous function" and "lambda" may sound difficult, but they are almost the same as the functions that have appeared so far, except that they are written in a single line.

# Triple the number 2

# Function definition by def
def triple(x):
    return x * 3
print(triple(2))  # 6

# Function definition by lambda
triple = lambda x: x * 3
print(triple(2))  # 6

In this implementation, lambda expresses a function that takes a line as its argument and returns its third column. Being able to write ad hoc functions quickly like this is very convenient once you get used to it.

sorted(): As the name suggests, this is a function that sorts a list. It returns a new sorted list and leaves the list passed as the argument unchanged (it is the list.sort() method that sorts in place and returns nothing), so be careful which one you use. Use key to specify what to sort by, and reverse to switch between ascending and descending order.
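A short sketch contrasting the two, with three sample rows in the tab-delimited layout of hightemp.txt. The helper temperature is a hypothetical name; it plays the role of the lambda above.

```python
rows = [
    "Aichi Prefecture\tNagoya\t39.9\t1942-08-02\n",
    "Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n",
    "Gifu Prefecture\tMino\t40\t2007-08-16\n",
]


def temperature(row):
    # Third tab-separated column, converted so it compares as a number
    return float(row.split("\t")[2])


# sorted() returns a new list; rows itself keeps its original order
by_temp = sorted(rows, key=temperature, reverse=True)

# list.sort() reorders the list in place and returns None
in_place = list(rows)
result = in_place.sort(key=temperature)

print([row.split("\t")[1] for row in by_temp])
```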

UNIX answer

$ sort -r -k 3 hightemp.txt
Kochi Prefecture	Ekawasaki	41	2013-08-12
Gifu Prefecture	Tajimi	40.9	2007-08-16
Saitama Prefecture	Kumagaya	40.9	2007-08-16
(Omitted...)
Osaka Prefecture	Toyonaka	39.9	1994-08-08
Yamanashi Prefecture	Otsuki	39.9	1990-07-19
Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
Aichi Prefecture	Nagoya	39.9	1942-08-02

Comments on UNIX Answers

-r specifies descending order, and -k 3 specifies the third column as the sort key. Without -n the key is compared as a string rather than as a number.

19. Find the frequency of appearance of the character string in the first column of each line, and arrange them in descending order of frequency of appearance.

Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.

Answer in Python

`19.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 19.py

import sys
from collections import defaultdict

filename = sys.argv[1]
prefectures = defaultdict(int)

with open(filename) as f:
    # Count how many times each first-column string appears
    for line in f:
        prefectures[line.split()[0]] += 1

for k, v in sorted(prefectures.items(), key=lambda x: x[1], reverse=True):
    print(k)

Comments on Python Answers

defaultdict: The troublesome thing when working with a dictionary is that you cannot write something like dict[key] += 1 for a key that does not exist yet, because it has no initial value. defaultdict solves this problem by letting you declare the dictionary together with a default initial value. Since we specified `int` this time, the initial value is 0, but you can also set more complicated initial values. It is very useful.
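A small sketch of the difference, using made-up keys. The list variant is included as one example of a "more complicated" initial value.

```python
from collections import defaultdict

plain = {}
try:
    plain["Saitama"] += 1  # the key does not exist yet
except KeyError as err:
    print("KeyError: %s" % err)

counts = defaultdict(int)  # missing keys start at int() == 0
counts["Saitama"] += 1
counts["Saitama"] += 1

points = defaultdict(list)  # missing keys start as an empty list
points["Saitama"].append("Kumagaya")
points["Saitama"].append("Koshigaya")

print(dict(counts), dict(points))
```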

UNIX answer

$ cut -f 1 hightemp.txt | sort | uniq -c | sort -r | cut -c 6-
Gunma Prefecture
Yamanashi Prefecture
Yamagata Prefecture
(Omitted...)
Kochi Prefecture
Ehime Prefecture
Osaka Prefecture

Comments on UNIX Answers

- `cut -f 1`: Cuts out the first column (prefecture name).
- `sort`: `uniq` does not work properly on unsorted input, so sort first.
- `uniq -c`: Removes duplicates and outputs each unique line together with its number of occurrences before removal.
- `sort -r`: Since we want the most frequent first, sorts in descending order by number of occurrences.
- `cut -c 6-`: Each output line begins with extra spaces and the count, so only the 6th and subsequent characters are displayed to strip them off.

This is also a combination of commands that have already appeared.
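For comparison, the same tally can be sketched in Python with collections.Counter. This is an alternative that 19.py does not use, and the sample values are made up.

```python
from collections import Counter

first_column = ["Saitama", "Gifu", "Saitama", "Kochi", "Gifu", "Saitama"]

# Counter tallies occurrences, much like sort | uniq -c
counts = Counter(first_column)

# most_common() lists (value, count) pairs from most to least frequent
for name, count in counts.most_common():
    print(count, name)
```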

In conclusion

Continued in Chapter 3.

100 Language Processing Knock with Python (Chapter 2, Part 2)
