I am working through the 100 Language Processing Knocks (nlp100), published on the web page of the Inui-Okazaki Laboratory at Tohoku University, as training in natural language processing and Python. I plan to make a note of the code I implemented and the techniques worth remembering. The code will also be published on GitHub.
This article continues from Chapter 1 and Chapter 2, Part 1.
For this chapter only, I think it is worth including brief explanations of UNIX commands as well as Python.
For detailed options of UNIX commands, check the man command or ITpro's website and you will be able to study them properly!
hightemp.txt is a file that stores records of the highest temperatures in Japan in a tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create programs that perform the following processing, and run them with hightemp.txt as the input file. Furthermore, perform the same processing with UNIX commands and check the results of the programs against them.
Receive the natural number N by means such as a command-line argument, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
16.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 16.py
import sys


def split_file(filename, number_of_parts):
    with open(filename) as f:
        lines = f.readlines()
    if len(lines) % number_of_parts != 0:
        raise Exception("Undividable by N=%d" % number_of_parts)
    else:
        # Integer division: a float cannot be used as a slice index
        number_of_lines = len(lines) // number_of_parts
    for i in range(number_of_parts):
        with open("split%s.txt" % str(i), "w") as w:
            w.writelines(lines[number_of_lines * i: number_of_lines * (i + 1)])


if __name__ == '__main__':
    try:
        split_file(sys.argv[1], int(sys.argv[2]))
    except Exception as err:
        print("Error: %s" % err)
The tricky part of splitting into N is the case where the original number of lines is not divisible by N.
There are various ways to deal with that case, but this time I chose to raise an error.
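One of the other ways to handle the non-divisible case is to spread the leftover lines over the first few parts. The following is only a minimal sketch of that idea, not what 16.py does; `split_lines` and the sample lines are hypothetical names made up for illustration.

```python
def split_lines(lines, number_of_parts):
    """Split lines into number_of_parts chunks whose sizes differ by at most one."""
    base, extra = divmod(len(lines), number_of_parts)
    parts = []
    start = 0
    for i in range(number_of_parts):
        # The first `extra` chunks each take one additional line
        size = base + (1 if i < extra else 0)
        parts.append(lines[start:start + size])
        start += size
    return parts


# Seven made-up lines split into three parts
sample = ["line%d\n" % i for i in range(7)]
chunk_sizes = [len(part) for part in split_lines(sample, 3)]
print(chunk_sizes)  # [3, 2, 2]
```

With this approach no error is needed, at the cost of parts that are not all the same length.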
Ordinary exception handling can be written in the following form:
try:
    # Processing that is likely to cause an error
    a = 10 / 0  # ex) division by zero raises ZeroDivisionError
except ZeroDivisionError as err:
    # What to do when the error occurs.
    # With "except [Error name] as err:", the error content is stored in err.
    print("Unable to divide by 0")
finally:
    # Processing that must be performed regardless of whether an error occurs
    pass
It works even if you do not specify the type of error caught in the except clause, but it is recommended to specify it whenever possible. An except clause without a type catches any error whatsoever and forces the program to keep running, which makes it a hotbed of bugs.
Furthermore, this time I raise the error myself.
With some effort I could implement my own exception, but I still don't understand Python classes, so I decided to use the existing `Exception` this time.
To generate an error, write `raise Exception("Error message")`. The error is then forcibly generated at that location.
This time, an error is generated when the number of lines in the file is not divisible by the natural number N.
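For reference, a self-made exception does not need much code. The sketch below is written for Python 3 and is not part of 16.py; `UndividableError` and `lines_per_part` are hypothetical names chosen only for this illustration.

```python
class UndividableError(Exception):
    """Hypothetical exception for a line count that N does not divide evenly."""

    def __init__(self, number_of_lines, number_of_parts):
        message = "%d lines are undividable by N=%d" % (number_of_lines, number_of_parts)
        super().__init__(message)
        self.number_of_lines = number_of_lines
        self.number_of_parts = number_of_parts


def lines_per_part(number_of_lines, number_of_parts):
    # Raise the dedicated exception instead of a plain Exception
    if number_of_lines % number_of_parts != 0:
        raise UndividableError(number_of_lines, number_of_parts)
    return number_of_lines // number_of_parts


try:
    lines_per_part(24, 5)
except UndividableError as err:
    caught_message = str(err)
    print("Error:", caught_message)
```

A dedicated class lets the caller write `except UndividableError:` and leave unrelated errors alone, which fits the advice about specifying the error type.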
if __name__ == '__main__':
When I first saw the line `if __name__ == '__main__':`, I was puzzled too. With this line, the body of the `if` statement is executed when the program is run directly. Conversely, when the file is loaded indirectly from another program, for example by `import`, the body of the `if` statement is not executed.
In my Qiita articles I have basically omitted this line, but because of the exception handling I include it this time.
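The difference can be observed in a small experiment. The sketch below (Python 3) writes a tiny throwaway module to a temporary directory and uses the standard `runpy` module to execute it under two different names. This only emulates "imported" versus "run directly"; the module name `demo_module` and its contents are made up for the demonstration.

```python
import os
import runpy
import tempfile

# Source of a tiny, hypothetical module used only for this demonstration
SOURCE = """
events = ["top level"]
if __name__ == '__main__':
    events.append("main block")
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_module.py")
    with open(path, "w") as f:
        f.write(SOURCE)
    # run_name controls the value the executed code sees in __name__
    imported_events = runpy.run_path(path, run_name="demo_module")["events"]
    script_events = runpy.run_path(path, run_name="__main__")["events"]

print(imported_events)  # ['top level']
print(script_events)    # ['top level', 'main block']
```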
# When splitting into three, 24 ÷ 3 = 8, so a new file is generated every 8 lines
$ split -l 8 hightemp.txt
$ ls
xaa xab xac hightemp.txt
$ cat xaa
Kochi Prefecture Ekawasaki 41 2013-08-12
(Omitted...)
Yamanashi Prefecture Katsunuma 40.5 2013-08-10
$ cat xab
Saitama Prefecture Koshigaya 40.4 2007-08-16
(Omitted...)
Yamagata Prefecture Sakata 40.1 1978-08-03
$ cat xac
Gifu Prefecture Mino 40 2007-08-16
(Omitted...)
Aichi Prefecture Nagoya 39.9 1942-08-02
By design, `split -l` takes the number of lines per output file, so it seems difficult to specify an N-way split directly. I couldn't think of a way to do the division on the command line, so I leave it to the person running the command to calculate 24 ÷ N and pass the result as the option.
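The 24 ÷ N calculation can also be delegated to a few lines of Python that build the command to run. This is only a sketch: `split_command` is a hypothetical helper, and rounding up in the non-divisible case is my own assumption (it yields at most N files, possibly fewer).

```python
def split_command(total_lines, number_of_parts, filename="hightemp.txt"):
    """Build a `split -l` command line for an (at most) N-way split."""
    # Ceiling division, so that no more than number_of_parts files are produced
    lines_per_file = -(-total_lines // number_of_parts)
    return "split -l %d %s" % (lines_per_file, filename)


print(split_command(24, 3))  # split -l 8 hightemp.txt
print(split_command(24, 5))  # split -l 5 hightemp.txt
```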
Find the distinct strings in the first column (the set of different strings). Use the sort and uniq commands for confirmation.
17.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 17.py
import sys

filename = sys.argv[1]
prefectures = set()
with open(filename) as f:
    line = f.readline()
    while line:
        prefectures.add(line.split()[0])
        line = f.readline()
for pref in prefectures:
    print(pref)
`set` is a collection that does not allow duplicates, so it is a perfect fit here.
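As a small illustration of that property, the sketch below collects the first column of a few tab-delimited rows into a set. The rows are a hand-typed sample modelled on hightemp.txt (with shortened names), not the real file.

```python
# Hand-typed sample rows in the same tab-delimited layout as hightemp.txt
rows = [
    "Kochi\tEkawasaki\t41\t2013-08-12",
    "Saitama\tKumagaya\t40.9\t2007-08-16",
    "Saitama\tKoshigaya\t40.4\t2007-08-16",
]

# Adding the same value twice leaves only one copy in the set
prefectures = {row.split("\t")[0] for row in rows}
print(sorted(prefectures))  # ['Kochi', 'Saitama']
```

Note that a set has no defined order, which is why the sketch sorts it before printing.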
$ cut -f 1 hightemp.txt | sort | uniq
Chiba Prefecture
Saitama Prefecture
Osaka Prefecture
(Omitted...)
Shizuoka Prefecture
Kochi Prefecture
Wakayama Prefecture
The problem says to use `sort` and `uniq`, but I also used `cut` because the other columns get in the way. Intuitively it seems that `uniq` could be used alone, but `uniq` assumes that its input is sorted (it only removes adjacent duplicate lines), so `sort` is essential.
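The behaviour of `uniq` on unsorted input can be imitated in Python with `itertools.groupby`, which likewise only groups adjacent equal items. A minimal sketch with made-up names (the `uniq` function here is a hypothetical stand-in, not the real command):

```python
from itertools import groupby


def uniq(items):
    """Collapse runs of adjacent equal items, like the uniq command."""
    return [key for key, _group in groupby(items)]


names = ["Saitama", "Gunma", "Saitama", "Gunma"]
print(uniq(names))          # ['Saitama', 'Gunma', 'Saitama', 'Gunma']
print(uniq(sorted(names)))  # ['Gunma', 'Saitama']
```

Without sorting, nothing is removed because no two equal names are next to each other.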
Sort the rows in descending order of the number in the third column (note: reorder the rows without changing their contents). Use the sort command for confirmation (for this problem, the result does not have to match the output of the command).
18.py
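#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 18.py
import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()
# Compare the third column as a number rather than as a string
for line in sorted(lines, key=lambda x: float(x.split()[2]), reverse=True):
    # Each line already ends with a newline, so write it as-is
    sys.stdout.write(line)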
The sorted output in the final `for` statement uses something that already appeared in Chapter 1. At that time I used it without explanation, so I will explain it here.
lambda
Sorting is easy when the items are plain numbers, but this time we have to sort by a number that sits in the middle of each string. The sort key is therefore specified using the anonymous function `lambda`.
Anonymous functions and `lambda` may sound difficult, but they are almost the same as the functions that have appeared so far, except that they are written in one line.
# Calculate 3 times 2
# Function definition by def
def triple(x):
    return x * 3
print(triple(2))  # 6
# Function definition by lambda
triple = lambda x: x * 3
print(triple(2))  # 6
In this implementation, the `lambda` expresses a function that takes a line as its argument and returns its third column as the sort key.
You can write ad hoc functions quickly, so it is very convenient once you can use it well.
sorted()
As the name suggests, it is a function that sorts a list. Note that `sorted()` does not sort the list received as an argument in place; it returns a new sorted list as its return value, so be careful when handling it.
You can use `key` to specify what to sort by, and `reverse` to switch between ascending and descending order.
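The difference between `sorted()` and the in-place `list.sort()` method, together with `key` and `reverse`, can be checked with a short sketch. The three rows are a hand-typed sample modelled on hightemp.txt, and `third_column` is a hypothetical helper name.

```python
rows = ["Aichi Nagoya 39.9", "Kochi Ekawasaki 41", "Saitama Koshigaya 40.4"]


def third_column(row):
    # Sort key: the third whitespace-separated field, as a number
    return float(row.split()[2])


# sorted() returns a new list and leaves `rows` untouched
descending = sorted(rows, key=third_column, reverse=True)
print(descending)  # ['Kochi Ekawasaki 41', 'Saitama Koshigaya 40.4', 'Aichi Nagoya 39.9']
print(rows[0])     # Aichi Nagoya 39.9

# list.sort() reorders the list itself and returns None
result = rows.sort(key=third_column)
print(result)      # None
print(rows[0])     # Aichi Nagoya 39.9 (now first because it is the smallest)
print(rows[-1])    # Kochi Ekawasaki 41
```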
$ sort -r -k 3 hightemp.txt
Kochi Prefecture Ekawasaki 41 2013-08-12
Gifu Prefecture Tajimi 40.9 2007-08-16
Saitama Prefecture Kumagaya 40.9 2007-08-16
(Omitted...)
Osaka Prefecture Toyonaka 39.9 1994-08-08
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
`-r` specifies descending order, and `-k 3` specifies the third column as the sort key.
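One caveat, in both the shell and Python, is that comparing numbers as character strings is not the same as comparing them as numbers (the `sort` command compares as strings unless an option such as `-n` is given). The values in hightemp.txt all have two-digit integer parts, so the difference does not show up there. The sketch below uses made-up values to make it visible.

```python
# Made-up values chosen so that string order and numeric order disagree
temps = ["9.5", "41", "100.2", "39.9"]

as_strings = sorted(temps, reverse=True)
as_numbers = sorted(temps, key=float, reverse=True)

print(as_strings)  # ['9.5', '41', '39.9', '100.2']
print(as_numbers)  # ['100.2', '41', '39.9', '9.5']
```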
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
19.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 19.py
import sys
from collections import defaultdict

filename = sys.argv[1]
prefectures = defaultdict(int)
with open(filename) as f:
    line = f.readline()
    while line:
        prefectures[line.split()[0]] += 1
        line = f.readline()
for k, v in sorted(prefectures.items(), key=lambda x: x[1], reverse=True):
    print(k)
defaultdict
The troublesome thing when working with a dictionary is that you can't write something like `dict[key] += 1` for a key that doesn't exist yet, because there is no initial value.
`defaultdict` solves this problem: you declare the dictionary together with a way to produce initial values.
Since we specified `int` this time, the initial value is 0, but you can also set initial values of a more complicated form. It is very useful.
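A short sketch of both cases: `int` as the default for counting, and `list` as an example of a more complicated initial value. The names and values are made up for illustration.

```python
from collections import defaultdict

# int() returns 0, so a missing key starts counting from 0
counts = defaultdict(int)
for name in ["Saitama", "Gunma", "Saitama"]:
    counts[name] += 1
print(dict(counts))  # {'Saitama': 2, 'Gunma': 1}

# list() returns [], so a missing key starts as an empty list
points = defaultdict(list)
points["Saitama"].append("Kumagaya")
points["Saitama"].append("Koshigaya")
print(dict(points))  # {'Saitama': ['Kumagaya', 'Koshigaya']}
```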
$ cut -f 1 hightemp.txt | sort | uniq -c | sort -r | cut -c 6-
Gunma Prefecture
Yamanashi Prefecture
Yamagata Prefecture
(Omitted...)
Kochi Prefecture
Ehime Prefecture
Osaka Prefecture
- `cut -f 1`: Cuts out the first column (prefecture name).
- `sort`: `uniq` will not work properly on unsorted input, so sort first.
- `uniq -c`: Deletes duplicates and outputs each unique line with the number of occurrences before deletion.
- `sort -r`: Since we want the most frequent first, sorts in descending order based on the number of occurrences.
- `cut -c 6-`: The output lines begin with extra spaces and the count, so only the 6th and subsequent characters are displayed to remove them.
This is also a combination of commands that have already appeared.
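For comparison, the counting and descending sort that the pipeline performs can also be written with `collections.Counter` from the standard library. This is an alternative sketch rather than the approach used in 19.py, and the list of names is made up, so the counts are not those of hightemp.txt.

```python
from collections import Counter

# Made-up first-column values standing in for the output of `cut -f 1`
first_column = ["Saitama", "Gunma", "Saitama", "Kochi", "Gunma", "Saitama"]

# most_common() returns (value, count) pairs, most frequent first
ranking = Counter(first_column).most_common()
for name, count in ranking:
    print(count, name)
```

Printing the count next to the name corresponds to the output of `uniq -c` before the final `cut -c 6-` removes it.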
Continue to Chapter 3.