The other day, 100 Language Processing Knock 2020 was released. I have only been working in natural language processing for about a year and am far from an expert, but I will solve all the problems and publish my solutions in order to improve my technical skills.
Everything is run on Jupyter Notebook, and the restrictions in the problem statements may be bent where convenient. The source code is also on GitHub.
Chapter 1 is here.
The environment is Python 3.8.2 and Ubuntu 18.04.
I think the article here is easier to understand as a commentary. I hope the authors write commentary articles all the way up to Chapter 10.
popular-names.txt is a tab-delimited file that stores the "name", "gender", "number of people", and "year" of babies born in the United States. Create programs that perform the following processing, using popular-names.txt as the input file. Furthermore, perform the same processing with UNIX commands and check the results against your programs.
Please download the required dataset from here.
The downloaded file is assumed to be placed under data/.
Count the number of lines. Use the wc command for confirmation.
code
with open('data/popular-names.txt') as f:
    print(len(list(f)))
output
2780
Just take the length of the file object. Since a file object is an iterator, it has to be converted to a list first. If the input file is large enough, it may not fit in memory, but in that case you can simply iterate over it with a for loop and count the lines, as sketched below.
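A minimal sketch of that memory-friendly variant, counting lines without materializing the whole file as a list (same input file assumed):

count = 0
with open('data/popular-names.txt') as f:
    for _ in f:  # iterate lazily, one line at a time
        count += 1
print(count)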
code
wc -l < data/popular-names.txt
output
2780
Find the number of lines by passing the -l option to the wc command. If you give it a file name, extra information (the file name) is also printed, so feed the file through standard input instead.
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
code
with open('data/popular-names.txt') as f:
    for line in f:
        line = line.strip()
        line = line.replace('\t', ' ')
        print(line)
output(First 10 lines)
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
Ida F 1472 1880
Alice F 1414 1880
Bertha F 1320 1880
Sarah F 1288 1880
Each string obtained by iterating over the file object has a newline character at the end, so remove it with strip() (rstrip('\n') may be preferable). Then just replace the tabs with spaces and print. Alternatively, you can keep the newline and use print(line, end='') instead of stripping, as sketched below.
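A minimal sketch of that alternative, which leaves the original newline in place and suppresses print's own:

with open('data/popular-names.txt') as f:
    for line in f:
        # the line already ends with '\n', so tell print not to add another
        print(line.replace('\t', ' '), end='')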
code
awk '{gsub("\t", " ", $0); print $0}' data/popular-names.txt
perl -pe 's/\t/ /g' data/popular-names.txt
sed 's/\t/ /g' data/popular-names.txt
expand -t 1 data/popular-names.txt
tr '\t' ' ' < data/popular-names.txt
The output is the same as the Python version, so I omit it (the same applies below). There are so many UNIX commands that I can't easily remember them all.
Save only the first column of each line as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
code
with open('data/popular-names.txt') as f, \
        open('result/col1.txt', 'w') as g, \
        open('result/col2.txt', 'w') as h:
    for line in f:
        line = line.strip()
        name, sex, _, _ = line.split('\t')
        print(name, file=g)
        print(sex, file=h)
I wrote it straightforwardly.
result(First 10 lines of col1.txt)
Mary
Anna
Emma
Elizabeth
Minnie
Margaret
Ida
Alice
Bertha
Sarah
result(First 10 lines of col2.txt)
F
F
F
F
F
F
F
F
F
F
code
cut -f 1 data/popular-names.txt > col1.txt
cut -f 2 data/popular-names.txt > col2.txt
It's easy with the cut command. awk '{print $1}' works just as well.
Combine the col1.txt and col2.txt created in problem 12 to create a text file in which the first and second columns of the original file are arranged side by side, tab-delimited. Use the paste command for confirmation.
All you have to do is open the two files, but while we're at it, let's use contextlib.ExitStack so that any number of files can be handled.
For context managers → https://docs.python.org/ja/3/library/stdtypes.html#typecontextmanager
code
from contextlib import ExitStack
code
files = ['result/col1.txt', 'result/col2.txt']
with ExitStack() as stack:
    files = [stack.enter_context(open(filename)) for filename in files]
    for lines in zip(*files):
        x = [line.strip() for line in lines]
        x = '\t'.join(x)
        print(x)
result(First 10 lines)
Mary F
Anna F
Emma F
Elizabeth F
Minnie F
Margaret F
Ida F
Alice F
Bertha F
Sarah F
code
paste result/col1.txt result/col2.txt
It's easy with the paste command.
Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.
Personally, I like to receive standard input and command line arguments with argparse and fileinput, but this time I want all the code to run on Jupyter Notebook, so I don't use command line arguments. (I appreciate the kindness of the "by means such as" in the problem statement.) A sketch of that argparse/fileinput style follows.
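For reference, a minimal sketch of what the argparse + fileinput approach could look like as a standalone script (a hypothetical head.py, not part of the notebook; the option names are my own):

import argparse
import fileinput
from itertools import islice

parser = argparse.ArgumentParser()
parser.add_argument('-n', type=int, default=5)  # number of lines to show
parser.add_argument('files', nargs='*')         # falls back to stdin if empty
args = parser.parse_args()

# fileinput reads the given files in order, or stdin when none are given
with fileinput.input(files=args.files) as f:
    for line in islice(f, args.n):
        print(line, end='')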
code
N = 5
with open('data/popular-names.txt') as f:
    lst = range(N)
    for _, line in zip(lst, f):
        print(line, end='')
output
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
You can do the same with the head command.
code
head -n 5 data/popular-names.txt
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
I think it would be fine to read the whole input into a list and take the last N elements, but a large file might not fit in memory, so I'll use a queue with a fixed maximum length.
code
from collections import deque
code
N = 5
queue = deque([], N)  # keeps only the last N lines
with open('data/popular-names.txt') as f:
    for line in f:
        queue.append(line)
for line in queue:
    print(line, end='')
output
Benjamin M 13381 2018
Elijah M 12886 2018
Lucas M 12585 2018
Mason M 12435 2018
Logan M 12352 2018
You can do the same with the tail command.
code
tail -n 5 data/popular-names.txt
Receive the natural number N by means such as command line arguments, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
This is probably because there are not many situations where you need to split a file into N parts, but depending on the implementation of the split command, splitting into N parts at line boundaries may not be available. It may be a GNU extension.
code(5 divisions)
split -d -nl/5 data/popular-names.txt result/shell5.
result(Check the number of lines with wc)
587 2348 11007 result/shell5.00
554 2216 11010 result/shell5.01
556 2224 11006 result/shell5.02
540 2160 11007 result/shell5.03
543 2172 10996 result/shell5.04
2780 11120 55026 total
I also implemented it in Python so that it behaves the same as this GNU-extension code (https://github.com/coreutils/coreutils/blob/master/src/split.c).
code
def split_string_list(N, lst):
    chunk_size = sum([len(x) for x in lst]) // N
    chunk_ends = [chunk_size * (n + 1) - 1 for n in range(N)]
    i = 0
    acc = 0
    out = []
    for chunk_end in chunk_ends:
        tmp = []
        while acc < chunk_end:
            tmp.append(lst[i])
            acc += len(lst[i])
            i += 1
        out.append(tmp)
    return out

def split_file(N, filepath, outprefix):
    with open(filepath) as f:
        lst = list(f)
    lst = split_string_list(N, lst)
    for i, lines in enumerate(lst):
        idx = str(i).zfill(2)  # Omission
        with open(outprefix + idx, 'w') as f:
            f.write(''.join(lines))

split_file(5, 'data/popular-names.txt', 'result/python5.')
First, the total number of characters is counted and the cutting positions (chunk_ends) are chosen so that the number of characters per chunk is as uniform as possible. Then lines are accumulated until the running total exceeds each element of chunk_ends, and each accumulated chunk is written out to a file.
result(Check the number of lines with wc)
587 2348 11007 result/python5.00
554 2216 11010 result/python5.01
556 2224 11006 result/python5.02
540 2160 11007 result/python5.03
543 2172 10996 result/python5.04
2780 11120 55026 total
result
diff result/python5.00 result/shell5.00
diff result/python5.01 result/shell5.01
diff result/python5.02 result/shell5.02
diff result/python5.03 result/shell5.03
diff result/python5.04 result/shell5.04
I got the same result.
Find the distinct strings in the first column (the set of unique strings). Use the cut, sort, and uniq commands for confirmation.
code
names = set()
with open('data/popular-names.txt') as f:
    for line in f:
        name = line.split('\t')[0]
        names.add(name)
names = sorted(names)
for name in names:
    print(name)
result(First 10 lines)
Abigail
Aiden
Alexander
Alexis
Alice
Amanda
Amelia
Amy
Andrew
Angela
The values in the first column are added to a set one by one, then sorted and printed. (*The Python version is 3.8.2.)
code
cut -f 1 data/popular-names.txt | sort -s | uniq
Take only the first column, sort it, and remove duplicates with uniq. If you forget the sort, the result will be wrong, because uniq only removes adjacent duplicates. The -s option is added to make the sort stable, so that it matches Python's behavior.
Sort the lines in descending order of the numbers in the third column (note: rearrange the lines without changing their contents). Use the sort command for confirmation (this problem's output does not have to exactly match the command's result).
code
with open('data/popular-names.txt') as f:
    lst = [line.strip() for line in f]
lst.sort(key=lambda x: -int(x.split('\t')[2]))
for line in lst[:10]:
    print(line)
output(First 10 lines)
Linda F 99689 1947
Linda F 96211 1948
James M 94757 1947
Michael M 92704 1957
Robert M 91640 1947
Linda F 91016 1949
Michael M 90656 1956
Michael M 90517 1958
James M 88584 1948
Michael M 88528 1954
You can specify the sorting criterion by passing a key function to the sort method. An equivalent formulation with reverse=True is sketched below.
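A minimal sketch of that equivalent formulation, assuming the same lst of stripped lines as above: instead of negating the key, sort on the value itself and reverse the order (Python's sort stays stable either way).

# equivalent to key=lambda x: -int(x.split('\t')[2])
lst.sort(key=lambda x: int(x.split('\t')[2]), reverse=True)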
code
sort -nrsk 3 data/popular-names.txt
You can do it with just the sort command. It's easy.
Find the frequency of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
Use collections.Counter.
code
from collections import Counter
code
cnt = Counter()
with open('data/popular-names.txt') as f:
    for line in f:
        name = line.split('\t')[0]
        cnt.update([name])
lst = cnt.most_common()
lst.sort(key=lambda x: (-x[1], x[0]))
for name, num in lst[:10]:
    print(name)
output
James
William
John
Robert
Mary
Charles
Michael
Elizabeth
Joseph
Margaret
You can either pass a whole list to the Counter constructor at once, or feed it in little by little with update(). most_common() returns the elements in descending order of count; see the sketch below for the list-at-once variant.
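A minimal sketch of the list-at-once variant mentioned above (same input file assumed; unlike the version above, ties are not broken alphabetically here):

from collections import Counter

with open('data/popular-names.txt') as f:
    names = [line.split('\t')[0] for line in f]

# Counter accepts any iterable directly
cnt = Counter(names)
for name, num in cnt.most_common(10):
    print(name)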
code
cut -f 1 data/popular-names.txt | sort | uniq -c | sort -nrsk1 | awk '{print $2}'
If you add the -c option to uniq, it counts how many times each line occurs. Finally, sort by that count to get the desired result.
Language processing 100 knocks 2020 Chapter 3: Regular expressions