The other day, 100 Language Processing Knock 2020 was released. I have only been working in natural language processing for about a year and am far from an expert, but I will solve all the problems and publish my solutions in order to improve my technical skills.
Everything is run on Jupyter Notebook, and the restrictions in the problem statements may be bent where convenient. The source code is also on GitHub.
Chapter 1 is here.
The environment is Python 3.8.2 and Ubuntu 18.04.
I think the article here is easier to understand as a commentary. I hope the authors write commentary articles all the way up to Chapter 10.
popular-names.txt is a tab-delimited file that stores the "name", "gender", "number of people", and "year" of babies born in the United States. Create programs that perform the following processing, using popular-names.txt as the input file. Furthermore, perform the same processing with UNIX commands and check the results against your programs.
Please download the required dataset from here.
The downloaded file is assumed to be placed under data/.
Count the number of lines. Use the wc command for confirmation.
code
with open('data/popular-names.txt') as f:
    print(len(list(f)))
output
2780
Just take the length of the file object. Since a file object is an iterator, it has to be converted to a list first. If the input file is large enough, it may not fit in memory, but in that case you can simply iterate over it with a for loop and count the lines, as sketched below.
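A minimal sketch of that memory-friendly variant, counting lines without materializing the whole file as a list (same input file assumed):

count = 0
with open('data/popular-names.txt') as f:
    for _ in f:  # iterate lazily, one line at a time
        count += 1
print(count)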
code
wc -l < data/popular-names.txt
output
2780
Find the number of lines by passing the -l option to the wc command. If you give it a file name, extra information (the file name) is also printed, so feed the file through standard input instead.
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
code
with open('data/popular-names.txt') as f:
    for line in f:
        line = line.strip()
        line = line.replace('\t', ' ')
        print(line)
output(First 10 lines)
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
Ida F 1472 1880
Alice F 1414 1880
Bertha F 1320 1880
Sarah F 1288 1880
Each string obtained by iterating over the file object has a newline character at the end, so remove it with strip() (rstrip('\n') may be preferable). Then just replace the tabs with spaces and print. Alternatively, you can keep the newline and use print(line, end='') instead of stripping, as sketched below.
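A minimal sketch of that alternative, which leaves the original newline in place and suppresses print's own:

with open('data/popular-names.txt') as f:
    for line in f:
        # the line already ends with '\n', so tell print not to add another
        print(line.replace('\t', ' '), end='')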
code
awk '{gsub("\t", " ", $0); print $0}' data/popular-names.txt
perl -pe 's/\t/ /g' data/popular-names.txt
sed 's/\t/ /g' data/popular-names.txt
expand -t 1 data/popular-names.txt
tr '\t' ' ' < data/popular-names.txt
The output is the same as the Python version, so I omit it (the same applies below). There are so many UNIX commands that I can't easily remember them all.
Save only the first column of each line as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
code
with open('data/popular-names.txt') as f, \
        open('result/col1.txt', 'w') as g, \
        open('result/col2.txt', 'w') as h:
    for line in f:
        line = line.strip()
        name, sex, _, _ = line.split('\t')
        print(name, file=g)
        print(sex, file=h)
I wrote it straightforwardly.
result(First 10 lines of col1.txt)
Mary
Anna
Emma
Elizabeth
Minnie
Margaret
Ida
Alice
Bertha
Sarah
result(First 10 lines of col2.txt)
F
F
F
F
F
F
F
F
F
F
code
cut -f 1 data/popular-names.txt > col1.txt
cut -f 2 data/popular-names.txt > col2.txt
It's easy with the cut command. awk '{print $1}' works just as well.
Combine the col1.txt and col2.txt created in problem 12 to create a text file in which the first and second columns of the original file are arranged side by side, tab-delimited. Use the paste command for confirmation.
All you have to do is open the two files, but while we're at it, let's use contextlib.ExitStack so that any number of files can be handled.
For context managers → https://docs.python.org/ja/3/library/stdtypes.html#typecontextmanager
code
from contextlib import ExitStack
code
files = ['result/col1.txt', 'result/col2.txt']
with ExitStack() as stack:
    files = [stack.enter_context(open(filename)) for filename in files]
    for lines in zip(*files):
        x = [line.strip() for line in lines]
        x = '\t'.join(x)
        print(x)
result(First 10 lines)
Mary F
Anna F
Emma F
Elizabeth F
Minnie F
Margaret F
Ida F
Alice F
Bertha F
Sarah F
code
paste result/col1.txt result/col2.txt
It's easy with the paste command.
Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.
Personally, I like to receive standard input and command line arguments with argparse and fileinput, but this time I want all the code to run on Jupyter Notebook, so I don't use command line arguments. (I appreciate the kindness of the "by means such as" in the problem statement.) A sketch of that argparse/fileinput style follows.
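For reference, a minimal sketch of what the argparse + fileinput approach could look like as a standalone script (a hypothetical head.py, not part of the notebook; the option names are my own):

import argparse
import fileinput
from itertools import islice

parser = argparse.ArgumentParser()
parser.add_argument('-n', type=int, default=5)  # number of lines to show
parser.add_argument('files', nargs='*')         # falls back to stdin if empty
args = parser.parse_args()

# fileinput reads the given files in order, or stdin when none are given
with fileinput.input(files=args.files) as f:
    for line in islice(f, args.n):
        print(line, end='')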
code
N = 5
with open('data/popular-names.txt') as f:
    lst = range(N)
    for _, line in zip(lst, f):
        print(line, end='')
output
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
You can do the same with the head command.
code
head -n 5 data/popular-names.txt
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
I think it would be fine to read the whole input into a list and take the last N elements, but a large file might not fit in memory, so I'll use a queue with a fixed maximum length.
code
from collections import deque
code
N = 5
queue = deque([], N)  # keeps only the last N lines
with open('data/popular-names.txt') as f:
    for line in f:
        queue.append(line)
for line in queue:
    print(line, end='')
output
Benjamin M 13381 2018
Elijah M 12886 2018
Lucas M 12585 2018
Mason M 12435 2018
Logan M 12352 2018
You can do the same with the tail command.
code
tail -n 5 data/popular-names.txt
Receive the natural number N by means such as command line arguments, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
This is probably because there are not many situations where you need to split a file into N parts, but depending on the implementation of the split command, splitting into N parts at line boundaries may not be available. It may be a GNU extension.
code(5 divisions)
split -d -nl/5 data/popular-names.txt result/shell5.
result(Check the number of lines with wc)
587 2348 11007 result/shell5.00
554 2216 11010 result/shell5.01
556 2224 11006 result/shell5.02
540 2160 11007 result/shell5.03
543 2172 10996 result/shell5.04
2780 11120 55026 total
I also implemented it in Python so that it behaves the same as this GNU-extension code (https://github.com/coreutils/coreutils/blob/master/src/split.c).
code
def split_string_list(N, lst):
    chunk_size = sum([len(x) for x in lst]) // N
    chunk_ends = [chunk_size * (n + 1) - 1 for n in range(N)]
    i = 0
    acc = 0
    out = []
    for chunk_end in chunk_ends:
        tmp = []
        while acc < chunk_end:
            tmp.append(lst[i])
            acc += len(lst[i])
            i += 1
        out.append(tmp)
    return out

def split_file(N, filepath, outprefix):
    with open(filepath) as f:
        lst = list(f)
    lst = split_string_list(N, lst)
    for i, lines in enumerate(lst):
        idx = str(i).zfill(2)  # Omission
        with open(outprefix + idx, 'w') as f:
            f.write(''.join(lines))

split_file(5, 'data/popular-names.txt', 'result/python5.')
First, the total number of characters is counted and the cutting positions (chunk_ends) are chosen so that the number of characters per chunk is as uniform as possible. Then lines are accumulated until the running total exceeds each element of chunk_ends, and each accumulated chunk is written out to a file.
result(Check the number of lines with wc)
587 2348 11007 result/python5.00
554 2216 11010 result/python5.01
556 2224 11006 result/python5.02
540 2160 11007 result/python5.03
543 2172 10996 result/python5.04
2780 11120 55026 total
result
diff result/python5.00 result/shell5.00
diff result/python5.01 result/shell5.01
diff result/python5.02 result/shell5.02
diff result/python5.03 result/shell5.03
diff result/python5.04 result/shell5.04
I got the same result.
Find the distinct strings in the first column (the set of unique strings). Use the cut, sort, and uniq commands for confirmation.
code
names = set()
with open('data/popular-names.txt') as f:
    for line in f:
        name = line.split('\t')[0]
        names.add(name)
names = sorted(names)
for name in names:
    print(name)
result(First 10 lines)
Abigail
Aiden
Alexander
Alexis
Alice
Amanda
Amelia
Amy
Andrew
Angela
The values in the first column are added to a set one by one, then sorted and printed. (*The Python version is 3.8.2.)
code
cut -f 1 data/popular-names.txt | sort -s | uniq
Take only the first column, sort it, and remove duplicates with uniq. If you forget the sort, the result will be wrong, because uniq only removes adjacent duplicates. The -s option is added to make the sort stable, so that it matches Python's behavior.
Sort the lines in descending order of the numbers in the third column (note: rearrange the lines without changing their contents). Use the sort command for confirmation (this problem's output does not have to exactly match the command's result).
code
with open('data/popular-names.txt') as f:
    lst = [line.strip() for line in f]
lst.sort(key=lambda x: -int(x.split('\t')[2]))
for line in lst[:10]:
    print(line)
output(First 10 lines)
Linda F 99689 1947
Linda F 96211 1948
James M 94757 1947
Michael M 92704 1957
Robert M 91640 1947
Linda F 91016 1949
Michael M 90656 1956
Michael M 90517 1958
James M 88584 1948
Michael M 88528 1954
You can specify the sorting criterion by passing a key function to the sort method. An equivalent formulation with reverse=True is sketched below.
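A minimal sketch of that equivalent formulation, assuming the same lst of stripped lines as above: instead of negating the key, sort on the value itself and reverse the order (Python's sort stays stable either way).

# equivalent to key=lambda x: -int(x.split('\t')[2])
lst.sort(key=lambda x: int(x.split('\t')[2]), reverse=True)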
code
sort -nrsk 3 data/popular-names.txt
You can do it with just the sort command. It's easy.
Find the frequency of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
Use collections.Counter.
code
from collections import Counter
code
cnt = Counter()
with open('data/popular-names.txt') as f:
    for line in f:
        name = line.split('\t')[0]
        cnt.update([name])
lst = cnt.most_common()
lst.sort(key=lambda x: (-x[1], x[0]))
for name, num in lst[:10]:
    print(name)
output
James
William
John
Robert
Mary
Charles
Michael
Elizabeth
Joseph
Margaret
You can either pass a whole list to the Counter constructor at once, or feed it in little by little with update(). most_common() returns the elements in descending order of count; see the sketch below for the list-at-once variant.
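A minimal sketch of the list-at-once variant mentioned above (same input file assumed; unlike the version above, ties are not broken alphabetically here):

from collections import Counter

with open('data/popular-names.txt') as f:
    names = [line.split('\t')[0] for line in f]

# Counter accepts any iterable directly
cnt = Counter(names)
for name, num in cnt.most_common(10):
    print(name)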
code
cut -f 1 data/popular-names.txt | sort | uniq -c | sort -nrsk1 | awk '{print $2}'
If you add the -c option to uniq, it counts how many times each line occurs. Finally, sort by that count to get the desired result.
Language processing 100 knocks 2020 Chapter 3: Regular expressions