[PYTHON] 100 Language Processing Knock 2020 Chapter 2: UNIX Commands

The other day, 100 Language Processing Knock 2020 was released. I've only been doing natural language processing for a year and don't know the finer details, but I'll work through all the problems and publish my solutions to improve my technical skills.

Everything runs on Jupyter Notebook, and I may conveniently bend the constraints of the problem statements. The source code is also on GitHub.

Chapter 1 is here.

The environment is Python 3.8.2 and Ubuntu 18.04.

I think the page here is easier to follow as a commentary. I hope the authors write commentary articles all the way through Chapter 10.

Chapter 2: UNIX Commands

popular-names.txt is a file that stores the "name", "gender", "number of babies", and "year" of babies born in the United States, in tab-delimited format. Write programs that perform the following kinds of processing, and run them with popular-names.txt as the input file. Then run the same processing with UNIX commands and check each program's results.

Please download the required dataset from here.

The downloaded file is placed under data/.

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

code


with open('data/popular-names.txt') as f:
    print(len(list(f)))

output


2780

Just take the length of the file object. Since a file object is an iterator, it has to be turned into a list first. If the input file is large enough, it may not fit in memory, but in that case you can simply iterate over it with a for loop and count, as in the sketch below.
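A minimal sketch of that memory-friendly variant (same file path as above):

count = 0
with open('data/popular-names.txt') as f:
    for _ in f:  # consume the iterator one line at a time
        count += 1
print(count)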

code


wc -l < data/popular-names.txt

output


2780

Get the line count by giving the l option to the wc command. If you pass a file name, the file name is printed alongside the count, so feed the file through standard input instead.

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

code


with open('data/popular-names.txt') as f:
    for line in f:
        line = line.strip()
        line = line.replace('\t', ' ')
        print(line)

output (first 10 lines)


Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880
Ida F 1472 1880
Alice F 1414 1880
Bertha F 1320 1880
Sarah F 1288 1880

Each string yielded by iterating over the file object has a newline character at the end, so remove it with strip() (rstrip('\n') may be preferable). Then just replace the tabs with spaces and print.

There is also the option of writing print(line, end='') without removing the newline character, as in the sketch below.
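A minimal sketch of that variant, which keeps each line's own newline and suppresses print's:

with open('data/popular-names.txt') as f:
    for line in f:
        print(line.replace('\t', ' '), end='')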

code


awk '{gsub("\t", " ", $0); print $0}' data/popular-names.txt
perl -pe 's/\t/ /g' data/popular-names.txt
sed 's/\t/ /g'  data/popular-names.txt
expand -t 1 data/popular-names.txt
tr '\t' ' ' < data/popular-names.txt

The output is the same as the Python version, so I omit it (likewise for the rest of this article).

There are so many UNIX commands that I have trouble remembering them all.

12. Save the first column in col1.txt and the second column in col2.txt

Save a version of the file with only the first column of each line extracted as col1.txt, and a version with only the second column extracted as col2.txt. Use the cut command for confirmation.

code


with open('data/popular-names.txt') as f, \
        open('result/col1.txt', 'w') as g, \
        open('result/col2.txt', 'w') as h:
    for line in f:
        line = line.strip()
        name, sex, _, _ = line.split('\t')  # first column is the name, second is the gender
        print(name, file=g)
        print(sex, file=h)

I wrote it straightforwardly.

result (first 10 lines of col1.txt)


Mary
Anna
Emma
Elizabeth
Minnie
Margaret
Ida
Alice
Bertha
Sarah

result (first 10 lines of col2.txt)


F
F
F
F
F
F
F
F
F
F

code


cut -f 1 data/popular-names.txt > col1.txt
cut -f 2 data/popular-names.txt > col2.txt

It's easy with the cut command. `awk '{print $1}'` would also be fine.

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

All you have to do is open the two files, but while I'm at it, I'll use contextlib.ExitStack so that any number of files can be handled. For context managers, see → https://docs.python.org/ja/3/library/stdtypes.html#typecontextmanager

code


from contextlib import ExitStack

code


files = ['result/col1.txt', 'result/col2.txt']
with ExitStack() as stack:
    files = [stack.enter_context(open(filename)) for filename in files]
    for lines in zip(*files):
        x = [line.strip() for line in lines]
        x = '\t'.join(x)
        print(x)

result (first 10 lines)


Mary	F
Anna	F
Emma	F
Elizabeth	F
Minnie	F
Margaret	F
Ida	F
Alice	F
Bertha	F
Sarah	F

code


paste result/col1.txt result/col2.txt

It's easy with the paste command.

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.

Personally, I like to receive standard input and command-line arguments with `argparse` and `fileinput`, but this time I want all the code to run on Jupyter Notebook, so I don't use command-line arguments. (I'm grateful for the kindness of the "by means such as" in the problem statement.)
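For reference, a hypothetical script-style sketch of that argparse/fileinput approach (not what this article uses; the -n option name and its default are my own):

import argparse
import fileinput

# Hypothetical CLI: read N from an option, lines from stdin or file arguments.
parser = argparse.ArgumentParser()
parser.add_argument('-n', type=int, default=5)
parser.add_argument('files', nargs='*')
args = parser.parse_args()

for i, line in enumerate(fileinput.input(args.files)):
    if i >= args.n:
        break
    print(line, end='')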

code


N = 5
with open('data/popular-names.txt') as f:
    for _, line in zip(range(N), f):  # zip stops after N lines
        print(line, end='')

output


Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880

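As an aside, itertools.islice can express the same "first N lines" directly; a minimal sketch:

from itertools import islice

N = 5
with open('data/popular-names.txt') as f:
    for line in islice(f, N):  # yields at most N lines
        print(line, end='')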
You can do the same with the head command.

code


head -n 5 data/popular-names.txt

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

I think it would also be fine to read all of the input into a list and take the last 5 elements, but with a large file that may not fit in memory, so I'll use a queue.

code


from collections import deque

code


N = 5
queue = deque(maxlen=N)  # once N lines are held, old lines fall off the front
with open('data/popular-names.txt') as f:
    for line in f:
        queue.append(line)
for line in queue:
    print(line, end='')

output


Benjamin	M	13381	2018
Elijah	M	12886	2018
Lucas	M	12585	2018
Mason	M	12435	2018
Logan	M	12352	2018

You can do the same with the tail command.

code


tail -n 5 data/popular-names.txt

16. Divide the file into N

Receive the natural number N by means such as a command line argument, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.

I think this is because there aren't many situations where you want to split a file into N parts at line boundaries; depending on the implementation of the split command, an N-way split at line boundaries may not be available. It may be a GNU extension.

code (split into 5)


split -d -nl/5 data/popular-names.txt result/shell5.

result (checking the line counts with wc)


  587  2348 11007 result/shell5.00
  554  2216 11010 result/shell5.01
  556  2224 11006 result/shell5.02
  540  2160 11007 result/shell5.03
  543  2172 10996 result/shell5.04
 2780 11120 55026 total

I also implemented it in Python so that it behaves the same as this GNU-extension code (https://github.com/coreutils/coreutils/blob/master/src/split.c).

code


def split_string_list(N, lst):
    # Choose cut positions (in cumulative character count) so that the
    # chunks are as close to equal-sized in characters as possible.
    chunk_size = sum(len(x) for x in lst) // N
    chunk_ends = [chunk_size * (n + 1) - 1 for n in range(N)]

    i = 0
    acc = 0  # cumulative character count consumed so far
    out = []
    for chunk_end in chunk_ends:
        tmp = []
        while acc < chunk_end:
            tmp.append(lst[i])
            acc += len(lst[i])
            i += 1
        out.append(tmp)
    out[-1].extend(lst[i:])  # fold any lines left over by rounding into the last chunk
    return out

def split_file(N, filepath, outprefix):
    with open(filepath) as f:
        lst = list(f)
    lst = split_string_list(N, lst)
    for i, lines in enumerate(lst):
        idx = str(i).zfill(2)  # cutting corners: fixed two-digit suffix
        with open(outprefix + idx, 'w') as f:
            f.write(''.join(lines))

split_file(5, 'data/popular-names.txt', 'result/python5.')

First, count the total number of characters and set the cut positions (chunk_ends) so that the chunks have as uniform a character count as possible. Then keep taking lines until the running total passes each element of chunk_ends, and when it does, write the chunk out to a file.
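For intuition, a toy run of the helper on made-up data (not the actual file):

lst = ['aa\n', 'b\n', 'cc\n', 'd\n']  # lengths 3, 2, 3, 2 -> 10 characters total
print(split_string_list(2, lst))
# chunk_size = 5, chunk_ends = [4, 9]
# -> [['aa\n', 'b\n'], ['cc\n', 'd\n']]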

result (checking the line counts with wc)


  587  2348 11007 result/python5.00
  554  2216 11010 result/python5.01
  556  2224 11006 result/python5.02
  540  2160 11007 result/python5.03
  543  2172 10996 result/python5.04
 2780 11120 55026 total

result


diff result/python5.00 result/shell5.00
diff result/python5.01 result/shell5.01
diff result/python5.02 result/shell5.02
diff result/python5.03 result/shell5.03
diff result/python5.04 result/shell5.04

I got the same result.

17. Distinct strings in the first column

Find the set of distinct strings (the vocabulary) in the first column. Use the cut, sort, and uniq commands for confirmation.

code


names = set()
with open('data/popular-names.txt') as f:
    for line in f:
        name = line.split('\t')[0]
        names.add(name)
names = sorted(names)

for name in names:
    print(name)

result (first 10 lines)


Abigail
Aiden
Alexander
Alexis
Alice
Amanda
Amelia
Amy
Andrew
Angela

Each line's first column is added to a set, which is then sorted and printed. (The Python version is 3.8.2.)

code


cut -f 1 data/popular-names.txt | sort -s | uniq 

Take only the first column, then sort and remove the duplicates. Since uniq only removes adjacent duplicates, the result will be wrong if you forget the sort. The s option is added to make the sort stable, matching Python.

18. Sort each row in descending order of the numbers in the third column

Sort the lines in descending order of the numbers in the third column (note: keep the contents of each line unchanged). Use the sort command for confirmation (this problem's output does not have to exactly match the command's).

code


with open('data/popular-names.txt') as f:
    lst = [line.strip() for line in f]
lst.sort(key=lambda x: -int(x.split('\t')[2]))

for line in lst[:10]:
    print(line)

output (first 10 lines)


Linda	F	99689	1947
Linda	F	96211	1948
James	M	94757	1947
Michael	M	92704	1957
Robert	M	91640	1947
Linda	F	91016	1949
Michael	M	90656	1956
Michael	M	90517	1958
James	M	88584	1948
Michael	M	88528	1954

You can control the sorting criterion through the sort method's key argument.
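An equivalent variant is reverse=True, which avoids negating the key; both are stable, so ties keep their original order:

lst.sort(key=lambda x: int(x.split('\t')[2]), reverse=True)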

code


sort -nrsk 3 data/popular-names.txt

You can do it only with the sort command. It's easy.

19. Find the frequency of appearance of the character string in the first column of each line, and arrange them in descending order of frequency of appearance.

Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.

Use collections.Counter.

code


from collections import Counter

code


cnt = Counter()
with open('data/popular-names.txt') as f:
    for line in f:
        name = line.split('\t')[0]
        cnt.update([name])
        
lst = cnt.most_common()
lst.sort(key=lambda x:(-x[1], x[0]))

for name, num in lst[:10]:
    print(name)

output


James
William
John
Robert
Mary
Charles
Michael
Elizabeth
Joseph
Margaret

You can either pass the whole list to the Counter at once, or feed it little by little with `update()`. `most_common()` arranges the items in descending order of count. A small sketch of both ways follows.
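A minimal sketch on toy data (not the actual file), showing that the two feeding styles agree:

from collections import Counter

cnt = Counter(['Mary', 'Anna', 'Mary'])  # pass the whole list at once
cnt2 = Counter()
for name in ['Mary', 'Anna', 'Mary']:
    cnt2.update([name])                  # or feed it little by little
assert cnt == cnt2
print(cnt.most_common())                 # [('Mary', 2), ('Anna', 1)]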

code


cut -f 1 data/popular-names.txt | sort | uniq -c | sort -nrsk1 | awk '{print $2}'

Adding the -c option to uniq counts how many of each item there are. Finally, sorting by that count and printing the second field with awk gives the desired result.

Next is Chapter 3

Language processing 100 knocks 2020 Chapter 3: Regular expressions
