[PYTHON] I tried solving the 100 Language Processing Knocks 2020 edition [Chapter 2: UNIX commands 15 to 19]

This is the fourth article in which I solve "[Language Processing 100 Knock 2020 Edition](https://nlp100.github.io/ja/)" in Python (3.7). It is the teaching material for a programming fundamentals study session, part of the training for newcomers, created by the Inui/Okazaki Lab (currently the Inui/Suzuki Lab) at Tohoku University.

I taught myself Python, so there may be mistakes or more efficient ways of doing things. I would appreciate it if you could point out any improvements you find.

For the problems covered this time, I introduce two approaches: one with and one without pandas, a data analysis library. I expect this kind of language processing will often feed into machine learning later on, so pandas, which is also used for machine learning, is probably the more widely applicable choice. In the latter half, doing things without pandas became too troublesome, so I only use pandas.

The UNIX commands were run on macOS, so if you are using Windows or Linux, please look up the method that suits your environment.

The source code is also available on GitHub.

Chapter 2: UNIX Commands

popular-names.txt is a file that stores the "name", "gender", "number of people", and "year" of babies born in the United States in tab-separated format. Create programs that perform the following processing with popular-names.txt as the input file. In addition, run the same processing with UNIX commands and check the results against your programs' output.

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

`15.py`


import sys


if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 15.py 3'.")
    sys.exit()

n = int(sys.argv[1])
file_name = "popular-names.txt"

with open(file_name) as rf:
    lines = rf.readlines()

# Slice out the last n lines and print each one without its trailing newline.
for line in lines[len(lines) - n:len(lines)]:
    print(line.rstrip())

The readlines() method on the file object reads the entire file as a list of lines. To output only the last N lines, it would have been enough to print n items counting from the end of the list, but to keep the output in the same order and form as the UNIX command, I print the slice from len(lines) - n to len(lines) line by line.
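As a side note, the same idea can be written without holding the whole file in a list. The following is only a minimal sketch, not the code used above: the helper name `last_n_lines` is hypothetical, and it relies on `collections.deque` with `maxlen` so that only the last N lines are kept while the file is read.

```python
from collections import deque


def last_n_lines(path, n):
    """Return the last n lines of a text file (hypothetical helper)."""
    with open(path) as rf:
        # A deque with maxlen=n discards the oldest line each time
        # a new one is appended, so at most n lines are kept in memory.
        return [line.rstrip("\n") for line in deque(rf, maxlen=n)]
```

Calling `last_n_lines("popular-names.txt", 3)` should return the same three lines that `tail -n 3 popular-names.txt` prints. Unlike the slice `lines[len(lines) - n:len(lines)]`, this version also returns the whole file when N is larger than the number of lines.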

`15.sh`


#!/bin/bash

if [ $# -ne 1 ]; then
  echo "$# argument(s) were specified." 1>&2
  echo "To execute, specify one number as an argument." 1>&2
  exit 1
fi

diff <(tail -n $1 popular-names.txt) <(python 15.py $1)

The UNIX command for outputting the last N lines is tail. Specify the number of lines with the -n option to get the intended output.

With pandas:

`15pd.py`


import sys
import pandas as pd


if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 15pd.py 3'.")
    sys.exit()

n = int(sys.argv[1])
data_frame = pd.read_table("popular-names.txt", header=None)
print(data_frame.tail(n))

Read the file as a DataFrame object, then pass the number of lines to the tail() method to get the desired rows. Note that print() shows the DataFrame together with its row index, so the output is not character-for-character identical to that of the tail command.

16. Divide the file into N

Receive the natural number N by means such as command line arguments, and divide the input file into N parts at line boundaries. Achieve the same processing with the split command.

**Note: I misread the question and split the file into chunks of N lines each.** (I noticed this while writing the article, so I will fix it once I finish writing.)

Checking the differences with diff was the hardest part here...

`16.py`


import sys


if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 16.py 3'.")
    sys.exit()

n = int(sys.argv[1])
file_name = "popular-names.txt"

with open(file_name) as rf:
    lines = rf.readlines()

file_count = 0
i = 1
output = ""
prefix = "./output/16/py_split_file_"


def write_chunk(text, count):
    # Build a two-letter suffix (aa, ab, ...) like the split command does.
    q, mod = divmod(count, 26)
    suffix = chr(ord('a') + q) + chr(ord('a') + mod)
    with open(prefix + suffix, mode='w') as wf:
        wf.write(text)


for line in lines:
    output += line
    if i <= n - 1:
        i += 1
        continue
    write_chunk(output, file_count)
    file_count += 1
    output = ""
    i = 1

# Write any leftover lines (fewer than n), as split does for its last file.
if output:
    write_chunk(output, file_count)

This also reads the entire file as a list of lines. I then append the lines one by one to a variable created for output, and write it to a file once N lines have been added. Most of the file-output code is spent building file names that match the ones the split command produces...

The shell script that compares the result with that of split is as follows.
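The file-name logic can be isolated into a small function. This is a sketch under my own assumptions rather than part of the original solution: the name `split_suffix` and the `length` parameter are hypothetical, and the function only reproduces the default two-letter naming (aa, ab, ..., zz) described above.

```python
import string


def split_suffix(index, length=2):
    """Return the split-style suffix for the index-th output file.

    index 0 -> 'aa', 1 -> 'ab', ..., 26 -> 'ba' (hypothetical helper).
    """
    letters = string.ascii_lowercase
    if not 0 <= index < len(letters) ** length:
        raise ValueError("index does not fit in the suffix length")
    suffix = ""
    for _ in range(length):
        # Peel off one base-26 digit at a time, least significant first.
        index, remainder = divmod(index, len(letters))
        suffix = letters[remainder] + suffix
    return suffix


print(split_suffix(0), split_suffix(1), split_suffix(26))  # aa ab ba
```

Raising an error for an out-of-range index is a design choice of this sketch. With `chr(ord('a') + q)` alone, a 677th file would get a non-letter character in its name.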

`16.sh`


#!/bin/bash

SH="sh_"
PY="py_"
HEAD="split_file_"

if [ $# -ne 1 ]; then
  echo "$# argument(s) were specified." 1>&2
  echo "To execute, specify one number as an argument." 1>&2
  exit 1
fi

split -l $1 popular-names.txt ./output/16/$SH$HEAD

for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
  for j in a b c d e f g h i j k l m n o p q r s t u v w x y z
  do
    ADDRESS="./output/16/"
    SHFILE=$ADDRESS$SH$HEAD$i$j
    PYFILE=$ADDRESS$PY$HEAD$i$j
    if [ -e $SHFILE -a -e $PYFILE ]; then
      diff $SHFILE $PYFILE
    fi
  done
done

Building the file names takes some effort, but the key point is the split command. The -l option specifies the number of lines per output file.

The part after split only exists to compare the results with diff, so I will not explain it.

With pandas:

`16pd.py`


import sys
import pandas as pd


if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 16pd.py 3'.")
    sys.exit()

n = int(sys.argv[1])
data_frame = pd.read_table("popular-names.txt", header=None)

file_count = 0
i = 0
while i < len(data_frame):
    q, mod = divmod(file_count, 26)
    prefix = "./output/16/py_split_file_"
    suffix_1 = chr(ord('a') + q)
    suffix_2 = chr(ord('a') + mod)
    write_file = "{}{}{}".format(prefix, suffix_1, suffix_2)
    data_frame[i:i+n].to_csv(write_file, sep='\t', index=False, header=False)
    i += n
    file_count += 1

With pandas, I solved the problem by slicing the DataFrame object from row `i` up to (but not including) row `i + n` and writing each slice to a file. You can do the same thing without pandas, so it is worth considering which approach suits you better.

17. Distinct strings in the first column

Find the distinct strings in the first column (the set of different strings). Use the cut, sort, and uniq commands for confirmation.
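As mentioned at the start of this problem, the question actually asks for N parts rather than N-line files. The following is a minimal sketch of that reading, not a corrected version of my solution: the function name `split_into_n` is hypothetical, and giving the extra lines to the first chunks is my own assumption, so the split command may distribute them differently.

```python
def split_into_n(lines, n):
    """Split a list into n consecutive chunks whose sizes differ by at most 1."""
    if n < 1:
        raise ValueError("n must be a natural number")
    size, extra = divmod(len(lines), n)
    chunks = []
    start = 0
    for k in range(n):
        # The first `extra` chunks get one additional line (an assumption).
        end = start + size + (1 if k < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


print(split_into_n(["a", "b", "c", "d", "e"], 2))  # [['a', 'b', 'c'], ['d', 'e']]
```

Each chunk could then be written to its own file in the same way as in the listings above.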

`17.py`


file_name = "popular-names.txt"
with open(file_name) as f:
    lines = f.readlines()
    item1 = list(map(lambda x: x.split()[0], lines))

item1 = list(set(item1))
item1.sort()
print("\n".join(item1))

I read the entire file as a list of lines, used the map function to extract the first column, and converted the result to a set to remove duplicates. Because the confirmation uses the sort command, I turned the set back into a list and sorted it. Looking back, it would have been more appropriate to pass the result of map straight to set rather than converting it to a list first...
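The improvement mentioned in hindsight could look like the sketch below. The function name `unique_first_column` is hypothetical and the sample rows are made up in the same layout as popular-names.txt. It builds the set directly and sorts it in one step.

```python
def unique_first_column(lines):
    """Return the sorted distinct values of the first tab-separated column."""
    # A set comprehension removes duplicates while extracting the column.
    return sorted({line.split("\t")[0] for line in lines})


# Made-up rows in the name/gender/count/year layout.
sample = ["Mary\tF\t10\t1880\n", "Anna\tF\t5\t1880\n", "Mary\tF\t8\t1881\n"]
print("\n".join(unique_first_column(sample)))
```

For the real file, the lines returned by `readlines()` can be passed in as they are.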

The shell script is as follows.

`17.sh`


#!/bin/bash

diff <(cut -f 1 -d $'\t' popular-names.txt | sort | uniq) <(python 17.py)

The -f option of the cut command specifies which field (column) to extract, and the -d option sets the delimiter to the tab character. The extracted first column is passed to the sort command, sorted alphabetically, and then deduplicated with the `uniq` command.

With pandas plus the numpy library, this becomes easy.

`17pd.py`


import pandas as pd
import numpy as np

data_frame = pd.read_table("popular-names.txt", header=None)
print("\n".join(np.sort(data_frame[0].unique())))

Read the data with pandas and remove the duplicates with the `unique()` method. Sorting with numpy and then calling `join` with newlines solves the problem in a single line.

18. Sort the rows in descending order of the numbers in the third column

Arrange the rows in descending order of the numbers in the third column (note: rearrange the rows without changing their contents). Use the sort command for confirmation (for this problem, the result does not have to match the output of the command).

From here on, I only use pandas.

`18.py`


import pandas as pd


data_frame = pd.read_table("popular-names.txt", header=None)
print(data_frame.sort_values(2, ascending=False))

This is easy with the sort_values() method. The first argument specifies the column to sort by. The second argument, `ascending`, chooses ascending or descending order; the default is `True` (ascending), so here it is set to `False` for descending order.
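For comparison, the same ordering can be sketched without pandas using the built-in `sorted()`. This is not part of my solution. The function name `sort_by_third_column` is hypothetical and the sample rows are made up.

```python
def sort_by_third_column(lines):
    """Sort tab-separated lines by the numeric third column, largest first."""
    # int() makes the comparison numeric rather than alphabetical.
    return sorted(lines, key=lambda line: int(line.split("\t")[2]), reverse=True)


# Made-up rows in the name/gender/count/year layout.
sample = ["Anna\tF\t20\t1880", "Mary\tF\t70\t1880", "John\tM\t50\t1880"]
print("\n".join(sort_by_third_column(sample)))
```

Rows with equal counts may come out in a different order than with pandas or the sort command, which is consistent with the problem statement not requiring an exact match.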

The sort command is used as follows.

`18.sh`


#!/bin/bash

sort -t $'\t' -k 3 -n -r popular-names.txt

The -t option specifies the delimiter, the -k option specifies the column to sort by, the -n option sorts numerically, and the -r option sorts in descending order.

19. Find the frequency of each string in the first column and arrange them in descending order of frequency

Find the frequency of occurrence of the string in the first column of each line, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.

`19.py`


import pandas as pd

data_frame = pd.read_table("popular-names.txt", header=None)
data_frame_sort = data_frame[0].value_counts()
print(pd.Series(data_frame_sort.index.values, data_frame_sort.values))

The problem is solved by calling the value_counts() method on the first column of the DataFrame (a Series). value_counts() returns a Series whose `index` holds the unique values and whose data holds their frequencies, sorted in descending order of frequency by default. For easier viewing, I swapped the values and their frequencies when printing.

The shell script is as follows.
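A pandas-free counterpart can be sketched with `collections.Counter` from the standard library. This is only an illustration: the function name `first_column_counts` is hypothetical and the sample rows are made up.

```python
from collections import Counter


def first_column_counts(lines):
    """Return (name, count) pairs, most frequent first."""
    counts = Counter(line.split("\t")[0] for line in lines)
    # most_common() with no argument returns every pair, sorted by count.
    return counts.most_common()


# Made-up rows in the name/gender/count/year layout.
sample = ["Mary\tF\t10\t1880", "Anna\tF\t5\t1880", "Mary\tF\t8\t1881"]
for name, count in first_column_counts(sample):
    print(count, name)
```

Printing the count before the name mirrors the column order of `uniq -c`.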

`19.sh`


#!/bin/bash

cut -f 1 -d $'\t' popular-names.txt | sort | uniq -c | sort -k 1 -n -r

Everything up to the first sort is the same as in 17.sh, but here the `uniq` command is given the -c option to count how many times each line occurs. The `sort` command is then used again, with `-k` selecting the first column (the counts), to sort numerically in descending order.
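The reason sort has to come before `uniq -c` is that uniq only collapses adjacent duplicate lines. `itertools.groupby` behaves the same way, so it makes a convenient sketch of the idea. The helper name `uniq_c` is hypothetical and the names are made-up sample data.

```python
from itertools import groupby


def uniq_c(items):
    """Count runs of identical adjacent items, like `uniq -c`."""
    return [(len(list(group)), key) for key, group in groupby(items)]


names = ["Mary", "Anna", "Mary"]  # made-up sample
print(uniq_c(names))          # [(1, 'Mary'), (1, 'Anna'), (1, 'Mary')]
print(uniq_c(sorted(names)))  # [(1, 'Anna'), (2, 'Mary')]
```

Without sorting first, the two occurrences of "Mary" are counted separately because they are not next to each other.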

Summary

In this article, I solved problems 15 to 19 of Chapter 2: UNIX Commands from the 100 Language Processing Knocks 2020 edition.

Although I struggled with the UNIX commands, I also benefited from becoming more familiar with pandas. I learned that many of the things I want to do become easy just by using a library, so I would like to use libraries more and more in the future.

I'm still inexperienced, so if you have a better answer, please let me know! Thank you.

Continued

- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]

Until last time

- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]
- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]