This is the third article in my series of solving "[Language Processing 100 Knock 2020 Edition](https://nlp100.github.io/ja/)" in Python (3.7). The 100 Knock is training material for newcomers created by the Inui/Okazaki Lab (now the Inui/Suzuki Lab) at Tohoku University.
Since I studied Python on my own, there may be mistakes or more efficient approaches. I would appreciate it if you could point out any improvements you find.
For the problems covered this time, I will show two patterns for each: one without and one with the data analysis library pandas. Since these text-processing tasks will often feed into machine learning later, the pandas versions, which use a library that is also common in machine learning work, may be the more broadly useful ones.
My environment for running the UNIX commands is macOS, so if you are on Windows or Linux, please look up the equivalent method for your environment as needed.
The source code is also available on GitHub.
popular-names.txt is a file that stores the "name", "sex", "number of births", and "year" of babies born in the United States, in tab-separated format. Create programs that perform the following processing, using popular-names.txt as the input file. Then run the same processing with UNIX commands and check each program's output against them.
Count the number of lines. Use the wc command for confirmation.
10.py
file_name = "popular-names.txt"
with open(file_name) as f:
    lines = f.readlines()
print(len(lines))
To read a file, you normally open a file object with `open()` and must close it with the `close()` method. If you use a `with` block as above, however, the file is closed automatically at the end of the block, so there is no risk of forgetting to close it.
The `readlines()` method on the file object reads the entire file as a list, one line per element, and the length of that list is the line count.
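For large files, `readlines()` builds the whole list in memory; iterating over the file object instead counts lines lazily. A minimal sketch, using an in-memory buffer as a stand-in for popular-names.txt:

```python
import io

# io.StringIO stands in for the opened file; iterating over a file
# object yields one line at a time, so the count needs no list.
sample = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n")
line_count = sum(1 for _ in sample)
print(line_count)  # 3
```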
Next, let's check the line count with `wc`.
10.sh
#!/bin/sh
wc popular-names.txt
The `wc <filename>` command-line tool displays the number of lines, words, and bytes of the specified text file. If the first number shown, the line count, matches your Python program's output, you are good.
Using pandas you can write:
10pd.py
import pandas as pd
file_name = "popular-names.txt"
data_frame = pd.read_table(file_name, header=None)
print(len(data_frame))
In pandas, `read_table()` reads a tab-delimited text file into a DataFrame. Here the actual data starts on the first line of the file and there is no header row describing the columns, so the `header=None` option is added.
After reading, the number of rows is simply the length of the DataFrame.
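A small sketch of the row count, with inline data standing in for popular-names.txt:

```python
import io

import pandas as pd

# Inline tab-separated data in place of the real file.
tsv = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"
df = pd.read_table(io.StringIO(tsv), header=None)

# len(df) counts rows; df.shape gives (rows, columns) in one go.
print(len(df), df.shape)
```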
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
11.py
file_name = "popular-names.txt"
with open(file_name) as f:
    text = f.read()
print(text.replace("\t", " "))
This time the file object is read with the `read()` method, which reads the entire document as a single string.
After reading, the `replace()` method converts each tab character to a space, and the result is printed. `replace()` takes the string to be replaced as its first argument and the replacement string as its second.
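The same conversion can also be done line by line, so the whole file never has to fit in memory. A sketch, with an in-memory buffer standing in for the file:

```python
import io

# io.StringIO stands in for the opened file; each line is converted
# as it is read instead of loading the full text first.
source = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n")
converted = [line.replace("\t", " ") for line in source]
print("".join(converted), end="")
```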
Next is confirmation by UNIX command.
11.sh
#!/bin/sh
cat ./popular-names.txt | sed s/$'\t'/' '/g
echo "---"
cat ./popular-names.txt | tr '\t' ' '
echo "---"
expand -t 1 popular-names.txt
We check the result in three ways.
`cat` + `sed`
The first is a combination of `cat` and `sed`. The `|` before `sed` passes the output of the preceding command to the following command.
`cat` is a command that outputs the contents of a file; by printing the file, its contents are piped to `sed`.
`sed` is a command that performs various processing on strings. Here a substitution expression is given after the `sed` command to do the replacement.
The substitution is written as `s/before/after/g`; if the trailing `g` is removed, only the first match on each line is replaced. The part I struggled with most may have been discovering that a tab is written as `$'\t'` on the Mac...
`cat` + `tr`
The `tr` command converts or deletes characters in the input string and outputs the result. It can do some things that `sed` cannot.
For the text passed in from `cat`, it replaces each character of the first argument with the corresponding character of the second. Here it is `tr '\t' ' '`, so every tab character in the text is converted to a space.
`expand`
`expand` is a command with an even more specific purpose: converting tab characters to spaces. The `-t` option sets the tab-stop width (here `1`, so each tab becomes a single space), and the file to convert is given as an argument. Simple and easy to understand!
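Incidentally, Python's built-in `str.expandtabs()` behaves like `expand`: it pads to tab stops rather than doing a plain substitution, and with a tab size of 1 each tab collapses to exactly one space, matching `expand -t 1`:

```python
# expandtabs(1): each tab is replaced by spaces up to the next
# multiple of the tab size, which for tabsize=1 is a single space.
line = "Mary\tF\t7065\t1880"
print(line.expandtabs(1))  # Mary F 7065 1880
```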
When using pandas, it looks like this:
11pd.py
import pandas as pd
file_name = "popular-names.txt"
data_frame = pd.read_table(file_name, header=None)
data_frame.to_csv("output/11pd_ans.txt", sep=" ", index=False, header=None)
The result is written to a file with the `to_csv()` method. Specifying a space as the delimiter with `sep=" "` performs the tab-to-space conversion on output. To keep the output close to the input file, the index and header are unnecessary, so `index=False` and `header=None` are specified so that neither is included in the file.
Save a version of the file with only the first column of each line extracted as col1.txt, and a version with only the second column extracted as col2.txt. Use the cut command for confirmation.
12.py
file_name = "popular-names.txt"
output_file1 = "output/col1.txt"
output_file2 = "output/col2.txt"
output_files = [output_file1, output_file2]
col1 = []
col2 = []
with open(file_name) as rf:
    for line in rf:
        item1 = line.split()[0]
        item2 = line.split()[1]
        col1.append(item1)
        col2.append(item2)
cols = [col1, col2]
for output_file, col in zip(output_files, cols):
    with open(output_file, mode='w') as wf:
        wf.write("\n".join(col) + "\n")
Prepare the lists `col1` and `col2` to record the two columns, read the file line by line, and record only the first and second fields of each line.
The columns are then gathered into a list so both files can be written in a single loop.
Finally, the `zip` function pairs each output file name with its column, and each column is written out as one file.
On output, the elements are joined with newlines; the trailing `+ "\n"` prevents a spurious difference when the output is diffed against the command's result in the shell script.
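Rather than calling `split()` twice per line as above, each line can be split once and unpacked. A sketch, with an inline list standing in for the file's lines:

```python
# Inline lines in place of the real file; split once per line and
# unpack, keeping only the first two fields.
lines = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n"]
col1, col2 = [], []
for line in lines:
    name, sex, *_rest = line.split("\t")
    col1.append(name)
    col2.append(sex)
print(col1, col2)
```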
The confirmation command was executed as follows.
12.sh
#!/bin/bash
diff output/col1.txt <(cut -f 1 -d $'\t' popular-names.txt)
diff output/col2.txt <(cut -f 2 -d $'\t' popular-names.txt)
Since an exact comparison seemed possible, I used the `diff` command to check for differences.
The basic usage of `diff` is
`$ diff file1 file2`
but this time I used process substitution, `<(command)`, so that a file could be compared directly against a command's output.
The main subject is `cut`, a command that extracts columns from a file's contents. Use `-f` to specify the field number and `-d` to specify the field delimiter. Here again, the delimiter `$'\t'` specifies the tab character.
If you use pandas, you can write more concisely.
12pd.py
import pandas as pd
data_frame = pd.read_table("popular-names.txt", header=None)
data_frame[0].to_csv("output/col1pd.txt", index=False, header=None)
data_frame[1].to_csv("output/col2pd.txt", index=False, header=None)
All you have to do is read the whole tab-delimited file, select a column of the resulting DataFrame object, and write it out to a file.
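A sketch of the column selection, with inline data standing in for the file: with `header=None` the columns are labeled 0, 1, 2, ..., so `data_frame[0]` selects the first column as a Series.

```python
import io

import pandas as pd

# Inline tab-separated data in place of popular-names.txt;
# integer column labels come from header=None.
tsv = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"
df = pd.read_table(io.StringIO(tsv), header=None)
print(df[0].tolist())  # ['Mary', 'Anna']
```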
Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
13.py
col1_file = "output/col1.txt"
col2_file = "output/col2.txt"
cols_file = "output/cols.txt"
col_files = [col1_file, col2_file]
cols = []
for file_name in col_files:
    with open(file_name) as rf:
        cols.append(rf.readlines())
output = ""
for col1, col2 in zip(cols[0], cols[1]):
    output += col1.rstrip() + "\t" + col2.rstrip() + "\n"
with open(cols_file, mode='w') as wf:
    wf.write(output)
This is handled by combining the techniques used so far.
By reading each of the column files created in 12 line by line and collecting them in a list, the `zip` function can process the matching lines of both files at the same time.
In the loop, each fetched line has its trailing newline removed with `rstrip()`, the two fields are joined with a tab, and a newline is appended before adding the result to the output string.
Finally, the string is written to a file and the processing is complete.
13.sh
#!/bin/bash
diff output/cols.txt <(paste output/col1.txt output/col2.txt)
To combine files with UNIX commands, use the `paste` command.
Given file names as arguments, it joins the files column-wise.
The joining character can be specified with the `-d` option, but since files are joined with a tab by default, it is not specified this time.
Incidentally, to concatenate files row-wise, use the `cat` command.
In pandas, you can make it as follows.
13pd.py
import pandas as pd
c1 = pd.read_table("output/col1pd.txt", header=None)
c2 = pd.read_table("output/col2pd.txt", header=None)
data_frame = pd.concat([c1, c2], axis=1)
data_frame.to_csv("output/colspd.txt", sep='\t', index=False, header=None)
This, too, is much shorter than writing it by hand.
The `concat()` function combines the pieces.
Pass the columns you want to join as a list, in order, and specify the joining direction with the `axis` option. `axis` defaults to `0`, the row direction, so `1` is specified here for the column direction.
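A sketch of the difference the `axis` option makes, using two tiny single-column frames:

```python
import pandas as pd

# axis=1 places the frames side by side; the default axis=0
# stacks them vertically instead.
c1 = pd.DataFrame(["Mary", "Anna"])
c2 = pd.DataFrame(["F", "F"])
side_by_side = pd.concat([c1, c2], axis=1)
stacked = pd.concat([c1, c2], axis=0)
print(side_by_side.shape, stacked.shape)  # (2, 2) (4, 1)
```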
Receive a natural number N by some means such as a command-line argument, and display only the first N lines of the input. Use the head command for confirmation.
14.py
import sys
if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 14.py 3'.")
    sys.exit()
n = int(sys.argv[1])
file_name = "popular-names.txt"
with open(file_name) as rf:
    for i in range(n):
        print(rf.readline().rstrip())
The `sys` module is imported in order to use command-line arguments, which are available as the list `sys.argv`.
If the number of arguments is not as intended, a usage message is printed and the program exits.
This time the processing is completed while reading the file: since there is no need to read all of it, the `readline()` method fetches one line at a time from the file object, and lines are printed from the top, one per iteration, for the number given on the command line.
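As an alternative to the counting loop, `itertools.islice` yields at most the first N lines and stops cleanly even if the input has fewer lines than requested. A sketch with an in-memory buffer in place of the file:

```python
import io
from itertools import islice

# io.StringIO stands in for the opened file; islice(f, n) takes
# the first n lines without an explicit counter.
source = io.StringIO("Mary\nAnna\nEmma\nElizabeth\n")
n = 2
head_lines = [line.rstrip() for line in islice(source, n)]
print(head_lines)  # ['Mary', 'Anna']
```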
The UNIX command check looks like this:
14.sh
#!/bin/bash
if [ $# -ne 1 ]; then
    echo "The number of arguments given is $#." 1>&2
    echo "To execute, specify one number as an argument." 1>&2
    exit 1
fi
diff <(head -n $1 popular-names.txt) <(python 14.py $1)
The shell script takes an argument, prints a usage message if the argument count is not as specified, and then examines the difference between `head`, which displays the given number of lines from the top of the file, and the Python program's output.
The `head` command prints the number of lines specified by the `-n` option, starting from the first line.
In pandas, it looks like this:
14pd.py
import sys
import pandas as pd
if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 14.py 3'.")
    sys.exit()
n = int(sys.argv[1])
data_frame = pd.read_table("popular-names.txt", header=None)
print(data_frame.head(n))
You can solve the problem simply by specifying the number of lines with the `head()` method.
In this article, I worked through problem numbers 10-14 of Language Processing 100 Knock 2020 Edition, Chapter 2: UNIX commands.
To be honest, I had a harder time with the UNIX commands than with the Python programming... That said, the commands seem to run faster than the equivalent Python, so for large-scale data such as that used in machine learning, it may be more efficient to learn the commands.
I'm still inexperienced, so if you have a better answer, please let me know! Thank you.
- I tried to solve Language Processing 100 Knock 2020 Edition [Chapter 2: UNIX commands 15-19]
- Language Processing 100 Knock 2020 Edition [Chapter 3: Regular expressions 20-24]
- I tried to solve Language Processing 100 Knock 2020 Edition [Chapter 1: Preparatory movement 00-04]
- I tried to solve Language Processing 100 Knock 2020 Edition [Chapter 1: Preparatory movement 05-09]