This is the fourth article in a series working through "[Language Processing 100 Knock 2020 Edition](https://nlp100.github.io/ja/)" in Python (3.7). The exercises are teaching material for the programming basics study session, part of the newcomer training created by the Tohoku University Inui/Okazaki Lab (currently the Inui/Suzuki Lab).
I taught myself Python, so there may be mistakes or more efficient ways to do things. I would appreciate it if you pointed out any improvements you find.
For the problems covered this time, I introduce two versions: one using the data analysis library pandas and one without it. Since this kind of language processing will often feed into machine learning, I think it is worth getting used to pandas, which is widely used there as well. In the latter half, doing things without pandas became too tedious, so I use only pandas.
The UNIX commands were run on macOS, so if you are using Windows or Linux, please look up the equivalent for your environment as appropriate.
The source code is also available on GitHub.
popular-names.txt is a tab-separated file that stores the "name", "gender", "number of people", and "year" of babies born in the United States. Create programs that perform the following processing, using popular-names.txt as the input file. Also run the same processing with UNIX commands and check the programs' results against them.
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
15.py
import sys
# Require exactly one argument: the number of lines N.
if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 15.py 3'.")
    sys.exit()
n = int(sys.argv[1])
file_name = "popular-names.txt"
with open(file_name) as rf:
    lines = rf.readlines()
# Print the last n lines without their trailing newline characters.
for i in lines[len(lines) - n:len(lines)]:
    print(i.rstrip())
The `readlines()` method of the file object reads the entire file as a list of lines.
To output only the last N lines, it would be enough to take n items from the end of the list, but here, to match the output format of the UNIX command, the slice of the list from `len(lines) - n` to `len(lines)` is printed line by line.
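Reading the whole file with `readlines()` is fine for a file of this size. As a minimal sketch of an alternative that keeps only the last N lines in memory, `collections.deque` with `maxlen` can be used. The helper name `tail_lines` and the in-memory sample are assumptions made for illustration; they are not part of the original 15.py.

```python
from collections import deque
import io


def tail_lines(stream, n):
    """Return the last n lines of a text stream, without trailing newlines."""
    # A deque with maxlen=n discards the oldest entry once it is full,
    # so at most n lines are held in memory at any time.
    return [line.rstrip("\n") for line in deque(stream, maxlen=n)]


# Hypothetical sample in the same tab-separated layout as popular-names.txt.
sample = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n")
for line in tail_lines(sample, 2):
    print(line)
```

Passing an open file object instead of the `io.StringIO` sample should work the same way, since both are iterated line by line.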
15.sh
#!/bin/bash
if [ $# -ne 1 ]; then
    echo "$# argument(s) were specified." 1>&2
    echo "To execute, specify one number as an argument." 1>&2
    exit 1
fi
diff <(tail -n "$1" popular-names.txt) <(python 15.py "$1")
The UNIX command that outputs the trailing n lines is `tail`.
Specifying the number of lines with the `-n` option gives the intended output.
In pandas
15pd.py
import sys
import pandas as pd
if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 15pd.py 3'.")
    sys.exit()
n = int(sys.argv[1])
# The file has no header row, so columns are numbered 0 to 3.
data_frame = pd.read_table("popular-names.txt", header=None)
print(data_frame.tail(n))
Read the file as a DataFrame object, then pass the number of lines to the `tail()` method to get the last N rows.
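Note that printing a DataFrame also shows the index and aligned columns, so the text is not identical to what the `tail` command prints. A minimal sketch, assuming output in the same form as the input is wanted, is to write the rows back out as tab-separated text with `to_csv`. The in-memory sample below stands in for popular-names.txt and is an assumption for illustration.

```python
import io

import pandas as pd

# Hypothetical sample in the same tab-separated layout as popular-names.txt.
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
data_frame = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# With no path given, to_csv returns the text instead of writing a file.
tail_text = data_frame.tail(2).to_csv(sep="\t", header=False, index=False)
print(tail_text, end="")
```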
Receive the natural number N by means such as command line arguments, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.
**Note: I misread the problem and split the file every N lines instead.** (I noticed this while writing the article, so I will fix it once the article is finished.)
Checking the differences with `diff` was the hardest part of this problem...
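For reference, the "split into N parts" reading of the problem could be sketched as follows. This is only one possible interpretation, assuming the parts should be consecutive and differ in size by at most one line; the helper name `split_into_parts` is hypothetical and the code is not the article's solution.

```python
def split_into_parts(lines, n):
    """Split a list of lines into n consecutive parts whose sizes differ by at most one."""
    q, r = divmod(len(lines), n)
    parts = []
    start = 0
    for k in range(n):
        # The first r parts receive one extra line so that every line is used.
        size = q + 1 if k < r else q
        parts.append(lines[start:start + size])
        start += size
    return parts


parts = split_into_parts(["l1\n", "l2\n", "l3\n", "l4\n", "l5\n"], 2)
print([len(p) for p in parts])
```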
16.py
import sys
if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 16.py 3'.")
    sys.exit()
n = int(sys.argv[1])
file_name = "popular-names.txt"
with open(file_name) as rf:
    lines = rf.readlines()
file_count = 0
i = 1
output = ""
for idx, line in enumerate(lines):
    output += line
    # Keep accumulating until n lines are collected; the last line of the
    # input always triggers a write so a shorter final chunk is not lost.
    if i <= n - 1 and idx != len(lines) - 1:
        i += 1
        continue
    # Build the two-letter suffix (aa, ab, ...) used by split.
    q, mod = divmod(file_count, 26)
    prefix = "./output/16/py_split_file_"
    suffix_1 = chr(ord('a') + q)
    suffix_2 = chr(ord('a') + mod)
    write_file = "{}{}{}".format(prefix, suffix_1, suffix_2)
    with open(write_file, mode='w') as wf:
        wf.write(output)
    file_count += 1
    output = ""
    i = 1
This also reads the entire file as a list of lines.
I then append lines one by one to a variable that holds the pending output, and once N lines have been added, I write them to a file.
Most of the file-output code is spent building output file names that match those produced by the `split` command...
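As a minimal sketch, the two-letter suffixes can also be generated with `itertools.product` instead of `divmod` and `chr`. The generator name `split_style_names` is an assumption for illustration, and like the script above it only covers the 26 × 26 two-letter names.

```python
from itertools import islice, product
from string import ascii_lowercase


def split_style_names(prefix):
    """Yield prefix + 'aa', prefix + 'ab', ... through prefix + 'zz'."""
    for first, second in product(ascii_lowercase, repeat=2):
        yield prefix + first + second


names = list(islice(split_style_names("./output/16/py_split_file_"), 28))
print(names[0], names[25], names[26])
```

The three printed names end in `aa`, `az`, and `ba`, which is the order in which the second letter wraps around.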
The shell script that compares the result with that of `split` is as follows.
16.sh
#!/bin/bash
SH="sh_"
PY="py_"
HEAD="split_file_"
if [ $# -ne 1 ]; then
    echo "$# argument(s) were specified." 1>&2
    echo "To execute, specify one number as an argument." 1>&2
    exit 1
fi
split -l "$1" popular-names.txt ./output/16/$SH$HEAD
# Compare each pair of files that exists in both the split and Python outputs.
for i in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
    for j in a b c d e f g h i j k l m n o p q r s t u v w x y z
    do
        ADDRESS="./output/16/"
        SHFILE=$ADDRESS$SH$HEAD$i$j
        PYFILE=$ADDRESS$PY$HEAD$i$j
        if [ -e "$SHFILE" -a -e "$PYFILE" ]; then
            diff "$SHFILE" "$PYFILE"
        fi
    done
done
Specifying the file names takes some effort, but the key point is the `split` command.
The number of lines per file is specified with the `-l` option.
Everything after `split` is just processing for the comparison with `diff`, so I omit the explanation.
Processing by pandas
16pd.py
import sys
import pandas as pd
if len(sys.argv) != 2:
    print("Set an argument N, for example '$ python 16pd.py 3'.")
    sys.exit()
n = int(sys.argv[1])
data_frame = pd.read_table("popular-names.txt", header=None)
file_count = 0
i = 0
while i < len(data_frame):
    # Build the two-letter suffix (aa, ab, ...) used by split.
    q, mod = divmod(file_count, 26)
    prefix = "./output/16/py_split_file_"
    suffix_1 = chr(ord('a') + q)
    suffix_2 = chr(ord('a') + mod)
    write_file = "{}{}{}".format(prefix, suffix_1, suffix_2)
    # Write rows i to i+n-1 as tab-separated text without index or header.
    data_frame[i:i+n].to_csv(write_file, sep='\t', index=False, header=False)
    i += n
    file_count += 1
With pandas, I solved the problem by slicing the DataFrame from row `i` up to (but not including) row `i + n` and writing each slice to a file.
The same thing can be done without pandas, so it is worth considering which approach suits you better.
Find the distinct strings in the first column (the set of different strings). Use the cut, sort, and uniq commands for confirmation.
17.py
file_name = "popular-names.txt"
with open(file_name) as f:
    lines = f.readlines()
# Take the first column of every line, then remove duplicates and sort.
item1 = list(map(lambda x: x.split()[0], lines))
item1 = list(set(item1))
item1.sort()
print("\n".join(item1))
I read the entire file as a list of lines, used the `map` function to extract the first column, and then converted the resulting list to a `set` to remove duplicates.
Because the check uses the `sort` command, I converted the `set` back into a list and sorted it.
Looking back, it would have been more appropriate to pass the result of `map` directly to `set` rather than converting it to a `list` first...
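A minimal sketch of that idea, building the set directly and sorting it in one step, might look like this. The helper name `first_column_values` and the sample lines are assumptions for illustration.

```python
def first_column_values(lines):
    """Return the distinct first-column strings of tab-separated lines, sorted."""
    # The set comprehension removes duplicates; sorted() returns a new sorted list.
    return sorted({line.split("\t")[0] for line in lines})


# Hypothetical sample in the same layout as popular-names.txt.
sample_lines = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n", "Mary\tF\t6919\t1881\n"]
print("\n".join(first_column_values(sample_lines)))
```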
The shell script is as follows.
17.sh
#!/bin/bash
diff <(cut -f 1 -d $'\t' popular-names.txt | sort | uniq) <(python 17.py)
The `cut` command selects the column number with the `-f` option, and the `-d` option sets the delimiter to the tab character.
The extracted first column is passed to the `sort` command and sorted alphabetically, and then duplicates are removed with the `uniq` command.
With pandas, adding numpy makes this easy, as you would expect from libraries.
17pd.py
import pandas as pd
import numpy as np
data_frame = pd.read_table("popular-names.txt", header=None)
print("\n".join(np.sort(data_frame[0].unique())))
Read the data with pandas and remove duplicates with the `unique()` method. Sorting the result with numpy and `join`ing it with newlines solves the problem in one go.
Sort the rows in descending order of the number in the third column (note: reorder the rows without changing the contents of each row). Use the sort command for confirmation (for this problem, the result does not have to match the output of the command).
From here on, I use only pandas.
18.py
import pandas as pd
data_frame = pd.read_table("popular-names.txt", header=None)
print(data_frame.sort_values(2, ascending=False))
This is easy with the `sort_values()` method.
The first argument specifies the column to sort by.
The second argument, `ascending`, selects ascending or descending order; the default is `True` (ascending), so it is set to `False` here to sort in descending order.
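For comparison, the same descending sort can be sketched without pandas using the built-in `sorted` with a key function. The helper name `sort_by_count_desc` and the sample lines are assumptions for illustration.

```python
def sort_by_count_desc(lines):
    """Sort tab-separated lines by the integer in the third column, largest first."""
    return sorted(lines, key=lambda line: int(line.split("\t")[2]), reverse=True)


# Hypothetical sample in the same layout as popular-names.txt.
sample_lines = ["Anna\tF\t2604\t1880", "Mary\tF\t7065\t1880", "Emma\tF\t2003\t1880"]
for line in sort_by_count_desc(sample_lines):
    print(line)
```

The key converts the third column to `int` so that the comparison is numeric rather than character by character.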
The `sort` command is used as follows.
18.sh
#!/bin/bash
sort -t $'\t' -k 3 -n -r popular-names.txt
The `-t` option specifies the delimiter, the `-k` option specifies the column to sort on, the `-r` option requests descending order, and the `-n` option compares the values as numbers.
Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.
19.py
import pandas as pd
data_frame = pd.read_table("popular-names.txt", header=None)
data_frame_sort = data_frame[0].value_counts()
print(pd.Series(data_frame_sort.index.values, data_frame_sort.values))
The problem is solved by calling the `value_counts()` method on the first column of the DataFrame.
`value_counts()` returns a Series whose `index` holds the unique values and whose data holds their frequencies, sorted in descending order of frequency by default.
For output, I swapped the values and their frequencies to make the result easier to read.
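As a rough standard-library counterpart, `collections.Counter` can produce the same kind of frequency table. This is only a sketch; the helper name `first_column_counts` and the sample lines are assumptions for illustration.

```python
from collections import Counter


def first_column_counts(lines):
    """Return (name, count) pairs for the first column, most frequent first."""
    counts = Counter(line.split("\t")[0] for line in lines)
    return counts.most_common()


# Hypothetical sample in the same layout as popular-names.txt.
sample_lines = ["Mary\tF\t7065\t1880", "Anna\tF\t2604\t1880", "Mary\tF\t6919\t1881"]
for name, count in first_column_counts(sample_lines):
    print(count, name)
```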
The shell script is as follows.
19.sh
#!/bin/bash
cut -f 1 -d $'\t' popular-names.txt | sort | uniq -c | sort -k 1 -n -r
Up to the first `sort`, this is the same as 17.sh, but the `-c` option is given to the `uniq` command to count the number of duplicate lines. The `sort` command is then used again, with `-k` specifying the first column (the count), to sort numerically in descending order.
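The reason `sort` comes before `uniq -c` is that `uniq` only merges adjacent equal lines. A small sketch with `itertools.groupby`, which groups adjacent equal items in a similar way, illustrates this; the function name `uniq_c` is hypothetical.

```python
from itertools import groupby


def uniq_c(items):
    """Count runs of adjacent equal items, similar to what uniq -c does for lines."""
    return [(len(list(group)), key) for key, group in groupby(items)]


names = ["Mary", "Anna", "Mary"]
print(uniq_c(names))          # unsorted: the two "Mary" entries are counted separately
print(uniq_c(sorted(names)))  # sorted: equal names are adjacent and counted together
```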
In this article, I worked through problems 15 to 19 of Chapter 2 (UNIX commands) of Language Processing 100 Knock 2020 Edition.
Although I struggled with the UNIX commands, working on these problems also helped me get familiar with pandas. I learned that many tasks become easy with a library, so I would like to make more use of libraries from now on.
I still have a lot to learn, so if you have a better answer, please let me know! Thank you.
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]
- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]
- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]