100 Language Processing Knock with Python (Chapter 2, Part 1)

Introduction

I will be taking on the 100 Language Processing Knock (nlp100) exercises published on the Tohoku University Inui-Okazaki Laboratory web page as training in natural language processing and Python. I plan to keep notes on the code I implement and on the techniques worth remembering along the way. The code will also be published on GitHub.

This is a continuation of Chapter 1. For this chapter only, I think it is worth adding brief explanations of the UNIX commands alongside the Python. For the detailed options of each UNIX command, check the man pages or ITpro's website and you will be able to study them properly!

Chapter 2: UNIX Command Basics

hightemp.txt is a file that stores records of the highest temperatures observed in Japan, in tab-delimited format with the fields "prefecture", "location", "℃", and "date". Create programs that perform the following processing with hightemp.txt as the input file. Furthermore, perform the same processing with UNIX commands and use the results to check the output of your programs.

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

Answer in Python

10.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 10.py

import sys

f = open(sys.argv[1])
lines = f.readlines()
print(len(lines))

f.close()

Comments on Python Answers

Since the problem statement says to use hightemp.txt as the input file, I designed the script to take a command line argument via sys.argv. It is run as $ python 10.py hightemp.txt, so in this case sys.argv[0] == "10.py" and sys.argv[1] == "hightemp.txt"; in other words, those strings are stored in the list.
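
As a minimal illustration of how sys.argv is populated (the script name args_demo.py below is just a hypothetical example, not part of the exercises), consider the following sketch:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# args_demo.py -- hypothetical helper that only echoes its command line arguments

import sys

# sys.argv[0] is the script name; the remaining elements are the arguments as strings
print(sys.argv)

# Running "python args_demo.py hightemp.txt" would print:
# ['args_demo.py', 'hightemp.txt']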

Regarding reading files

  1. f = open (filename)
  2. hoge = f.read() / f.readline() / f.readlines()
  3. f.close()

That is the general flow. The three methods that appear in step 2 behave differently: read() returns the entire file as a single string, readline() returns one line per call, and readlines() returns a list of all lines. Use whichever fits the task, as in the sketch below.
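
For a quick comparison, here is a minimal sketch of the three methods following the open/close flow above (it assumes hightemp.txt is in the current directory):

# Sketch: the three ways of reading, assuming hightemp.txt is in the current directory

f = open("hightemp.txt")
whole_text = f.read()        # read(): the entire file as a single string
f.close()

f = open("hightemp.txt")
first_line = f.readline()    # readline(): one line per call, including its trailing "\n"
f.close()

f = open("hightemp.txt")
all_lines = f.readlines()    # readlines(): a list of all lines, each with its trailing "\n"
f.close()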

Pattern using with

Besides the style above that explicitly calls close(), a file can also be read (or written) using with. This style is recommended because it prevents forgetting close() and cleans up properly even when an exception occurs. The following program is a trial rewrite of 10.py using with.

When using the with syntax


#!/usr/bin/env python
# -*- coding:utf-8 -*-
# 10.py

import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()

print(len(lines))

From here on, I will use with for reading and writing files in principle, and only fall back to the legacy style in cases where with cannot be used (are there any?).

UNIX answer

$ wc -l hightemp.txt
      24 hightemp.txt

Comments on UNIX Answers

The wc command displays the number of lines, words, and bytes in a file. If no option is specified, they are output in that order, as shown below.

$ wc hightemp.txt 
      24      98     813 hightemp.txt

The options are -l for the number of lines, -w for the number of words, and -c for the number of bytes.
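
For reference, a rough Python counterpart of these three counts might look like the sketch below; it reads the file in binary mode so that the last number is a byte count like wc -c, though the line and word counts are only approximations of what wc reports.

# Sketch: rough wc-like counts (lines, words, bytes); only an approximation of wc
import sys

with open(sys.argv[1], "rb") as f:
    data = f.read()

# splitlines() approximates wc -l, split() approximates wc -w, len() matches wc -c
print(len(data.splitlines()), len(data.split()), len(data))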

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

Answer in Python

11.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 11.py

import sys

# Read the whole file as one string, then replace every tab with a space
with open(sys.argv[1]) as f:
    text = f.read()

print(text.replace("\t", " "))

Comments on Python Answers

Unlike the previous problem, which worked line by line, this time I just want to replace characters in one go, so I simply use read(). The replace() method, which appeared in the previous chapter, replaces each tab character (\t) with a space. I was not thrilled about the extra blank line left at the end of the output, but it turns out it can be avoided: by default, print() appends a newline. In Python 2 you can suppress it by adding a trailing comma, as in print "hogehoge",; in Python 3 you can specify the terminator with the end argument, as in print("hogehoge", end="").
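
For instance, under Python 3 the tab replacement could be printed without the extra blank line like this (just a sketch of the end= behavior, not part of the original answer):

# Sketch (Python 3): print the result without the trailing newline added by print()
import sys

with open(sys.argv[1]) as f:
    text = f.read()

print(text.replace("\t", " "), end="")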

UNIX answer

# sed version (note that it depends on the environment)
$ sed -e s/$'\t'/" "/g hightemp.txt
# tr version
$ cat hightemp.txt | tr "\t" " "
# expand version
$ expand -t 1 hightemp.txt

# The result is the same for all three
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
(Omitted...)
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02

Comments on UNIX Answers

sed is a convenient command that can handle all kinds of text editing, but for a narrow purpose like this one (character replacement) it is wiser to reach for the dedicated command, tr. Conversely, expand has such a limited use case that you may rarely get a chance to touch it.

12. Save the first column in col1.txt and the second column in col2.txt

Save the extracted version of only the first column of each row as col1.txt and the extracted version of only the second column as col2.txt. Use the cut command for confirmation.

Answer in Python

12.py


#! /usr/bin/env python
# -*- coding:utf-8 -*-
# 12.py

import sys


def write_col(source_lines, column_number, filename):
    # Extract the column at column_number from every line and write it to filename
    col = []
    for line in source_lines:
        col.append(line.split()[column_number] + "\n")
    with open(filename, "w") as writer:
        writer.writelines(col)


with open(sys.argv[1]) as f:
    lines = f.readlines()

write_col(lines, 0, "col1.txt")
write_col(lines, 1, "col2.txt")


Comments on Python Answers

I turned it into a function because the same processing is performed twice: it takes the list of lines as its first argument, extracts the column specified by the second argument, and writes it to the file named by the third argument. A newline character is added in the append() call so that the output file is formatted properly.

Nothing new is used here, so there is not much more to comment on, but I am a little embarrassed that the algorithm differs from program to program... Details will come later, but I am posting it as-is for future reflection.
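
As an aside, the body of write_col could also be condensed with a list comprehension, in the same spirit as the rewrite shown in 13. below; a minimal sketch with the same behavior:

# Sketch: write_col rewritten with a list comprehension (same behavior as above)
def write_col(source_lines, column_number, filename):
    col = [line.split()[column_number] + "\n" for line in source_lines]
    with open(filename, "w") as writer:
        writer.writelines(col)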

UNIX answer

$ cut -f 1 hightemp.txt
Kochi Prefecture
Saitama Prefecture
Gifu Prefecture
(Omitted...)
Yamanashi Prefecture
Yamagata Prefecture
Aichi Prefecture
$ cut -f 2 hightemp.txt
Ekawasaki
Kumagaya
Tajimi
(Omitted...)
Otsuki
Tsuruoka
Nagoya

Comments on UNIX Answers

As with the Python version, what we are doing is specifying the field (column) with -f. Note that Python indexing was zero-based (column 0, column 1, ...), whereas the UNIX command is one-based (column 1, column 2, ...).

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

Answer in Python

13.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 13.py

with open("col1.txt") as f1, open("col2.txt") as f2:
    lines1, lines2 = f1.readlines(), f2.readlines()

with open("merge.txt", "w") as writer:
    for col1, col2 in zip(lines1, lines2):
        writer.write("\t".join([col1.rstrip(), col2]))

Comments on Python Answers

I am getting used to Python, so I wrote the reading part in the first half rather casually; being able to write it this way is quite handy. For the writing part in the second half, I used zip() as a review of Chapter 1. Unlike in 12., this time both col1 and col2 still end with a newline character, so the trailing newline of col1 is removed with rstrip().

As a further review, here is the same thing rewritten with a list comprehension.

When rewritten with a list comprehension


# Inside parentheses, the code is parsed correctly even when it spans multiple lines
with open("merge.txt", "w") as writer:
    writer.write(
        "\n".join(
            ["\t".join([col1.rstrip(), col2.rstrip()])
                for col1, col2 in zip(lines1, lines2)]
        )
    )
    

Comparison of execution time

Since several different styles have come up, I measured and compared the execution time of each method using timeit.

Execution time measurement program using timeit


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 13_timeit.py

import timeit

# Preprocessing: read col1.txt and col2.txt
s0 = """
with open("col1.txt") as f1, open("col2.txt") as f2:
    lines1, lines2 = f1.readlines(), f2.readlines()
"""

# Naive implementation: concatenate strings
s1 = """
merged_txt = ""
for i in xrange(len(lines1)):
    merged_txt = merged_txt + lines1[i].rstrip() + "\t" + lines2[i]

with open("merge.txt", "w") as writer:
    writer.write(merged_txt)
"""

#Implementation using zip
s2 = """
with open("merge.txt", "w") as writer:
    for col1, col2 in zip(lines1, lines2):
        writer.write("\t".join([col1.rstrip(), col2]))
"""

# Implementation using a list comprehension
# If "\n" is not escaped as "\\n", a SyntaxError occurs: the newline would be expanded
# inside this triple-quoted string and break the inner string literal when timeit compiles it
s3 = """
with open("merge.txt", "w") as writer:
    writer.write(
        "\\n".join(
            ["\t".join([col1.rstrip(), col2.rstrip()])
                for col1, col2 in zip(lines1, lines2)]
        )
    )
"""

print("naive:", timeit.repeat(stmt=s1, setup=s0, number=100000))
print("zip:", timeit.repeat(stmt=s2, setup=s0, number=100000))
print("connotation:", timeit.repeat(stmt=s3, setup=s0, number=100000))

These are the execution times in seconds for running each of the three methods 100,000 times, repeated 3 times (the default). According to the official documentation, execution time should be judged by the minimum value, not by the average or the maximum.

Execution result


$ python 13_timeit.py
('naive:', [32.61601686477661, 47.96871089935303, 33.15881299972534])
('zip:', [49.846755027770996, 45.05450105667114, 58.70397615432739])
('connotation:', [46.472286224365234, 52.708040952682495, 46.71139121055603])

Judging by execution time alone, the naive method of simply concatenating strings turned out to be the fastest, even when the order of measurement was changed. List comprehensions are generally said to offer a speedup, so where exactly is the boundary between the cases that do speed up and those that do not?
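
Incidentally, following the documentation's advice about taking the minimum, the three print lines at the end of 13_timeit.py could be replaced with something like the following sketch (it reuses the s0-s3 strings defined above):

# Sketch: report only the minimum of each repeat, as the timeit documentation suggests
print("naive:", min(timeit.repeat(stmt=s1, setup=s0, number=100000)))
print("zip:", min(timeit.repeat(stmt=s2, setup=s0, number=100000)))
print("connotation:", min(timeit.repeat(stmt=s3, setup=s0, number=100000)))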

UNIX answer

$ paste col1.txt col2.txt 
Kochi Prefecture Ekawasaki
Saitama Prefecture Kumagaya
Gifu Prefecture Tajimi
(Omitted...)
Yamanashi Prefecture Otsuki
Yamagata Prefecture Tsuruoka
Aichi Prefecture Nagoya

Comments on UNIX Answers

The paste command concatenates files horizontally, line by line. The default delimiter is a tab, but it can be changed with the -d option.
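
For comparison, a Python analogue of paste with a non-default delimiter might look like the sketch below (the comma delimiter is just an example, mirroring what -d does):

# Sketch: emulate "paste -d , col1.txt col2.txt" in Python (comma is an example delimiter)
with open("col1.txt") as f1, open("col2.txt") as f2:
    for a, b in zip(f1, f2):
        print(",".join([a.rstrip("\n"), b.rstrip("\n")]))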

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.

Answer in Python

14.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 14.py

# Usage: python 14.py [filename] [number of lines]

import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()

# Print the first N lines; the trailing comma suppresses print's own newline
for line in lines[:int(sys.argv[2])]:
    print line,

Comments on Python Answers

At first, I used the following implementation using xrange (), but

Implementation using xrange()


# (Omitted)

for i in xrange(int(sys.argv[2])):
    print lines[i],

With this version, an IndexError is raised when the specified number exceeds the number of lines in the file, so I think it is wiser to implement it with a slice. Arguably the real problem is that the input is not validated and no error handling is written in the first place... As for the output, as explained in 11., a trailing comma is added to the print statement to suppress the unnecessary newline.
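
If one did want to validate the input, a minimal sketch might look like the following; it keeps the Python 2 print style of the original 14.py, and the error messages are my own illustrative wording:

# Sketch: 14.py with basic argument checking (messages are illustrative only)
import sys

if len(sys.argv) != 3:
    sys.exit("Usage: python 14.py [filename] [number of lines]")

try:
    n = int(sys.argv[2])
except ValueError:
    sys.exit("The number of lines must be an integer")

with open(sys.argv[1]) as f:
    lines = f.readlines()

for line in lines[:n]:
    print line,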

UNIX answer

$ head -3 hightemp.txt
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9	2007-08-16
Gifu Prefecture Tajimi 40.9	2007-08-16

Comments on UNIX Answers

This one is also simple: you specify the number of lines as an option.

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

Answer in Python

15.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 15.py

import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()

# Print the last N lines using a slice counted from the end
for line in lines[len(lines) - int(sys.argv[2]):]:
    print line,

Comments on Python Answers

This is almost the same as 14., although the slice specification is a little more involved.
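
A slightly shorter alternative is a negative slice, as in the sketch below; note (my own observation, not part of the original) that the two versions differ when N exceeds the number of lines in the file.

# Sketch: 15.py rewritten with a negative slice (Python 2 print style, as in the original)
import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()

n = int(sys.argv[2])
for line in lines[-n:]:
    print line,

# Note: if n exceeds len(lines), lines[-n:] returns the whole file, whereas
# lines[len(lines) - n:] wraps around to a negative start index and returns fewer lines.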

UNIX answer

$ tail -3 hightemp.txt
Yamanashi Prefecture Otsuki 39.9	1990-07-19
Yamagata Prefecture Tsuruoka 39.9	1978-08-03
Aichi Prefecture Nagoya 39.9	1942-08-02

Comments on UNIX Answers

Almost the same as head.

In Conclusion

Since this has become long, I have split the article on Chapter 2. It continues in Chapter 2, Part 2.
