A Python beginner tries the 100 Language Processing Knock: 10-13

Chapter 2 starts today. This is a continuation of my previous post: Python inexperienced person tries to knock 100 language processing 07-09 https://qiita.com/earlgrey914/items/a7b6781037bc0844744b

When I wrote in the Chapter 1 post that it took 7 hours, someone asked, "What do you do for work?" I do work, of course.


Preparation

hightemp.txt is a file storing records of the highest temperatures in Japan, in tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create programs that perform the following processing with hightemp.txt as the input file. Furthermore, execute the same processing with UNIX commands and check the programs' results.

So the gist of Chapter 2 seems to be:

- Use hightemp.txt as the input file
- Write a Python program that performs the specified processing
- Try the same processing with UNIX commands and compare

The contents of hightemp.txt look like this: tab-delimited data, 24 rows by 4 columns.

hightemp.txt


Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
Wakayama Prefecture Katsuragi 40.6 1994-08-08
Shizuoka Prefecture Tenryu 40.6 1994-08-04
Yamanashi Prefecture Katsunuma 40.5 2013-08-10
Saitama Prefecture Koshigaya 40.4 2007-08-16
Gunma Prefecture Tatebayashi 40.3 2007-08-16
Gunma Prefecture Kamisatomi 40.3 1998-07-04
Aichi Prefecture Aisai 40.3 1994-08-05
Chiba Prefecture Ushiku 40.2 2004-07-20
Shizuoka Prefecture Sakuma 40.2 2001-07-24
Ehime Prefecture Uwajima 40.2 1927-07-22
Yamagata Prefecture Sakata 40.1 1978-08-03
Gifu Prefecture Mino 40 2007-08-16
Gunma Prefecture Maebashi 40 2001-07-24
Chiba Prefecture Mobara 39.9 2013-08-11
Saitama Prefecture Hatoyama 39.9 1997-07-05
Osaka Prefecture Toyonaka 39.9 1994-08-08
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02

I'm using AWS Cloud9 as my Python execution environment, so I start by uploading this txt file there.

As an aside, Cloud9 is really convenient. I'm just happy to be able to develop with a GUI natively in the browser (I realize that's a strange thing to say).

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

Well then. First up: how to read a txt file in Python. I know this one. The .txt is in the same place as the .py, so this should be fine.

yomikoku.py


with open('hightemp.txt') as f:
    s = f.read()
    print(s)
Traceback (most recent call last):
  File "/home/ec2-user/knock/02/enshu11.py", line 6, in <module>
    with open('hightemp.txt') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'hightemp.txt'

Oh. It's no good.

~ 3 minutes of googling ~


Reference URL
https://qiita.com/nagamee/items/b7d1b02074293fdfdfff

korede.py


import os.path

#The origin is the location of this py file
os.chdir((os.path.dirname(os.path.abspath(__file__))))

with open('hightemp.txt') as f:
    s = f.read()
    print(s)
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9    2007-08-16
:

This works. Is this `os.chdir(os.path.dirname(os.path.abspath(__file__)))` incantation something I should just always write from now on? Whether it's needed (or whether you'd write it at all) probably depends on the execution environment...
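As one hedged alternative to that chdir incantation, `pathlib` can resolve the data file's path relative to the script without changing the working directory. A sketch; it assumes the code runs from a .py file, where `__file__` is defined:

```python
from pathlib import Path

# The directory containing this script, regardless of the
# current working directory at launch time.
here = Path(__file__).resolve().parent
target = here / "hightemp.txt"
print(target.name)  # hightemp.txt
```

You would then pass `target` straight to `open()`, leaving the working directory alone for the rest of the process.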

So there seem to be several ways to read a file's contents after `open()`-ing it. Since this problem asks to count the number of lines, `readlines()`, which returns the lines as a list, seems like the best fit.

enshu10.py


import os.path

#The origin is the location of this py file
os.chdir((os.path.dirname(os.path.abspath(__file__))))

with open('hightemp.txt') as f:
    s = f.readlines()
    print(len(s))
24

Easy. The problem says to do the same thing with a UNIX command, so let's run that too.

[ec2-user@ip-172-31-34-215 02]$ wc -l hightemp.txt 
24 hightemp.txt

The file name gets in the way, so let's pipe through cat.

[ec2-user@ip-172-31-34-215 02]$ cat hightemp.txt | wc -l
24
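As an aside, `readlines()` loads the whole file into memory just to count it. Since iterating over a file object yields one line at a time, the count can also be done lazily; a sketch using `io.StringIO` as a stand-in for hightemp.txt:

```python
import io

# io.StringIO stands in for an open hightemp.txt; iterating a file
# object yields one line at a time, so the whole file never has to
# sit in memory at once.
sample = io.StringIO("line 1\nline 2\nline 3\n")
count = sum(1 for _ in sample)
print(count)  # 3
```

For a 24-line file it makes no practical difference, but the same pattern scales to files that don't fit in memory.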

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

Isn't it easier than Chapter 1?

enshu11.py


import os.path

os.chdir((os.path.dirname(os.path.abspath(__file__))))

with open('hightemp.txt', mode="r") as f:
    s = f.read()
    tikango = s.replace("\t", " ") 
    
with open('hightemp.txt', mode="w") as f:
    f.write(tikango)

hightemp.txt


Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
Wakayama Prefecture Katsuragi 40.6 1994-08-08
Shizuoka Prefecture Tenryu 40.6 1994-08-04
Yamanashi Prefecture Katsunuma 40.5 2013-08-10
Saitama Prefecture Koshigaya 40.4 2007-08-16
Gunma Prefecture Tatebayashi 40.3 2007-08-16
Gunma Prefecture Kamisatomi 40.3 1998-07-04
Aichi Prefecture Aisai 40.3 1994-08-05
Chiba Prefecture Ushiku 40.2 2004-07-20
Shizuoka Prefecture Sakuma 40.2 2001-07-24
Ehime Prefecture Uwajima 40.2 1927-07-22
Yamagata Prefecture Sakata 40.1 1978-08-03
Gifu Prefecture Mino 40 2007-08-16
Gunma Prefecture Maebashi 40 2001-07-24
Chiba Prefecture Mobara 39.9 2013-08-11
Saitama Prefecture Hatoyama 39.9 1997-07-05
Osaka Prefecture Toyonaka 39.9 1994-08-08
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
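Incidentally, `replace()` isn't the only option in Python. A small sketch of equivalents on a hypothetical one-row sample (`re.sub` is the closest analogue to sed, `str.translate` to tr):

```python
import re

# A hypothetical one-row sample in the original tab-delimited format.
s = "Kochi Prefecture\tEkawasaki\t41\t2013-08-12"

a = s.replace("\t", " ")                   # str.replace, as in enshu11.py
b = re.sub(r"\t", " ", s)                  # regex substitution, like sed
c = s.translate(str.maketrans("\t", " "))  # char-for-char mapping, like tr

assert a == b == c
print(a)
```

For a fixed single-character substitution like this they all do the same thing; `re.sub` only starts to pay off once the pattern is more than a literal.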

Try replacing it with sed in the terminal as well.

[ec2-user@ip-172-31-34-215 02]$ sed -i -e "s/\t/ /g" hightemp.txt
[ec2-user@ip-172-31-34-215 02]$ cat hightemp.txt 
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
Wakayama Prefecture Katsuragi 40.6 1994-08-08
Shizuoka Prefecture Tenryu 40.6 1994-08-04
Yamanashi Prefecture Katsunuma 40.5 2013-08-10
Saitama Prefecture Koshigaya 40.4 2007-08-16
Gunma Prefecture Tatebayashi 40.3 2007-08-16
Gunma Prefecture Kamisatomi 40.3 1998-07-04
Aichi Prefecture Aisai 40.3 1994-08-05
Chiba Prefecture Ushiku 40.2 2004-07-20
Shizuoka Prefecture Sakuma 40.2 2001-07-24
Ehime Prefecture Uwajima 40.2 1927-07-22
Yamagata Prefecture Sakata 40.1 1978-08-03
Gifu Prefecture Mino 40 2007-08-16
Gunma Prefecture Maebashi 40 2001-07-24
Chiba Prefecture Mobara 39.9 2013-08-11
Saitama Prefecture Hatoyama 39.9 1997-07-05
Osaka Prefecture Toyonaka 39.9 1994-08-08
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02

12. Save the first column in col1.txt and the second column in col2.txt

Save only the first column of each row as col1.txt and the second column as col2.txt. Use the cut command for confirmation.

I feel like these are suddenly getting easier.

enshu12.py


import os.path

os.chdir((os.path.dirname(os.path.abspath(__file__))))

with open('hightemp.txt', mode="r") as f:
    linedata = f.readlines()
    for l in linedata:
        with open('col1.txt', mode="a") as c1:
            c1.write(l.split(" ")[0] + "\r")
        with open('col2.txt', mode="a") as c2:
            c2.write(l.split(" ")[1] +"\r")

col1.txt


Kochi Prefecture
Saitama Prefecture
Gifu Prefecture
Yamagata Prefecture
Yamanashi Prefecture
Wakayama Prefecture
Shizuoka Prefecture
Yamanashi Prefecture
Saitama Prefecture
Gunma Prefecture
Gunma Prefecture
Aichi Prefecture
Chiba Prefecture
Shizuoka Prefecture
Ehime Prefecture
Yamagata Prefecture
Gifu Prefecture
Gunma Prefecture
Chiba Prefecture
Saitama Prefecture
Osaka Prefecture
Yamanashi Prefecture
Yamagata Prefecture
Aichi Prefecture

col2.txt


Ekawasaki
Kumagaya
Tajimi
Yamagata
Kofu
Katsuragi
Tenryu
Katsunuma
Koshigaya
Tatebayashi
Kamisatomi
Aisai
Ushiku
Sakuma
Uwajima
Sakata
Mino
Maebashi
Mobara
Hatoyama
Toyonaka
Otsuki
Tsuruoka
Nagoya

The cut command looks like this.

[ec2-user@ip-172-31-34-215 02]$ cut -f 1 -d " " hightemp.txt > col1_command.txt 
[ec2-user@ip-172-31-34-215 02]$ cut -f 2 -d " " hightemp.txt > col2_command.txt

Compare with diff ...

[ec2-user@ip-172-31-34-215 02]$ diff col1.txt col1_command.txt 
1c1,24
< Aichi Prefecture
\ No newline at end of file
---
> Kochi Prefecture
> Saitama Prefecture
> Gifu Prefecture
> Yamagata Prefecture
> Yamanashi Prefecture
> Wakayama Prefecture
> Shizuoka Prefecture
> Yamanashi Prefecture
> Saitama Prefecture
> Gunma Prefecture
> Gunma Prefecture
> Aichi Prefecture
> Chiba Prefecture
> Shizuoka Prefecture
> Ehime Prefecture
> Yamagata Prefecture
> Gifu Prefecture
> Gunma Prefecture
> Chiba Prefecture
> Saitama Prefecture
> Osaka Prefecture
> Yamanashi Prefecture
> Yamagata Prefecture
> Aichi Prefecture

Huh!? Come to think of it, nothing showed up with `cat col1.txt` either... **It's the line ending!** So I changed the line ending from `\r` to `\n` and specified `utf-8` as the encoding when writing the files.
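To see why `\r` caused the mess, a small sketch on hypothetical strings:

```python
# "\r" on its own is not a line terminator for POSIX tools like diff
# and cat, so the whole file looked like one long line to them.
# Python's splitlines() *does* split on "\r", which is why the bug
# was easy to miss from the Python side.
s = "Kochi Prefecture\rSaitama Prefecture\r"

print(s.count("\n"))   # 0: no POSIX-style newlines in the file at all
print(s.splitlines())  # ['Kochi Prefecture', 'Saitama Prefecture']
```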

enshu12.py


import os.path

os.chdir((os.path.dirname(os.path.abspath(__file__))))

with open('hightemp.txt', mode="r") as f:
    linedata = f.readlines()
    for l in linedata:
        with open('col1.txt', mode="a", encoding="utf-8") as c1:
            c1.write(l.split(" ")[0] + "\n")
        with open('col2.txt', mode="a", encoding="utf-8") as c2:
            c2.write(l.split(" ")[1] +"\n")

Execution confirmation

[ec2-user@ip-172-31-34-215 02]$ python3 enshu12.py
[ec2-user@ip-172-31-34-215 02]$ 
[ec2-user@ip-172-31-34-215 02]$ cut -f 1 -d " " hightemp.txt > col1_command.txt
[ec2-user@ip-172-31-34-215 02]$ cut -f 2 -d " " hightemp.txt > col2_command.txt
[ec2-user@ip-172-31-34-215 02]$ diff col1.txt col1_command.txt
[ec2-user@ip-172-31-34-215 02]$ diff col2.txt col2_command.txt
[ec2-user@ip-172-31-34-215 02]$ 

It's ok.
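One more thing worth noting: because enshu12.py opens the output files in append mode (`"a"`), running it twice doubles the contents of col1.txt and col2.txt. A minimal sketch that opens each output once in `"w"` mode instead, using a hypothetical two-row sample and a temporary directory so it is self-contained:

```python
import os
import tempfile

# Hypothetical two-row stand-in for the space-delimited hightemp.txt
# produced in problem 11.
sample = "Kochi Ekawasaki 41 2013-08-12\nSaitama Kumagaya 40.9 2007-08-16\n"

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "hightemp.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write(sample)

    # Open each output file exactly once, in "w" mode, so re-running
    # overwrites instead of appending duplicates.
    with open(src, encoding="utf-8") as f, \
         open(os.path.join(d, "col1.txt"), "w", encoding="utf-8") as c1, \
         open(os.path.join(d, "col2.txt"), "w", encoding="utf-8") as c2:
        for line in f:
            cols = line.split(" ")
            c1.write(cols[0] + "\n")
            c2.write(cols[1] + "\n")

    with open(os.path.join(d, "col1.txt"), encoding="utf-8") as f:
        col1_result = f.read()

print(col1_result)
```

Opening the files once outside the loop also avoids re-opening them for every row, which the append-inside-the-loop version does.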

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in 12, and create a text file in which the first and second columns of the original file are arranged by tab delimiters. Use the paste command for confirmation.

Maybe it's like this, but is there a better way?

tabun.py



with open col1.txt:
    put all lines into list 1

with open col2.txt:
    put all lines into list 2

for each index i:
    output file: write(list 1[i] + "\t" + list 2[i])

~ 20 minutes later ~

enshu13.py


import os.path

os.chdir((os.path.dirname(os.path.abspath(__file__))))

linedata_col1 = []
linedata_col2 = []

with open('col1.txt', mode="r") as f:
    linedata_col1 = f.read().splitlines()


with open('col2.txt', mode="r") as f:
    linedata_col2 = f.read().splitlines()

with open('merge.txt', mode="a", encoding="utf-8") as f:
    for c1, c2 in zip(linedata_col1, linedata_col2):
        f.write(c1 + "\t" + c2 + "\n")

merge.txt


Kochi Prefecture	Ekawasaki
Saitama Prefecture	Kumagaya
Gifu Prefecture	Tajimi
Yamagata Prefecture	Yamagata
Yamanashi Prefecture	Kofu
Wakayama Prefecture	Katsuragi
Shizuoka Prefecture	Tenryu
Yamanashi Prefecture	Katsunuma
Saitama Prefecture	Koshigaya
Gunma Prefecture	Tatebayashi
Gunma Prefecture	Kamisatomi
Aichi Prefecture	Aisai
Chiba Prefecture	Ushiku
Shizuoka Prefecture	Sakuma
Ehime Prefecture	Uwajima
Yamagata Prefecture	Sakata
Gifu Prefecture	Mino
Gunma Prefecture	Maebashi
Chiba Prefecture	Mobara
Saitama Prefecture	Hatoyama
Osaka Prefecture	Toyonaka
Yamanashi Prefecture	Otsuki
Yamagata Prefecture	Tsuruoka
Aichi Prefecture	Nagoya

The small bit of ingenuity here is `linedata_col1 = f.read().splitlines()`. **Reading line by line with `f.readlines()` also works, but then you get a list that includes the newline characters, like this:**

readlinesdato.py


with open('col1.txt', mode="r") as f:
    linedata_col1 = f.readlines()
    print(linedata_col1)
['Kochi Prefecture\n', 'Saitama Prefecture\n', 'Gifu Prefecture\n', 'Yamagata Prefecture\n', 'Yamanashi Prefecture\n', 'Wakayama Prefecture\n', 'Shizuoka Prefecture\n', 'Yamanashi Prefecture\n', 'Saitama Prefecture\n', 'Gunma Prefecture\n', 'Gunma Prefecture\n', 'Aichi Prefecture\n', 'Chiba Prefecture\n', 'Shizuoka Prefecture\n', 'Ehime Prefecture\n', 'Yamagata Prefecture\n', 'Gifu Prefecture\n', 'Gunma Prefecture\n', 'Chiba Prefecture\n', 'Saitama Prefecture\n', 'Osaka Prefecture\n', 'Yamanashi Prefecture\n', 'Yamagata Prefecture\n', 'Aichi Prefecture\n']

Rather than bothering to strip those newline characters afterwards, I figured it was better to read the whole thing as one string with `read()` and split it into a list at the newlines with `splitlines()`.
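The difference is easy to see on a hypothetical two-line sample:

```python
# A two-line sample with a trailing newline, like col1.txt.
s = "Kochi Prefecture\nSaitama Prefecture\n"

# splitlines() drops the line terminators that readlines() keeps.
print(s.splitlines())  # ['Kochi Prefecture', 'Saitama Prefecture']

# The readlines()-style equivalent would need the newlines stripped:
cleaned = [line.rstrip("\n") for line in s.splitlines(keepends=True)]
print(cleaned)  # same list as above
```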

Then compare with paste.

[ec2-user@ip-172-31-34-215 02]$ python3 enshu13.py
[ec2-user@ip-172-31-34-215 02]$ paste col1.txt col2.txt > merge_command.txt
[ec2-user@ip-172-31-34-215 02]$ diff merge.txt merge_command.txt 
[ec2-user@ip-172-31-34-215 02]$ 
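One caveat about the `zip()` used in enshu13.py: it silently stops at the shorter input, so if col1.txt and col2.txt ever had different line counts, rows would be dropped without any error. A sketch with hypothetical lists showing the difference from `itertools.zip_longest`:

```python
from itertools import zip_longest

# Hypothetical column lists with a deliberate length mismatch.
col1 = ["Kochi Prefecture", "Saitama Prefecture", "Gifu Prefecture"]
col2 = ["Ekawasaki", "Kumagaya"]  # one entry short

pairs = list(zip(col1, col2))              # zip stops at the shorter input
all_pairs = list(zip_longest(col1, col2))  # missing values become None

print(len(pairs))    # 2: the third row was silently dropped
print(all_pairs[2])  # ('Gifu Prefecture', None)
```

Here the two files come from the same source, so the counts match by construction, but `zip_longest` would make a mismatch visible instead of hiding it.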

This chapter feels somewhat easy, but verifying the results has become a chore now that files are involved. I'll continue tomorrow. **Two hours so far!!** I'm doing this pretty lazily, so I wonder how useful this one will turn out to be.
