[PYTHON] 100 Language Processing Knock: Chapter 2 UNIX Command Basics (using pandas)

"Chapter 2: Basics of UNIX Commands" of Language Processing 100 Knock 2015 It is a record of ecei.tohoku.ac.jp/nlp100/#ch2). Chapter 2 is related to CSV file operations. This is a review of what I did over a year ago. At the time I did it, I thought "Python is fine without using UNIX commands", but when dealing with large files, UNIX commands are generally faster. ** UNIX commands are worth remembering **. This time, I'm using a lot of Pandas packages for the Python part. It is really convenient for handling matrix data such as CSV.

Environment

| type | version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I'm using python 3.6.9 on pyenv. There is no deep reason not to use the 3.7 or 3.8 series. Packages are managed using venv |

Chapter 2: UNIX Command Basics

Content of study

Experience useful UNIX tools for research and data analysis. By reimplementing them, you will improve your programming skills while experiencing the ecosystem of existing tools.

head, tail, cut, paste, split, sort, uniq, sed, tr, expand

Knock content

hightemp.txt is a file that stores records of the highest temperatures in Japan in tab-separated format with the fields "prefecture", "point", "℃", and "day". Create programs that perform the following processing with hightemp.txt as the input file. Furthermore, execute the same processing with UNIX commands and check the execution results of the programs.

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

[010. Counting lines.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/010.%E8%A1%8C%E6%95%B0%E3%81%AE%E3%82%AB%E3%82%A6%E3%83%B3%E3%83%88.ipynb)

In Python, reading the whole file at once with readlines should be about the fastest way (I haven't verified this much).

Python part


print(len(open('./hightemp.txt').readlines()))

Terminal output result


24
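By the way, readlines builds the whole list of lines in memory. For a very large file, a streaming count is gentler on memory; here is a minimal sketch of my own (not part of the original knock):


# Count lines without loading the whole file into memory
with open('./hightemp.txt') as f:
    print(sum(1 for _ in f))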

wc is an abbreviation for Word Count. The -l option counts newline characters. **It comes in handy for large files, which take a long time even just to open in a text editor**.

Bash part


wc hightemp.txt -l

Terminal output result


24 hightemp.txt

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

[011. Replace tabs with spaces.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/011.%E3%82%BF%E3%83%96%E3%82%92%E3%82%B9%E3%83%9A%E3%83%BC%E3%82%B9%E3%81%AB%E7%BD%AE%E6%8F%9B.ipynb)

Replace with the replace function. I use pprint because the result is hard to read without line breaks.

Python part


from pprint import pprint

with open('./hightemp.txt') as f:
    pprint([line.replace('\t', ' ') for line in f])

Terminal output result


['Kochi Prefecture Ekawasaki 41 2013-08-12\n',
 'Saitama Prefecture Kumagaya 40.9 2007-08-16\n',
 'Gifu Prefecture Tajimi 40.9 2007-08-16\n',

Omission

 'Yamanashi Prefecture Otsuki 39.9 1990-07-19\n',
 'Yamagata Prefecture Tsuruoka 39.9 1978-08-03\n',
 'Aichi Prefecture Nagoya 39.9 1942-08-02\n']

sed can replace strings and delete lines. Executing this command only prints the result to the terminal; it does not update the contents of the file. I referred to "[sed] Replace character strings and delete lines".

Bash part


sed 's/\t/ /g' ./hightemp.txt

Terminal output result


Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16

Omission

Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
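The sed above only prints to the terminal. If you actually want a file with the tabs replaced, a minimal Python sketch of my own (the output name 011.replaced.txt is hypothetical) would be:


# Read the whole file, replace tabs with spaces, write to a new file
with open('./hightemp.txt') as f:
    text = f.read().replace('\t', ' ')

with open('011.replaced.txt', 'w') as out:  # hypothetical output name
    out.write(text)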

12. Save the 1st column in col1.txt and the 2nd column in col2.txt

Save only the first column of each line as col1.txt and only the second column as col2.txt. Use the cut command for confirmation.

[012. Save the 1st column in col1.txt and the 2nd column in col2.txt.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/012.1%E5%88%97%E7%9B%AE%E3%82%92col1.txt%E3%81%AB%EF%BC%8C2%E5%88%97%E7%9B%AE%E3%82%92col2.txt%E3%81%AB%E4%BF%9D%E5%AD%98.ipynb)

I used pandas. The columns that are read are limited to the 1st and 2nd with the `usecols` parameter. It's convenient.

Python part


import pandas as pd

# Read only the 1st and 2nd columns of the tab-separated file
df = pd.read_table('./hightemp.txt', header=None, usecols=[0, 1])
df[0].to_csv('012.col1.txt', index=False, header=False)
df[1].to_csv('012.col2.txt', index=False, header=False)

Check the contents with cut. I referred to "[cut] command-cut out from a line in fixed length or field units".

Bash part


cut -f 1 ./hightemp.txt
cut -f 2 ./hightemp.txt

Terminal output result (1st column)


Kochi Prefecture
Saitama Prefecture
Gifu Prefecture

Omission

Yamanashi Prefecture
Yamagata Prefecture
Aichi Prefecture

Terminal output result (2nd column)


Ekawasaki
Kumagaya
Tajimi
Yamagata

Omission

Otsuki
Tsuruoka
Nagoya
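For reference, the same extraction is straightforward without pandas too; a minimal sketch of my own using str.split (it reuses the output file names from above):


# Write the 1st and 2nd tab-separated fields to separate files
with open('./hightemp.txt') as f:
    rows = [line.rstrip('\n').split('\t') for line in f]

with open('012.col1.txt', 'w') as f1, open('012.col2.txt', 'w') as f2:
    for row in rows:
        f1.write(row[0] + '\n')
        f2.write(row[1] + '\n')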

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

[013. Merge col1.txt and col2.txt.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/013.col1.txt%E3%81%A8col2.txt%E3%82%92%E3%83%9E%E3%83%BC%E3%82%B8.ipynb)

I read the two files with pandas and concatenate them.

Python part


import pandas as pd

result = pd.read_csv('012.col1.txt', header=None)
result[1] = pd.read_csv('012.col2.txt', header=None)[0]

result.to_csv('013.col1_2.txt', index=False, header=None, sep='\t')

I referred to "A detailed summary of paste commands [Linux command collection]". The output result is omitted.

Bash part


paste 012.col1.txt 012.col2.txt
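For comparison, the same merge works without pandas by zipping the two files line by line; a minimal sketch of my own (it reuses the file names from above):


# Join corresponding lines of the two files with a tab
with open('012.col1.txt') as f1, open('012.col2.txt') as f2, \
        open('013.col1_2.txt', 'w') as out:
    for col1, col2 in zip(f1, f2):
        out.write(col1.rstrip('\n') + '\t' + col2)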

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.

[014. Output N lines from the beginning.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/014.%E5%85%88%E9%A0%AD%E3%81%8B%E3%82%89N%E8%A1%8C%E3%82%92%E5%87%BA%E5%8A%9B.ipynb)

The argument is received with the `input` function.

Python part


from pprint import pprint

n = int(input('N Lines--> '))

with open('hightemp.txt') as f:
    for i, line in enumerate(f):
        if i < n:
            pprint(line)
        else:
            break

Terminal output result


'Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n'
'Saitama Prefecture\tKumagaya\t40.9\t2007-08-16\n'
'Gifu Prefecture\tTajimi\t40.9\t2007-08-16\n'
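Since the problem statement suggests a command line argument, a sys.argv variant is also possible; a minimal sketch of my own, assuming it is run as a standalone script (e.g. python head.py 3, where head.py is a hypothetical file name):


import sys

# Take N from the first command line argument
n = int(sys.argv[1])

with open('hightemp.txt') as f:
    for i, line in enumerate(f):
        if i >= n:
            break
        print(line, end='')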

I referred to "Detailed summary of head command displayed from the beginning of the file [Linux command collection]".

Bash part


head hightemp.txt -n 3

Terminal output result


Kochi Prefecture	Ekawasaki	41	2013-08-12
Saitama Prefecture	Kumagaya	40.9	2007-08-16
Gifu Prefecture	Tajimi	40.9	2007-08-16

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

[015. Output the last N lines.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/015.%E6%9C%AB%E5%B0%BE%E3%81%AEN%E8%A1%8C%E3%82%92%E5%87%BA%E5%8A%9B.ipynb)

This one had me thinking for a while. I didn't want to read the whole file when it is large, so I considered the linecache package, but in the end it would amount to the same thing as reading everything, so I just used readlines.

Python part


from pprint import pprint

n = int(input('N Lines--> '))

with open('hightemp.txt') as f:
    pprint(f.readlines()[-n:])

Terminal output result


['Yamanashi Prefecture\tOtsuki\t39.9\t1990-07-19\n',
 'Yamagata Prefecture\tTsuruoka\t39.9\t1978-08-03\n',
 'Aichi Prefecture\tNagoya\t39.9\t1942-08-02\n']
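If the file really were huge, you can stream it and keep only the last N lines in memory; a minimal sketch of my own using collections.deque:


from collections import deque

n = 3  # number of trailing lines to keep

# A deque with maxlen discards older lines as it streams the file
with open('hightemp.txt') as f:
    print(list(deque(f, maxlen=n)))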

Bash part


tail hightemp.txt -n 3

Terminal output result


Yamanashi Prefecture	Otsuki	39.9	1990-07-19
Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
Aichi Prefecture	Nagoya	39.9	1942-08-02

16. Divide the file into N

Receive the natural number N by means such as a command line argument, and split the input file into N files at line boundaries. Achieve the same processing with the split command.

[016. Divide the file into N.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/016.%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%82%92N%E5%88%86%E5%89%B2%E3%81%99%E3%82%8B.ipynb)

The quotient is rounded up with the ceil function of the math package. The lines for each file are written all at once with the writelines function.

Python part


import math

n = int(input('N splits--> '))

with open('./hightemp.txt') as f:
    lines = f.readlines()

# Round up so that n files always cover every line
unit = math.ceil(len(lines) / n)

for i in range(n):
    with open('016.hightemp{}.txt'.format(i), 'w') as out_file:
        out_file.writelines(lines[i*unit:(i+1)*unit])

I referred to "[split] command-split files".

Bash part


split -n l/3 -d hightemp.txt 016.hightemp-u
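A quick sanity check (my own sketch; it assumes the three Python-generated file names above) is to confirm the pieces concatenate back into the original:


# Concatenate the split pieces and compare with the original file
parts = ''.join(open('016.hightemp{}.txt'.format(i)).read() for i in range(3))
print(parts == open('./hightemp.txt').read())  # expect True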

17. Differences in the character strings in the first column

Find the kinds of strings in the first column (a set of distinct strings). Use the sort and uniq commands for confirmation.

[017. Differences in the character strings in the first column.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/017.%EF%BC%91%E5%88%97%E7%9B%AE%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AE%E7%95%B0%E3%81%AA%E3%82%8A.ipynb)

I used the `unique` function of pandas. `pandas` makes this kind of processing very easy.

Python part


import pandas as pd

df = pd.read_table('hightemp.txt', header=None, usecols=[0])

print(df[0].unique())

Terminal output result


['Kochi Prefecture' 'Saitama Prefecture' 'Gifu Prefecture' 'Yamagata Prefecture' 'Yamanashi Prefecture' 'Wakayama Prefecture' 'Shizuoka Prefecture' 'Gunma Prefecture' 'Aichi Prefecture' 'Chiba Prefecture' 'Ehime Prefecture' 'Osaka Prefecture']
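The same thing without pandas is a one-liner with a set comprehension; a minimal sketch of my own:


# Collect the distinct first-column values
with open('hightemp.txt') as f:
    print({line.split('\t')[0] for line in f})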

I referred to "Sort command summary [Linux command collection]".

Bash part


cut --fields=1 hightemp.txt | sort | uniq

Terminal output result


Chiba Prefecture
Wakayama Prefecture
Saitama Prefecture
Osaka Prefecture
Yamagata Prefecture
Yamanashi Prefecture
Gifu Prefecture
Ehime Prefecture
Aichi Prefecture
Gunma Prefecture
Shizuoka Prefecture
Kochi Prefecture

18. Sort each line in descending order of the numerical value in the third column

Arrange the lines in reverse (descending) order of the numbers in the third column (note: rearrange the lines without changing the contents of each line). Use the sort command for confirmation (this problem does not have to match the result of executing the command).

[018. Sort each line in descending order of the numbers in the third column.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/018.%E5%90%84%E8%A1%8C%E3%82%923%E3%82%B3%E3%83%A9%E3%83%A0%E7%9B%AE%E3%81%AE%E6%95%B0%E5%80%A4%E3%81%AE%E9%99%8D%E9%A0%86%E3%81%AB%E3%82%BD%E3%83%BC%E3%83%88.ipynb)

I used the sort_values function of pandas.

Python part


import pandas as pd

# Read all columns this time; column 2 is needed as the sort key
df = pd.read_table('hightemp.txt', header=None)

print(df.sort_values(2, ascending=False))

Terminal output result


       0     1     2           3
0   Kochi Prefecture      Ekawasaki  41.0  2013-08-12
2   Gifu Prefecture       Tajimi     40.9  2007-08-16
1   Saitama Prefecture    Kumagaya   40.9  2007-08-16

Omission

21  Yamanashi Prefecture  Otsuki     39.9  1990-07-19
22  Yamagata Prefecture   Tsuruoka   39.9  1978-08-03
23  Aichi Prefecture      Nagoya     39.9  1942-08-02
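Without pandas, the built-in sorted with a numeric key gives the same ordering; a minimal sketch of my own:


# Sort lines by the numeric 3rd column, descending
with open('hightemp.txt') as f:
    lines = f.readlines()

for line in sorted(lines, key=lambda l: float(l.split('\t')[2]), reverse=True):
    print(line, end='')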

Bash part


sort hightemp.txt -k 3 -n -r

Terminal output result


Kochi Prefecture	Ekawasaki	41	2013-08-12
Gifu Prefecture	Tajimi	40.9	2007-08-16
Saitama Prefecture	Kumagaya	40.9	2007-08-16

Omission

Osaka Prefecture	Toyonaka	39.9	1994-08-08
Saitama Prefecture	Hatoyama	39.9	1997-07-05
Chiba Prefecture	Mobara	39.9	2013-08-11

19. Find the frequency of appearance of the character string in the first column of each line and arrange it in descending order of frequency of appearance

Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.

[019. Find the frequency of appearance of the character string in the first column of each line and arrange it in descending order of frequency of appearance.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/02.UNIX%E3%82%B3%E3%83%9E%E3%83%B3%E3%83%89%E3%81%AE%E5%9F%BA%E7%A4%8E/019.%E5%90%84%E8%A1%8C%E3%81%AE1%E3%82%B3%E3%83%A9%E3%83%A0%E7%9B%AE%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AE%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%82%92%E6%B1%82%E3%82%81%EF%BC%8C%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%81%AE%E9%AB%98%E3%81%84%E9%A0%86%E3%81%AB%E4%B8%A6%E3%81%B9%E3%82%8B.ipynb)

I used the value_counts function of pandas.

Python part


import pandas as pd

df = pd.read_table('hightemp.txt', header=None, usecols=[0])

print(df[0].value_counts(ascending=False))

Terminal output result


Saitama Prefecture      3
Yamanashi Prefecture    3
Yamagata Prefecture     3

Omission

Ehime Prefecture        1
Kochi Prefecture        1
Osaka Prefecture        1
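collections.Counter gives the same tally without pandas; a minimal sketch of my own:


from collections import Counter

# Count first-column occurrences and list them most-common-first
with open('hightemp.txt') as f:
    counts = Counter(line.split('\t')[0] for line in f)

for pref, count in counts.most_common():
    print(pref, count)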

Bash part


cut -f 1 hightemp.txt | sort | uniq -c | sort -r

Terminal output result


3 Gunma Prefecture
3 Yamanashi Prefecture
3 Yamagata Prefecture

Omission

1 Ehime Prefecture
1 Osaka Prefecture
1 Wakayama Prefecture
