Language Processing 100 Knock 2020 has been released, so I'm trying it right away. Chapter 1 is exactly the same as the 2015 edition (which I had already worked through), so I'll start with Chapter 2.
Several write-ups have already been published on Qiita, but besides giving an overview of natural language processing, I think the early chapters are useful not only for language processing but also for Linux beginners and people new to programming.
popular-names.txt is a tab-delimited file that stores the "name", "sex", "number of people", and "year" of babies born in the United States. Write a program that performs each of the following tasks and run it with popular-names.txt as the input file. Then perform the same processing with UNIX commands and check the program's output against the command's result.
Count the number of lines. Use the wc command for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(len(df.index))
command
wc -l popular-names.txt
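Incidentally, the same count can be obtained without pandas; this is a minimal sketch that simply iterates over the file (note that wc -l counts newline characters, so the two counts can differ by one if the file has no trailing newline):
# count lines by iterating over the file object
with open('popular-names.txt') as f:
    print(sum(1 for _ in f))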
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.to_csv('popular-names-space.txt', sep=' ', index=False, header=None)
command
sed -e $'s/\t/ /g' popular-names.txt > popular-names-space.txt
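For reference, a pandas-free sketch that stays closer to what sed/tr do, performing a plain text substitution of each tab with a single space (the output filename simply mirrors the one used above):
# replace every tab with a single space, line by line
with open('popular-names.txt') as fin, open('popular-names-space.txt', 'w') as fout:
    for line in fin:
        fout.write(line.replace('\t', ' '))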
Save only the first column of each row as col1.txt and the second column as col2.txt. Use the cut command for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df[0].to_csv('col1.txt', index=False, header=None)
df[1].to_csv('col2.txt', index=False, header=None)
command
cut -f 1 popular-names.txt > col1.txt
cut -f 2 popular-names.txt > col2.txt
Combine the col1.txt and col2.txt created in problem 12, and create a text file in which the first and second columns of the original file are placed side by side, separated by a tab. Use the paste command for confirmation.
code
import pandas as pd
df1 = pd.read_csv('col1.txt', header=None)
df2 = pd.read_csv('col2.txt', header=None)
df_concat = pd.concat([df1, df2], axis=1)
df_concat.to_csv('col3.txt', sep='\t', index=False, header=None)
command
paste col1.txt col2.txt > col3.txt
Receive a natural number N via a command-line argument or other means, and display only the first N lines of the input. Use the head command for confirmation.
code
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.head(n))
command
head -5 popular-names.txt
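Since the problem suggests receiving N as a command-line argument, here is a minimal sketch using sys.argv instead of input() (the script name head_n.py is only for illustration); the same pattern applies to the tail and split problems below.
import sys
import pandas as pd

# N comes from the first command-line argument, e.g. `python head_n.py 5`
n = int(sys.argv[1])
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.head(n))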
Receive a natural number N via a command-line argument or other means, and display only the last N lines of the input. Use the tail command for confirmation.
code
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.tail(n))
command
tail -5 popular-names.txt
Receive a natural number N via a command-line argument or other means, and split the input file into N parts, line by line. Achieve the same processing with the split command.
code
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
for i in range((len(df) + n - 1) // n):
    # write each chunk of n rows to its own file, keeping the tab delimiter
    df[n*i:n*i+n].to_csv('popular-names' + str(i) + '.txt', sep='\t', index=False, header=None)
This isn't very elegant, but I couldn't find a concise way to split a DataFrame into fixed-size chunks of rows.
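One somewhat more concise alternative (a sketch, assuming the default integer index that read_csv produces) is to group rows by their index divided by N, which yields chunks of N lines just like the loop above:
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
# group consecutive rows into chunks of n lines using the integer index
for i, chunk in df.groupby(df.index // n):
    chunk.to_csv('popular-names' + str(i) + '.txt', sep='\t', index=False, header=None)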
command
split -l 200 popular-names.txt popular-names-
Find the distinct strings in the first column (the set of different strings). Use the cut, sort, and uniq commands for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(set(list(df[0])))
command
cut -f 1 popular-names.txt | sort | uniq
`uniq` only removes adjacent duplicate lines, so the input needs to be sorted in advance.
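Since a Python set has no defined order while sort | uniq produces sorted output, sorting the values (here via pandas' unique()) makes the two results easier to compare:
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
# sorted unique values, directly comparable to the `cut | sort | uniq` output
print(sorted(df[0].unique()))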
Sort the lines in descending order of the numeric values in the third column (note: rearrange the lines without changing the content of each line). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.sort_values(2, ascending=False))
command
sort -n -r -k 3 popular-names.txt | head -10
Only the first 10 lines of the output are shown by piping through `head -10`.
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df[0].value_counts())
command
cut -f 1 popular-names.txt | sort | uniq -c | sort -n -r -k 1 | head -10
As before, `head -10` is only there to limit the output to the first 10 lines.
What you can learn in Chapter 2: basic DataFrame operations with pandas, and UNIX commands such as wc, sed, tr, expand, cut, paste, head, tail, split, sort, and uniq.