Language Processing 100 Knock 2020 has been released, so I'm trying it right away. Chapter 1 is exactly the same as the 2015 edition (which I had already worked through), so I'll start with Chapter 2.
Several write-ups have already been published on Qiita, but besides giving an overview of natural language processing, I think the early chapters are useful not only for language processing but also for Linux beginners and people new to programming.
popular-names.txt is a tab-delimited file that stores the "name", "sex", "number of people", and "year" of babies born in the United States. Write a program that performs each of the following tasks and run it with popular-names.txt as the input file. Then perform the same processing with UNIX commands and check the program's output against the command's result.
Count the number of lines. Use the wc command for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(len(df.index))
command
wc -l popular-names.txt
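Incidentally, the same count can be obtained without pandas; this is a minimal sketch that simply iterates over the file (note that wc -l counts newline characters, so the two counts can differ by one if the file has no trailing newline):
# count lines by iterating over the file object
with open('popular-names.txt') as f:
    print(sum(1 for _ in f))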
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.to_csv('popular-names-space.txt', sep=' ', index=False, header=None)
command
sed -e $'s/\t/ /g' popular-names.txt > popular-names-space.txt
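For reference, a pandas-free sketch that stays closer to what sed/tr do, performing a plain text substitution of each tab with a single space (the output filename simply mirrors the one used above):
# replace every tab with a single space, line by line
with open('popular-names.txt') as fin, open('popular-names-space.txt', 'w') as fout:
    for line in fin:
        fout.write(line.replace('\t', ' '))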
Save only the first column of each row as col1.txt and the second column as col2.txt. Use the cut command for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df[0].to_csv('col1.txt', index=False, header=None)
df[1].to_csv('col2.txt', index=False, header=None)
command
cut -f 1 popular-names.txt > col1.txt
cut -f 2 popular-names.txt > col2.txt
Combine the col1.txt and col2.txt created in problem 12, and create a text file in which the first and second columns of the original file are placed side by side, separated by a tab. Use the paste command for confirmation.
code
import pandas as pd
df1 = pd.read_csv('col1.txt', header=None)
df2 = pd.read_csv('col2.txt', header=None)
df_concat = pd.concat([df1, df2], axis=1)
df_concat.to_csv('col3.txt', sep='\t', index=False, header=None)
command
paste col1.txt col2.txt > col3.txt
Receive a natural number N via a command-line argument or other means, and display only the first N lines of the input. Use the head command for confirmation.
code
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.head(n))
command
head -5 popular-names.txt
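Since the problem suggests receiving N as a command-line argument, here is a minimal sketch using sys.argv instead of input() (the script name head_n.py is only for illustration); the same pattern applies to the tail and split problems below.
import sys
import pandas as pd

# N comes from the first command-line argument, e.g. `python head_n.py 5`
n = int(sys.argv[1])
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.head(n))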
Receive a natural number N via a command-line argument or other means, and display only the last N lines of the input. Use the tail command for confirmation.
code
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.tail(n))
command
tail -5 popular-names.txt
Receive a natural number N via a command-line argument or other means, and split the input file into N parts, line by line. Achieve the same processing with the split command.
code
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
for i in range((len(df) + n - 1) // n):
    # write each chunk of n rows to its own file, keeping the tab delimiter
    df[n*i:n*i+n].to_csv('popular-names' + str(i) + '.txt', sep='\t', index=False, header=None)
This isn't very elegant, but I couldn't find a concise way to split a DataFrame into fixed-size chunks of rows.
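One somewhat more concise alternative (a sketch, assuming the default integer index that read_csv produces) is to group rows by their index divided by N, which yields chunks of N lines just like the loop above:
import pandas as pd
n = int(input())
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
# group consecutive rows into chunks of n lines using the integer index
for i, chunk in df.groupby(df.index // n):
    chunk.to_csv('popular-names' + str(i) + '.txt', sep='\t', index=False, header=None)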
command
split -l 200 popular-names.txt popular-names-
Find the distinct strings in the first column (the set of different strings). Use the cut, sort, and uniq commands for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(set(list(df[0])))
command
cut -f 1 popular-names.txt | sort | uniq
`uniq` only removes adjacent duplicate lines, so the input needs to be sorted in advance.
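Since a Python set has no defined order while sort | uniq produces sorted output, sorting the values (here via pandas' unique()) makes the two results easier to compare:
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
# sorted unique values, directly comparable to the `cut | sort | uniq` output
print(sorted(df[0].unique()))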
Sort the lines in descending order of the numeric values in the third column (note: rearrange the lines without changing the content of each line). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.sort_values(2, ascending=False))
command
sort -n -r -k 3 popular-names.txt | head -10
Only the first 10 lines of the output are shown by piping through `head -10`.
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
code
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df[0].value_counts())
command
cut -f 1 popular-names.txt | sort | uniq -c | sort -n -r -k 1 | head -10
As before, `head -10` is only there to limit the output to the first 10 lines.
What you can learn in Chapter 2: basic DataFrame operations with pandas, and UNIX commands such as wc, sed, tr, expand, cut, paste, head, tail, split, sort, and uniq.