[PYTHON] I tried 100 language processing knock 2020: Chapter 2

Introduction

I worked through Language Processing 100 Knock 2020. You can find links to the other chapters from here, and the source code from here. (I have not verified the confirmation steps that use UNIX commands.)

Chapter 2

No.10 Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

Answer

010.py


path = "popular-names.txt"
with open(path) as file:
    print(len(file.readlines()))

# -> 2780
Comments

I used a `with` block because it is tedious to write `close()` at the end of every file operation.
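As a side note, `readlines()` loads the whole file into memory at once; since a file object is itself an iterable of lines, the count can also be taken lazily. A minimal sketch, using a small in-memory sample in place of popular-names.txt:

```python
import io

# Small in-memory stand-in for popular-names.txt (assumed sample data)
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"

with io.StringIO(sample) as file:
    # Summing over a generator counts lines without building a list
    n_lines = sum(1 for _ in file)

print(n_lines)  # -> 3
```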

No.11 Convert tabs to spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

Answer

011.py


path = "popular-names.txt"
with open(path) as file:
    print(file.read().replace("\t", " "), end="")

# -> Mary F 7065 1880 
#    Anna F 2604 1880 
#    Emma F 2003 1880 ...
Comments

I converted `\t` to a space with `replace()`. If you print the result as-is, each line break produces a blank line, because the `print` function appends `\n` on top of the `\n` already at the end of each text line. The `end` option of `print` lets you replace that appended `\n`, so I used it here.
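To illustrate the `end` behavior on a single line (the sample line is an assumption, not the full file):

```python
line = "Mary\tF\t7065\t1880\n"
converted = line.replace("\t", " ")

# print normally appends "\n"; end="" avoids doubling the newline
# that is already at the end of the line
print(converted, end="")
```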

No.12 Save the first column in col1.txt and the second column in col2.txt

Save a version of the file containing only the first column of each row as col1.txt, and a version containing only the second column as col2.txt. Use the cut command for confirmation.

Answer

012.py


path = "popular-names.txt"
path_col1 = "col1_012.txt"
path_col2 = "col2_012.txt"

with open(path) as file:
    with open(path_col1, mode="w") as col1:
        with open(path_col2, mode="w") as col2:
            item_split = [item.split("\t") for item in file.readlines()]
            for item in item_split:
                col1.write(item[0] + "\n")
                col2.write(item[1] + "\n")

# col1.txt
# -> Mary
#    Anna...
# col2.txt
# -> F
#    F...
Comments

The file operation is specified by the `mode` argument of `open()`. The default is `mode='r'`, but maybe it is better to write it out explicitly rather than omit it...?
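Incidentally, the three nested `with` blocks can be flattened: since Python 3.1, a single `with` statement can manage several context managers. A sketch using in-memory stand-ins for the real files:

```python
import io

# In-memory stand-in for popular-names.txt (assumed sample data)
src = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n")

# One with statement can manage all three files at once
with src as file, io.StringIO() as c1, io.StringIO() as c2:
    for line in file:
        fields = line.split("\t")
        c1.write(fields[0] + "\n")
        c2.write(fields[1] + "\n")
    out1, out2 = c1.getvalue(), c2.getvalue()

print(out1, end="")  # -> Mary / Anna
print(out2, end="")  # -> F / F
```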

No.13 Merge col1.txt and col2.txt

Combine col1.txt and col2.txt created in Problem 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

Answer

013.py


path_col1 = "col1_012.txt"
path_col2 = "col2_012.txt"
path_merge = "merge.txt"

with open(path_col1) as col1:
    col1_list = col1.readlines()
    with open(path_col2) as col2:
        col2_list = col2.readlines()
        with open(path_merge, mode="w") as mrg:
            for i in range(len(col1_list)):
                mrg.write(col1_list[i].replace("\n", "") + "\t" + col2_list[i])

# merge.txt
# -> Mary	F
#    Anna	F
#    Emma	F
Comments

Other people's answers used `zip()` to generate the merged file. My answer runs into trouble when `len(col1_list) > len(col2_list)`, so the `zip()` approach is smarter.
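The `zip()` approach can be sketched like this; it stops at the shorter list, so mismatched lengths never raise an `IndexError` (the sample lists are assumptions):

```python
col1_list = ["Mary\n", "Anna\n", "Emma\n"]
col2_list = ["F\n", "F\n"]  # deliberately one element shorter

# zip() stops at the shorter iterable, so no IndexError can occur
merged = "".join(a.rstrip("\n") + "\t" + b
                 for a, b in zip(col1_list, col2_list))
print(merged, end="")
```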

No.14 Output N lines from the beginning

Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.

Answer

014.py


import sys

N = int(sys.argv[2])
with open(sys.argv[1]) as file:
    for i in range(N):
        print(file.readline().replace("\n",""))

# python 014.py popular-names.txt 3
# -> Mary    F       7065    1880
#    Anna    F       2604    1880
#    Emma    F       2003    1880
Comments

It turns out you can get the command-line arguments as a list through `sys.argv` in the `sys` module.
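A variation on the loop above: `itertools.islice` yields at most N lines, so it behaves sensibly even when the file has fewer than N lines (an in-memory sample stands in for the real file):

```python
from itertools import islice
import io

N = 2
# In-memory stand-in for the input file (assumed sample data)
file = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n")

# islice stops after N lines without reading the rest of the file
head = [line.rstrip("\n") for line in islice(file, N)]
print("\n".join(head))
```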

No.15 Output N lines at the end

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

Answer

015.py


import pandas as pd

path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df.tail())

# ->              0  1      2     3
#    2775  Benjamin  M  13381  2018
#    2776    Elijah  M  12886  2018
#    2777     Lucas  M  12585  2018
#    2778     Mason  M  12435  2018
#    2779     Logan  M  12352  2018

Comments

For those who already know it this is nothing new, but there is a library called pandas that is convenient for data processing, so I tried it. `read_csv(path, sep="\t")` would also work, but `read_table` is simpler, isn't it?
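Without pandas, the tail can also be taken in one streaming pass: `collections.deque` with `maxlen=N` keeps only the last N lines. A sketch with assumed sample data:

```python
from collections import deque
import io

N = 2
# In-memory stand-in for the input file (assumed sample data)
file = io.StringIO("Mary\nAnna\nEmma\nElizabeth\n")

# A bounded deque discards old lines as new ones arrive,
# so memory use stays at N lines regardless of file size
tail = deque(file, maxlen=N)
print("".join(tail), end="")
```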

No.16 Divide the file into N

Receive a natural number N by means such as a command-line argument, and split the input file line by line into N pieces. Achieve the same processing with the split command.

Answer

016.py


import pandas as pd
import sys

N = int(sys.argv[1])
path = "popular-names.txt"
df = pd.read_table(path, header=None)
col_n = -(-len(df) // N)
for i in range(N):
    print(df.iloc[col_n * i : col_n * (i + 1), :])

# python 016.py 2
# ->               0  1      2     3
#    0          Mary  F   7065  1880
#    1          Anna  F   2604  1880
#    ...         ... ..    ...   ...
#    1389     Sharon  F  25711  1949
#
#    [1390 rows x 4 columns]
#                  0  1      2     3
#     1390     James  M  86857  1949
#     1391    Robert  M  83872  1949
#     ...        ... ..    ...   ...
#     2779     Logan  M  12352  2018
#
#    [1390 rows x 4 columns]

Comments
`col_n = -(-len(df) // N)` calculates the ceiling of `len(df) / N`. Using `math.ceil()` would be more intuitive, but it is good to know that this kind of notation also works.

For the output, I used `iloc` because I wanted to select multiple rows of `df` by index.
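The negation trick and `math.ceil()` do agree for positive integers: floor division rounds toward negative infinity, so negating before and after turns it into rounding up.

```python
import math

# -(-a // b) == ceil(a / b): floor division rounds toward negative
# infinity, so negating twice rounds up instead of down
for a, b in [(2780, 2), (2781, 2), (7, 3), (9, 3)]:
    assert -(-a // b) == math.ceil(a / b)

print(-(-2781 // 2))  # -> 1391
```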

No.17 Overlapping of character strings in the first column

Find the distinct strings in the first column (the set of unique strings). Use the cut, sort, and uniq commands for confirmation.

Answer

017.py


import pandas as pd

path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df[0].unique())

# -> ['Mary' 'Anna' 'Emma' 'Elizabeth' 'Minnie' 'Margaret' 'Ida' 'Alice'...
Comments

`unique()` returns the distinct values as a NumPy `ndarray`. The number of unique elements can be obtained with `df[0].nunique()` in addition to `len(df[0].unique())`.
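For comparison, plain Python gets the same set of distinct values without pandas, using a set comprehension over the first field of each row (the sample rows are assumptions):

```python
rows = ["Mary\tF", "Anna\tF", "Mary\tF", "Emma\tF"]

# A set collects each distinct first-column value exactly once
names = {row.split("\t")[0] for row in rows}
print(sorted(names))  # -> ['Anna', 'Emma', 'Mary']
```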

No.18 Sort each row in descending order of the numerical value in the third column

Arrange the lines in descending order of the numerical value in the third column (note: leave the contents of each line unchanged). Use the sort command for confirmation (this problem does not have to match the result of running the command).

Answer

018.py


import pandas as pd

path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df.sort_values(2, ascending=False))

# ->            0  1      2     3
#   1340    Linda  F  99689  1947
#   1360    Linda  F  96211  1948
#   1350    James  M  94757  1947...
Comments
`sort_values` is available on both `pandas.DataFrame` and `pandas.Series`, which makes sorting very easy and convenient. Also, as a personal preference, I'm in the camp that says "column".
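The same descending sort can be done in plain Python with `sorted()` and a key that parses the third tab-separated column as an integer (the sample rows are assumptions):

```python
rows = [
    "Anna\tF\t2604\t1880",
    "Mary\tF\t7065\t1880",
    "Emma\tF\t2003\t1880",
]

# Sort by the numeric value of the third column, descending
ordered = sorted(rows, key=lambda r: int(r.split("\t")[2]), reverse=True)
print(ordered[0])
```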

No.19 Find the frequency of appearance of the character string in the first column of each line, and arrange them in descending order of frequency of appearance.

Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.

Answer

019.py


import pandas as pd

path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df[0].value_counts())

# -> James      118
#    William    111
#    John       108
Comments

`value_counts()` outputs the unique elements and their counts as a `pandas.Series`. It is easy to mix these up: `unique()` gives a list of the unique elements, `nunique()` gives the total number of unique elements, and `value_counts()` gives the frequency of each element.
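`collections.Counter` is the plain-Python counterpart of `value_counts()`: `most_common()` returns (element, count) pairs in descending order of frequency (the sample names are assumptions):

```python
from collections import Counter

names = ["James", "Mary", "James", "Anna", "James", "Mary"]

# most_common() sorts (element, count) pairs by count, descending
freq = Counter(names).most_common()
print(freq)  # -> [('James', 3), ('Mary', 2), ('Anna', 1)]
```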

reference

Frequent Pandas basic operations in data analysis
upura/nlp100v2020: Solve "100 Language Processing Knock 2020" with Python
Amateur language processing 100 knock summary
