[PYTHON] 100 language processing knocks (2020): 10-19

Chapter 2: UNIX Commands

10. Counting the number of lines


wc popular-names.txt 
    # 2780   11120   55026 popular-names.txt
wc -l popular-names.txt 
    # 2780 popular-names.txt
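
For comparison, a minimal Python sketch of the same count. The function names `count_lines` and `count_lines_in_file` are made up for this illustration; like `wc -l`, the sketch counts newline characters, so a final line without a trailing newline is not counted.

```python
def count_lines(text: str) -> int:
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")


def count_lines_in_file(path: str) -> int:
    """Read the file at `path` and count its lines (hypothetical helper)."""
    # newline="" keeps line endings untouched while reading.
    with open(path, encoding="utf-8", newline="") as f:
        return count_lines(f.read())
```

Calling `count_lines_in_file("popular-names.txt")` should agree with the `wc -l` figure above.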

11. Replace tabs with spaces

# GNU sed and the sed shipped with macOS behave differently: https://unix.stackexchange.com/questions/13711/differences-between-sed-on-mac-osx-and-other-standard-sed
# GNU sed interprets escape sequences such as \t, \n, \001, \x01, \w, and \b.
# macOS sed and POSIX sed only interpret \n (and not in the replacement part of s).

# On macOS, install GNU sed so that \t works in method 1: https://medium.com/@bramblexu/install-gnu-sed-on-mac-os-and-set-it-as-default-7c17ef1b8f64
# > overwrites (“clobbers”) a file; >> appends to it.


# sed method 1
sed -E 's/\t/ /g' popular-names.txt > popular-names-space.txt

# sed method 2 ([[:space:]] is a POSIX character class; it matches every whitespace character, not only tabs)
sed -E 's/[[:space:]]/ /g' popular-names.txt > popular-names-space.txt

# tr method (without -s, which would also squeeze runs of spaces into a single space)
tr '\t' ' ' < popular-names.txt > popular-names-space.txt

# expand method
expand -t 1 popular-names.txt > popular-names-space.txt 
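
A Python sketch of the same replacement, for comparison. `tabs_to_spaces` is a hypothetical name and the sample row is only an illustrative tab-separated line. Each tab becomes exactly one space, so two consecutive tabs yield two spaces.

```python
def tabs_to_spaces(text: str) -> str:
    """Replace every tab character with a single space."""
    return text.replace("\t", " ")


# Illustrative tab-separated row (assumed layout, not read from the file).
sample = "Mary\tF\t7065\t1880\n"
print(tabs_to_spaces(sample), end="")  # Mary F 7065 1880
```

`str.expandtabs(1)` would give the same result here, since a tab size of 1 always expands a tab to one space.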

12. Save the first column in col1.txt and the second column in col2.txt

cut -d $'\t' -f1 popular-names.txt > col1.txt
cut -d $'\t' -f2 popular-names.txt > col2.txt

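# Alternative: take the same columns from the space-separated file (this overwrites col1.txt and col2.txt)
cut -d ' ' -f1 popular-names-space.txt > col1.txt
cut -d ' ' -f2 popular-names-space.txt > col2.txt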
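
A rough Python equivalent of `cut -f`, as a sketch. `column` is a hypothetical helper and the rows are illustrative. Unlike `cut`, which prints a line unchanged when it contains no delimiter, this version returns an empty string whenever the requested field is missing.

```python
def column(lines, index, sep="\t"):
    """Return the field at position `index` (0-based) from each line."""
    result = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Missing fields become empty strings instead of raising IndexError.
        result.append(fields[index] if index < len(fields) else "")
    return result


# Illustrative rows in a tab-separated layout.
rows = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n"]
col1 = column(rows, 0)  # first column
col2 = column(rows, 1)  # second column
```

Writing `"\n".join(col1) + "\n"` to `col1.txt` would then mirror the redirect used above.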

13. Merge col1.txt and col2.txt

paste col1.txt col2.txt > merge_test.txt
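
In Python the merge can be sketched with `zip_longest`. The name `paste` is just borrowed from the command for this illustration. When one input is shorter, the missing side is filled with an empty string, which is how `paste` treats an exhausted file.

```python
from itertools import zip_longest


def paste(left, right, sep="\t"):
    """Join corresponding lines of two sequences with `sep`."""
    return [f"{a}{sep}{b}" for a, b in zip_longest(left, right, fillvalue="")]


# Illustrative column contents (not read from col1.txt / col2.txt).
merged = paste(["Mary", "Anna"], ["F", "F"])
```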

14. Output N lines from the beginning

head -n 10 merge_test.txt
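
A Python sketch of `head`, assuming the input is any iterable of lines (an open file object works). `islice` stops after `n` items, so the rest of the file is never read. The function name is hypothetical.

```python
from itertools import islice


def head(lines, n=10):
    """Return the first `n` items of an iterable of lines."""
    return list(islice(lines, n))
```

For example, `with open("merge_test.txt") as f: first = head(f, 10)` would collect the same ten lines as the command above.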

15. Output the last N lines

tail -n 10 merge_test.txt
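
A Python sketch of `tail`. A `deque` with `maxlen=n` discards older items as new ones arrive, so only the last `n` lines are held in memory. The function name is hypothetical.

```python
from collections import deque


def tail(lines, n=10):
    """Return the last `n` items of an iterable of lines."""
    return list(deque(lines, maxlen=n))
```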

16. Divide the file into N

# This solution targets macOS, whose split does not support the -d and -n options.
# GNU split supports both.


# -l: line_count. Each output file contains at most 500 lines.
# -a: suffix_length. With -a 1 the suffixes are single letters (a, b, c, ...). Numeric names such as split01.txt and split02.txt require GNU split.
# split-: prefix of the output file names

split -l 500 -a 1 popular-names.txt split-

# split-a
# split-b
# split-c
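
The arithmetic behind `-l` can be sketched in Python. Both helper names are hypothetical. `lines_per_part` turns a target number of parts N into a line count per part, and `chunk` cuts a list of lines into pieces of that size, with a shorter final piece.

```python
import math


def lines_per_part(total_lines: int, n_parts: int) -> int:
    """Smallest chunk size that fits `total_lines` into `n_parts` parts."""
    return math.ceil(total_lines / n_parts)


def chunk(lines, size):
    """Cut `lines` into consecutive pieces of at most `size` items."""
    lines = list(lines)
    return [lines[i:i + size] for i in range(0, len(lines), size)]


# 2780 is the line count reported by wc -l in section 10.
parts = chunk(range(2780), 500)
print(len(parts), len(parts[-1]))  # 6 280
```

With 2780 lines, `-l 500` therefore produces six files, the last one holding the remaining 280 lines.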

17. Distinct strings in the first column

cut -d $'\t' -f1 popular-names.txt | sort | uniq > unique_names.txt
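
In Python, the `sort | uniq` step corresponds to building a set and sorting it. `unique_names` is a hypothetical helper and the input names are illustrative. Python compares strings by code point, so the order can differ from `sort` under a non-C locale.

```python
def unique_names(names):
    """Return the distinct names in sorted order."""
    return sorted(set(names))


# Illustrative first-column values.
names = ["Mary", "Anna", "Mary", "John", "Anna"]
print(unique_names(names))  # ['Anna', 'John', 'Mary']
```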

18. Sort each row in descending order of the numbers in the third column

sort -k 3 -n -r popular-names.txt > popular-names-sorted.txt
# -k 3: sort by the 3rd column (the key runs from field 3 to the end of the line)
# -n: numeric sort
# -r: reverse order
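
A Python sketch of the same sort, assuming tab-separated rows whose third field is numeric. The function name and the sample rows are illustrative. Python's sort is stable, so rows with equal values keep their input order, which may differ from how `sort` breaks ties.

```python
def sort_by_third_column(lines):
    """Sort lines by the numeric value of their third tab-separated field, largest first."""
    return sorted(lines, key=lambda line: float(line.split("\t")[2]), reverse=True)


# Illustrative rows (assumed layout: name, sex, count, year).
rows = [
    "Mary\tF\t7065\t1880",
    "Anna\tF\t2604\t1880",
    "John\tM\t9655\t1880",
]
ordered = sort_by_third_column(rows)
```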

19. Find how often each string appears in the first column, and list the strings in descending order of frequency

cut -d $'\t' -f1 popular-names.txt | sort | uniq -c | sort -k1nr > name_frequency.txt

# cut -d $'\t' -f1 popular-names.txt | sort | uniq -c 
#   17 Abigail
#    3 Aiden
#    8 Alexander
#    8 Alexis

# uniq -c 
# -c: prefix each distinct name with the number of times it occurs

# sort -k1nr
#   -k1: sort by first column
#   n: numeric sort
#   r: descending order
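
The whole `sort | uniq -c | sort -k1nr` pipeline maps onto `collections.Counter` in Python. This is a sketch with a hypothetical function name and illustrative input. `most_common()` returns (name, count) pairs ordered by descending count, and names with equal counts stay in the order they were first seen.

```python
from collections import Counter


def name_frequency(names):
    """Return (name, count) pairs, most frequent first."""
    return Counter(names).most_common()


# Illustrative first-column values.
names = ["James", "Mary", "James", "John", "James", "Mary"]
for name, count in name_frequency(names):
    print(count, name)
```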
