Rehabilitation of Python and NLP skills starting with "100 Language Processing Knock 2015" (Chapter 2 second half)

16. Divide the file into N

Receive a natural number N by means such as a command-line argument, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.

Imports and argparse setup are omitted. If the number of lines M in the file is not evenly divisible by the given natural number N, the specification here is to give one extra line to each of the first parts, in order.
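
For example, with M = 24 lines and N = 7 parts, 24 mod 7 = 3, so the first 3 parts get 4 lines each and the remaining 4 parts get 3 lines each. A minimal sketch of just this distribution rule (the values 24 and 7 are only for illustration):

M, N = 24, 7
quotient, remainder = divmod(M, N)
num_of_lines = [quotient + 1 if i < remainder else quotient for i in range(N)]
print(num_of_lines)  # [4, 4, 4, 3, 3, 3, 3]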

knock016.py


args = parser.parse_args()
N = args.line
filename = args.filename

#Read all lines of the input file
f = open(filename)
lines = f.readlines()
M = len(lines)

#Quotient and remainder (integer division, since this is Python 2)
quotient = M / N
remainder = M - quotient * N

#Find the line that splits the file
num_of_lines = [quotient+1 if i < remainder else quotient for i in xrange(N)]
num_of_lines_cumulative = [sum(num_of_lines[:i+1]) for i in xrange(N)]

for i, line in enumerate(lines):
	if i in num_of_lines_cumulative:
		#A new part starts at this line, so print a blank line as a separator
		print
		print line.strip()
	else:
		print line.strip()

f.close()
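
For reference, the omitted argparse setup was presumably something like the following; this is a guess that is consistent with args.line and args.filename, and the option names are assumptions:

import argparse

#Assumed option names; the original post omits this setup
parser = argparse.ArgumentParser(description='Divide a file into N parts')
parser.add_argument('-l', '--line', type=int, required=True, help='number of parts N')
parser.add_argument('-f', '--filename', required=True, help='input file')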

Now the UNIX command version. After adding option validation (though still not thorough), the script got rather long.

knock016.sh


#!/bin/sh

#Receive the natural number N by means such as command line arguments, and divide the input file into N line by line.
#Achieve the same processing with the split command.
# ex.
# sh knock016.sh -f hightemp.txt -n 7

while getopts f:n: OPT
do
  case $OPT in
    "f" ) FLG_F="TRUE" ; INPUT_FILE=$OPTARG ;;
    "n" ) FLG_N="TRUE" ; N=$OPTARG ;;
      * ) echo "Usage: $CMDNAME [-f file name] [-n split number]" 1>&2
          exit 1 ;;
  esac
done

if [ ! "$FLG_F" = "TRUE" ]; then
  echo 'file name is not set.'
  exit 1
fi
if [ ! "$FLG_N" = "TRUE" ]; then
  echo 'split number is not set.'
  exit 1
fi

#INPUT_FILE="hightemp.txt"
TMP_HEAD="split/tmphead.$INPUT_FILE"
TMP_TAIL="split/tmptail.$INPUT_FILE"
SPLITHEAD_PREFIX="split/splithead."
SPLITTAIL_PREFIX="split/splittail."

M=$( wc -l < $INPUT_FILE )
#N=9
quotient=`expr \( $M / $N \)`
remainder=`expr \( $M - $quotient \* $N \)`

if [ $quotient -eq 0 ]; then
  echo "cannot divide: N is larger than the number of lines in the input file."
  exit 1
fi

if [ $remainder -eq 0 ]; then
  #If the remainder is 0, split so that each file contains $quotient lines
  split -l $quotient $INPUT_FILE $SPLITHEAD_PREFIX
else
  #If the remainder is non-zero, split the file into two temporary parts:
  # (a) the first (($quotient + 1) * $remainder) lines, and (b) everything after that
  split_head=`expr \( \( $quotient + 1 \) \* $remainder \)`
  split_tail=`expr \( $M - $split_head \)`
  head -n $split_head $INPUT_FILE > $TMP_HEAD
  tail -n $split_tail $INPUT_FILE > $TMP_TAIL

  #Split so that each file from (a) contains ($quotient + 1) lines and each file from (b) contains $quotient lines
  split -l `expr \( $quotient + 1 \)` $TMP_HEAD $SPLITHEAD_PREFIX
  split -l $quotient $TMP_TAIL $SPLITTAIL_PREFIX

  rm -iv split/tmp*

fi

Since split is invoked by specifying the number of lines per output file rather than the number of output files, a little ingenuity was needed.
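
Incidentally, GNU coreutils split can also do this directly with -n l/N, which produces N chunks without breaking lines (availability depends on your coreutils version, and the resulting line distribution is byte-based, so it need not match the hand-rolled rule above):

split -n l/7 hightemp.txt split/splitnum.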

17. Distinct strings in the first column

Find the set of distinct strings (the different string types) in the first column. Use the sort and uniq commands for confirmation.

python


if __name__ == '__main__':

	f = open(filename)
	lines = f.readlines()

	# unlike problem 12, appending '\n' is not necessary here
	content_col1 = [line.split()[0] for line in lines]
	content_col1_set = set(content_col1)
	print len(content_col1_set)

	for x in content_col1_set:
		print x

	f.close()

#>>>
#12
#Aichi Prefecture
#Yamagata Prefecture
#Gifu Prefecture
#Chiba Prefecture
#Saitama Prefecture
#Kochi Prefecture
#Gunma Prefecture
#Yamanashi Prefecture
#Wakayama Prefecture
#Ehime Prefecture
#Osaka
#Shizuoka Prefecture

Now the UNIX command version. (Does the output have to be in the same order...?)

sh


awk -F'\t' '{print $1;}' hightemp.txt | sort | uniq
#>>>
#Chiba Prefecture
#Wakayama Prefecture
#Saitama Prefecture
#Osaka
#Yamagata Prefecture
#Yamanashi Prefecture
#Gifu Prefecture
#Ehime Prefecture
#Aichi Prefecture
#Gunma Prefecture
#Shizuoka Prefecture
#Kochi Prefecture
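
If the two outputs should line up, sorting both sides works; for example, sort the set on the Python side as well (a sketch reusing content_col1_set from the script above):

for x in sorted(content_col1_set):
	print(x)

On the command-line side, sort -u would also collapse the duplicates in a single step.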

18. Sort each row in descending order of the numbers in the third column

Sort the lines in descending order of the numbers in the third column (note: rearrange the lines without changing the contents of each line). Use the sort command for confirmation (this problem's result does not have to match the command's output exactly).

python


if __name__ == '__main__':

	f = open(filename)
	lines = f.readlines()
	# reverse=True allows us to perform descending sort
	sorted_lines = sorted(lines, key=lambda line: float(line.split()[2]), reverse=True)

	for sorted_line in sorted_lines:
		print sorted_line,

	f.close()

#>>>
#Kochi Prefecture	Ekawasaki	41	2013-08-12
#Saitama Prefecture	Kumagaya	40.9	2007-08-16
#Gifu Prefecture	Tajimi	40.9	2007-08-16
#Yamagata Prefecture	Yamagata	40.8	1933-07-25
#Yamanashi Prefecture	Kofu	40.7	2013-08-10
#Wakayama Prefecture	Katsuragi	40.6	1994-08-08
#Shizuoka Prefecture	Tenryu	40.6	1994-08-04
#Yamanashi Prefecture	Katsunuma	40.5	2013-08-10
#Saitama Prefecture	Koshigaya	40.4	2007-08-16
#Gunma Prefecture	Tatebayashi	40.3	2007-08-16
#Gunma Prefecture	Kamisatomi	40.3	1998-07-04
#Aichi Prefecture	Aisai	40.3	1994-08-05
#Chiba Prefecture	Ushiku	40.2	2004-07-20
#Shizuoka Prefecture	Sakuma	40.2	2001-07-24
#Ehime Prefecture	Uwajima	40.2	1927-07-22
#Yamagata Prefecture	Sakata	40.1	1978-08-03
#Gifu Prefecture	Mino	40	2007-08-16
#Gunma Prefecture	Maebashi	40	2001-07-24
#Chiba Prefecture	Mobara	39.9	2013-08-11
#Saitama Prefecture	Hatoyama	39.9	1997-07-05
#Osaka	Toyonaka	39.9	1994-08-08
#Yamanashi Prefecture	Otsuki	39.9	1990-07-19
#Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
#Aichi Prefecture	Nagoya	39.9	1942-08-02

UNIX command.

sh


sort -k3r hightemp.txt

Specify the key column with the -k option; appending r reverses the order.
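
For a strictly numeric sort restricted to the third field, something like the following should be closer, although the task says the results do not have to match:

sort -k3,3 -n -r hightemp.txt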

19. Find the frequency of appearance of the character string in the first column of each line, and arrange them in descending order of frequency of appearance.

Find the frequency of occurrence of the character string in the first column of each line, and display them in descending order. Use the cut, uniq, and sort commands for confirmation.

python


from collections import defaultdict
from collections import Counter

...

if __name__ == '__main__':

	f = open(filename)
	lines = f.readlines()

	# extract 1st column
	content_col1 = [line.split()[0] for line in lines]
	
	# (1) defaultdict
	# http://docs.python.jp/2/library/collections.html#collections.defaultdict
	d = defaultdict(int)
	for col1 in content_col1:
		d[col1] += 1
	for word, cnt in sorted(d.items(), key=lambda x: x[1], reverse=True):
		print word, cnt

	print

	# (2) Counter
	# http://docs.python.jp/2/library/collections.html#collections.Counter
	counter = Counter(content_col1)
	for word, cnt in counter.most_common():
		print word, cnt

	f.close()

#>>>
#Yamagata Prefecture 3
#Saitama Prefecture 3
#Gunma Prefecture 3
#Yamanashi Prefecture 3
#Aichi Prefecture 2
#Gifu Prefecture 2
#Chiba Prefecture 2
#Shizuoka Prefecture 2
#Kochi Prefecture 1
#Wakayama Prefecture 1
#Ehime Prefecture 1
#Osaka 1

#Yamagata Prefecture 3
#Saitama Prefecture 3
#Gunma Prefecture 3
#Yamanashi Prefecture 3
#Aichi Prefecture 2
#Gifu Prefecture 2
#Chiba Prefecture 2
#Shizuoka Prefecture 2
#Kochi Prefecture 1
#Wakayama Prefecture 1
#Ehime Prefecture 1
#Osaka 1

You can either tally with a defaultdict as in (1), or hand the list to a Counter as in (2); Counter comes with the convenient most_common() method.
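
Incidentally, most_common() also accepts an optional limit, which is handy when only the top entries are needed (a quick illustration reusing the counter from the script above; the order among ties may vary):

print(counter.most_common(3))
#e.g. [('Yamagata Prefecture', 3), ('Saitama Prefecture', 3), ('Gunma Prefecture', 3)]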

Then the UNIX command version.

sh


cut -f 1 hightemp.txt | sort | uniq -c | sort -nr
#>>>
#3 Gunma Prefecture
#3 Yamanashi Prefecture
#3 Yamagata Prefecture
#3 Saitama Prefecture
#2 Shizuoka Prefecture
#2 Aichi Prefecture
#2 Gifu Prefecture
#2 Chiba Prefecture
#1 Kochi Prefecture
#1 Ehime Prefecture
#1 Osaka
#1 Wakayama Prefecture

This is an idiom-like pipeline I use often, so I want to remember it well: sort groups identical lines next to each other, uniq collapses adjacent duplicate lines, its -c option prefixes each line with the number of duplicates, and sort -nr then sorts those counts as numbers in descending order.
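
Spelled out stage by stage:

cut -f 1 hightemp.txt | sort                       #group identical strings onto adjacent lines
cut -f 1 hightemp.txt | sort | uniq -c             #collapse each run and prefix it with its count
cut -f 1 hightemp.txt | sort | uniq -c | sort -nr  #order by count, descending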
