[PYTHON] 100 amateur language processing knocks: 16

This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 2: UNIX Command Basics

hightemp.txt is a file that stores records of the highest temperatures in Japan in a tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create a program that performs the following processing and run it with hightemp.txt as the input file. In addition, perform the same processing with UNIX commands and check the program's output against the result.

16. Divide the file into N

Receive a natural number N by some means such as a command-line argument, and split the input file into N parts at line boundaries. Achieve the same processing with the split command.

The finished code:

main.py


# coding: utf-8
import math

fname = 'hightemp.txt'
n = int(input('N--> '))

# Read the whole file; it is small enough to hold in memory
with open(fname) as data_file:
	lines = data_file.readlines()

count = len(lines)
unit = math.ceil(count / n)  # Number of lines per file (rounded up)

# Write each block of `unit` lines to child_01.txt, child_02.txt, ...
for i, offset in enumerate(range(0, count, unit), 1):
	with open('child_{:02d}.txt'.format(i), mode='w') as out_file:
		out_file.writelines(lines[offset:offset + unit])

Execution result:

As an example, here is the result for N = 5. There are 24 lines in total, so splitting into 5 gives 5 lines per file, with only the last file holding 4 lines.
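
The arithmetic above can be checked with a short sketch. The helper name `chunk_sizes` is hypothetical and not part of main.py; it only mirrors the same `math.ceil` logic and reports how many lines each output file would receive.

```python
import math


def chunk_sizes(count, n):
    """Return the line count of each output file when `count` lines are
    cut into blocks of ceil(count / n) lines, as main.py does."""
    unit = math.ceil(count / n)
    # The last block may be shorter than `unit`
    return [min(unit, count - offset) for offset in range(0, count, unit)]


print(chunk_sizes(24, 5))  # [5, 5, 5, 5, 4]
```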

child_01.txt


Kochi Prefecture	Ekawasaki	41	2013-08-12
Saitama Prefecture	Kumagaya	40.9	2007-08-16
Gifu Prefecture	Tajimi	40.9	2007-08-16
Yamagata Prefecture	Yamagata	40.8	1933-07-25
Yamanashi Prefecture	Kofu	40.7	2013-08-10

child_02.txt


Wakayama Prefecture	Katsuragi	40.6	1994-08-08
Shizuoka Prefecture	Tenryu	40.6	1994-08-04
Yamanashi Prefecture	Katsunuma	40.5	2013-08-10
Saitama Prefecture	Koshigaya	40.4	2007-08-16
Gunma Prefecture	Tatebayashi	40.3	2007-08-16

child_03.txt


Gunma Prefecture	Kamisatomi	40.3	1998-07-04
Aichi Prefecture	Aisai	40.3	1994-08-05
Chiba Prefecture	Ushiku	40.2	2004-07-20
Shizuoka Prefecture	Sakuma	40.2	2001-07-24
Ehime Prefecture	Uwajima	40.2	1927-07-22

child_04.txt


Yamagata Prefecture	Sakata	40.1	1978-08-03
Gifu Prefecture	Mino	40	2007-08-16
Gunma Prefecture	Maebashi	40	2001-07-24
Chiba Prefecture	Mobara	39.9	2013-08-11
Saitama Prefecture	Hatoyama	39.9	1997-07-05

child_05.txt


Osaka Prefecture	Toyonaka	39.9	1994-08-08
Yamanashi Prefecture	Otsuki	39.9	1990-07-19
Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
Aichi Prefecture	Nagoya	39.9	1942-08-02

Shell script for UNIX command confirmation:

test.sh


#!/bin/sh

#Enter N
echo -n "N--> "
read n

# Count the lines. wc prints the line count followed by the file name, so keep only the count with cut
count=`wc --lines hightemp.txt | cut --fields=1 --delimiter=" "`

# Lines per file: integer division, plus 1 if there is a remainder (i.e. round up)
unit=`expr $count / $n`
remainder=`expr $count % $n`
if [ $remainder -gt 0 ]; then
	unit=`expr $unit + 1`
fi

#Split
split --lines=$unit --numeric-suffixes=1 --additional-suffix=.txt hightemp.txt child_test_

#Verification
for i in `seq 1 $n`
do
	fname=`printf child_%02d.txt $i`
	fname_test=`printf child_test_%02d.txt $i`
	diff --report-identical-files $fname $fname_test
done
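
The verification loop at the end of test.sh could also be sketched in Python with the standard `filecmp` module. This is only an illustration, not part of the original solution: the function name `compare_outputs` and its status strings are made up here, and the file names follow the `child_XX.txt` / `child_test_XX.txt` pattern used above.

```python
import filecmp
import os


def compare_outputs(n, directory='.'):
    """Compare child_XX.txt with child_test_XX.txt for XX = 01..n.

    Returns a list of (index, status) tuples, where status is
    'identical', 'differ' or 'missing'.
    """
    results = []
    for i in range(1, n + 1):
        a = os.path.join(directory, 'child_{:02d}.txt'.format(i))
        b = os.path.join(directory, 'child_test_{:02d}.txt'.format(i))
        if not (os.path.exists(a) and os.path.exists(b)):
            results.append((i, 'missing'))
        elif filecmp.cmp(a, b, shallow=False):  # compare contents, not just metadata
            results.append((i, 'identical'))
        else:
            results.append((i, 'differ'))
    return results


if __name__ == '__main__':
    for index, status in compare_outputs(5):
        print('{:02d}: {}'.format(index, status))
```

Passing `shallow=False` makes `filecmp.cmp` compare the file contents, which is closer to what diff does than the default metadata check.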

Confirmation of results:

Here are some results.

When N = 1:

Terminal


segavvy@ubuntu:~/document/100 language processing knock 2015/16$ python main.py 
N--> 1
segavvy@ubuntu:~/document/100 language processing knock 2015/16$ ./test.sh
N--> 1
Files child_01.txt and child_test_01.txt are identical

For N = 2:

Terminal


segavvy@ubuntu:~/document/100 language processing knock 2015/16$ python main.py 
N--> 2
segavvy@ubuntu:~/document/100 language processing knock 2015/16$ ./test.sh
N--> 2
Files child_01.txt and child_test_01.txt are identical
Files child_02.txt and child_test_02.txt are identical

For N = 5:

Terminal


segavvy@ubuntu:~/document/100 language processing knock 2015/16$ python main.py 
N--> 5
segavvy@ubuntu:~/document/100 language processing knock 2015/16$ ./test.sh
N--> 5
Files child_01.txt and child_test_01.txt are identical
Files child_02.txt and child_test_02.txt are identical
Files child_03.txt and child_test_03.txt are identical
Files child_04.txt and child_test_04.txt are identical
Files child_05.txt and child_test_05.txt are identical

When N = 7:

Terminal


segavvy@ubuntu:~/document/100 language processing knock 2015/16$ python main.py 
N--> 7
segavvy@ubuntu:~/document/100 language processing knock 2015/16$ ./test.sh
N--> 7
Files child_01.txt and child_test_01.txt are identical
Files child_02.txt and child_test_02.txt are identical
Files child_03.txt and child_test_03.txt are identical
Files child_04.txt and child_test_04.txt are identical
Files child_05.txt and child_test_05.txt are identical
Files child_06.txt and child_test_06.txt are identical
diff: child_07.txt: No such file or directory
diff: child_test_07.txt: No such file or directory

This program produces only 6 files, so diff reports that the 7th file does not exist. The cause is the rounding-up logic: splitting 24 lines 7 ways gives 4 lines per file (24 / 7 rounded up), and 6 files of 4 lines already cover all 24 lines. Perhaps the intended answer is three 4-line files and four 3-line files. This code may be wrong ... ^^;
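
One way to always get exactly N files is to spread the remainder over the first few files instead of rounding every file up. The following is a minimal sketch of that idea, not the code used in this article; `balanced_sizes` and `split_lines` are hypothetical names, and the sample lines are dummy data standing in for hightemp.txt.

```python
def balanced_sizes(count, n):
    """Sizes of n chunks that differ by at most one line."""
    base, extra = divmod(count, n)
    # The first `extra` chunks take one additional line each
    return [base + 1 if i < extra else base for i in range(n)]


def split_lines(lines, n):
    """Cut `lines` into n consecutive chunks using balanced_sizes."""
    chunks = []
    offset = 0
    for size in balanced_sizes(len(lines), n):
        chunks.append(lines[offset:offset + size])
        offset += size
    return chunks


# Dummy 24-line input standing in for hightemp.txt
lines = ['line {}\n'.format(i) for i in range(24)]
print([len(chunk) for chunk in split_lines(lines, 7)])  # [4, 4, 4, 3, 3, 3, 3]
```

Note that output produced this way would no longer match `split --lines=$unit` in test.sh, which cuts fixed-size blocks, so the shell-side check would need a different approach as well.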

For N = 24:

Terminal


segavvy@ubuntu:~/document/100 language processing knock 2015/16$ python main.py
N--> 24
segavvy@ubuntu:~/document/100 language processing knock 2015/16$ ./test.sh
N--> 24
Files child_01.txt and child_test_01.txt are identical
Files child_02.txt and child_test_02.txt are identical
Files child_03.txt and child_test_03.txt are identical
Files child_04.txt and child_test_04.txt are identical
Files child_05.txt and child_test_05.txt are identical
Files child_06.txt and child_test_06.txt are identical
Files child_07.txt and child_test_07.txt are identical
Files child_08.txt and child_test_08.txt are identical
Files child_09.txt and child_test_09.txt are identical
Files child_10.txt and child_test_10.txt are identical
Files child_11.txt and child_test_11.txt are identical
Files child_12.txt and child_test_12.txt are identical
Files child_13.txt and child_test_13.txt are identical
Files child_14.txt and child_test_14.txt are identical
Files child_15.txt and child_test_15.txt are identical
Files child_16.txt and child_test_16.txt are identical
Files child_17.txt and child_test_17.txt are identical
Files child_18.txt and child_test_18.txt are identical
Files child_19.txt and child_test_19.txt are identical
Files child_20.txt and child_test_20.txt are identical
Files child_21.txt and child_test_21.txt are identical
Files child_22.txt and child_test_22.txt are identical
Files child_23.txt and child_test_23.txt are identical
Files child_24.txt and child_test_24.txt are identical

For N = 25:

Terminal


segavvy@ubuntu:~/document/100 language processing knock 2015/16$ python main.py
N--> 25
segavvy@ubuntu:~/document/100 language processing knock 2015/16$ ./test.sh
N--> 25
Files child_01.txt and child_test_01.txt are identical
Files child_02.txt and child_test_02.txt are identical
Files child_03.txt and child_test_03.txt are identical
Files child_04.txt and child_test_04.txt are identical
Files child_05.txt and child_test_05.txt are identical
Files child_06.txt and child_test_06.txt are identical
Files child_07.txt and child_test_07.txt are identical
Files child_08.txt and child_test_08.txt are identical
Files child_09.txt and child_test_09.txt are identical
Files child_10.txt and child_test_10.txt are identical
Files child_11.txt and child_test_11.txt are identical
Files child_12.txt and child_test_12.txt are identical
Files child_13.txt and child_test_13.txt are identical
Files child_14.txt and child_test_14.txt are identical
Files child_15.txt and child_test_15.txt are identical
Files child_16.txt and child_test_16.txt are identical
Files child_17.txt and child_test_17.txt are identical
Files child_18.txt and child_test_18.txt are identical
Files child_19.txt and child_test_19.txt are identical
Files child_20.txt and child_test_20.txt are identical
Files child_21.txt and child_test_21.txt are identical
Files child_22.txt and child_test_22.txt are identical
Files child_23.txt and child_test_23.txt are identical
Files child_24.txt and child_test_24.txt are identical
diff: child_25.txt: No such file or directory
diff: child_test_25.txt: No such file or directory

Since there are only 24 lines in total, splitting into 25 is impossible. I think this error is unavoidable.
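
If a clearer message is preferred over a missing-file error, N could be checked before any file is written. The sketch below is an assumption about how such a guard might look, not something main.py does; `parse_n` is a hypothetical helper and 24 is the line count of hightemp.txt.

```python
def parse_n(text, line_count):
    """Convert user input to N, rejecting values that cannot be honoured."""
    n = int(text)  # raises ValueError for non-numeric input
    if n < 1:
        raise ValueError('N must be a natural number (1 or greater)')
    if n > line_count:
        raise ValueError(
            'N = {} exceeds the {} lines available'.format(n, line_count))
    return n


for candidate in ('5', '25', '0'):
    try:
        print(candidate, '->', parse_n(candidate, 24))
    except ValueError as err:
        print(candidate, '-> rejected:', err)
```

This only covers inputs like N = 25 or N = 0; it does not change the behaviour for cases such as N = 7.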

This time I struggled more with the shell script than with Python, but I am getting used to it a little. I was surprised at the variety of UNIX commands; I did not know there was even a command called printf. That's all for the 17th knock. (Someone pointed out that the knock count I write at the end is always off by one from the problem number. That is because the problem numbers in this series start at 0.) If you find any mistakes, I would appreciate it if you could point them out.
