Rehabilitation of Python and NLP skills starting with "100 Language Processing Knock 2015" (Chapter 2 first half)

It's been almost a month since the last update ... a textbook case of giving up after three days.

As the chapter title suggests, these tasks are normally done with UNIX commands, so writing them in Python can feel a little tedious.

First, download the dataset ...

Chapter 2: UNIX Command Basics

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

python


filename = 'hightemp.txt'
# Iterate over the file object line by line; the with block closes it afterwards
with open(filename, 'r') as f:
    print sum(1 for l in f)
#>>> 24

There seem to be various ways to do this ... http://551sornwmc.blog109.fc2.com/blog-entry-387.html http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python

In terms of memory usage and execution speed, a memory-mapped file seems to be the better option.

python


# using memory mapped file
import mmap
def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines
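
For comparison, here is a minimal sketch of another low-memory way to count lines: read the file in fixed-size binary chunks and count the newline bytes, which is the quantity wc -l reports. It is written for Python 3 (the code in this post is Python 2), and the function name count_newlines, the chunk size, and the throwaway demo file are assumptions made for illustration only.

```python
import os
import tempfile


def count_newlines(path, chunk_size=1 << 16):
    """Count newline bytes in a file without loading it all into memory."""
    total = 0
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


# Small self-contained demo on a throwaway file (not hightemp.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\tb\nc\td\ne\tf\n")
    demo_path = tmp.name

demo_count = count_newlines(demo_path)
os.remove(demo_path)
print(demo_count)
```

Like wc -l, this counts newline characters, so a final line without a trailing newline is not counted.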

Here is the confirmation with a UNIX command.

python


wc -l hightemp.txt
#>>> 24

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

python


#import re

filename = 'hightemp.txt'
f = open(filename, 'r')
lines = f.readlines()
for line in lines:
    #line_replaced = re.sub(r'\t', ' ', line)
    line_replaced = line.expandtabs(1)
    print line_replaced,

Strings have an expandtabs method; with a tab size of 1, each tab becomes exactly one space.
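
To see why a tab size of 1 matters, here is a small sketch (Python 3 syntax, with a made-up sample string) comparing expandtabs(1) with a plain str.replace, and showing that a larger tab size pads to column positions instead of inserting a fixed number of spaces.

```python
# Hypothetical sample line; any tab-delimited text behaves the same way.
sample = "a\tbc\tdef\n"

# With tabsize=1 every column is a tab stop, so each tab becomes one space.
with_expandtabs = sample.expandtabs(1)
with_replace = sample.replace("\t", " ")
print(with_expandtabs == with_replace)

# With a larger tab size the padding depends on the current column.
wider = sample.expandtabs(4)
print(repr(wider))
```

For this particular task, replace says what is meant more directly, while expandtabs is the closer analogue of the expand command.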

Here is the confirmation with UNIX commands.

python


cat hightemp.txt | tr '\t' ' '

This ↑ seems to be the smoothest.

python


sed -e 's/\t/ /g' hightemp.txt
#\t is not expanded by the sed on Mac, so try again with bash $'...' quoting
sed -e $'s/\t/ /g' hightemp.txt

http://mattintosh.hatenablog.com/entry/2013/01/16/143323

The BSD sed shipped with Mac OS X and similar systems does not expand \t in a script to a tab the way echo and printf do.

Oh...

12. Save the first column in col1.txt and the second column in col2.txt

Save the extracted version of only the first column of each row as col1.txt and the extracted version of only the second column as col2.txt. Use the cut command for confirmation.

python


filename = 'hightemp.txt'
filename_col1 = 'col1.txt'
filename_col2 = 'col2.txt'

f = open(filename, 'r')
f_col1 = open(filename_col1, 'w')
f_col2 = open(filename_col2, 'w')

lines = f.readlines()

content_col1 = [line.split()[0] + '\n' for line in lines]
content_col2 = [line.split()[1] + '\n' for line in lines]

f_col1.writelines(content_col1)
f_col2.writelines(content_col2)

f.close()
f_col1.close()
f_col2.close()

One thing to note is that the writelines method does not add line breaks, so you have to add them yourself.
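
A quick way to convince yourself of this is to write into an in-memory buffer. The sketch below uses Python 3 and io.StringIO in place of a real file; the row values are made up for illustration.

```python
import io

# Made-up values standing in for one extracted column.
rows = ["col_a", "col_b", "col_c"]

# writelines writes each string as-is, so these run together on one line.
without_newlines = io.StringIO()
without_newlines.writelines(rows)

# Adding the terminator ourselves gives one row per line.
with_newlines = io.StringIO()
with_newlines.writelines(row + "\n" for row in rows)

print(repr(without_newlines.getvalue()))
print(repr(with_newlines.getvalue()))
```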

Here is the confirmation with UNIX commands. Wow, it's so easy that I feel nauseous.

python


cut -f1 hightemp.txt > col1.txt
cut -f2 hightemp.txt > col2.txt
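
One difference worth knowing: cut -f splits on tabs only, while line.split() with no argument splits on any whitespace, so a field that contains a space would be cut short. Here is a minimal Python 3 sketch of a tab-only version. The helper name cut_field and the sample rows are hypothetical, and the handling of missing fields is a simplification of what cut really does.

```python
def cut_field(lines, field):
    """Return the 1-based tab-separated field of each line, roughly like cut -f."""
    result = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        # Real cut has extra rules for lines without a delimiter; this sketch
        # just returns an empty string when the requested field is missing.
        result.append(columns[field - 1] if field <= len(columns) else "")
    return result


# Hypothetical rows whose first field contains a space.
sample = ["New York\tAlbany\t1\n", "Texas\tAustin\t2\n"]
first = cut_field(sample, 1)
second = cut_field(sample, 2)
whitespace_split = sample[0].split()[0]

print(first)
print(second)
print(whitespace_split)
```

With hightemp.txt the two approaches should agree as long as no field contains a space.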

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

python


filename_col1 = 'col1.txt'
filename_col2 = 'col2.txt'
filename_col1_col2 = 'col1_col2.txt'

f_col1 = open(filename_col1, 'r')
f_col2 = open(filename_col2, 'r')
f_col1_col2 = open(filename_col1_col2, 'w')

lines_1 = f_col1.readlines()
lines_2 = f_col2.readlines()

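# readlines keeps the trailing newline on each line, so strip it before joining
content = [line1.rstrip('\n') + '\t' + line2.rstrip('\n') + '\n' for line1, line2 in zip(lines_1, lines_2)]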

f_col1_col2.writelines(content)
f_col1_col2.close()    

f_col1.close()    
f_col2.close()    

Here is the confirmation with UNIX commands. It was so easy that I threw up.

python


paste col1.txt col2.txt > col1_col2.txt
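
When doing the same merge in Python, keep in mind that lines read from a file still end with a newline, which has to be stripped before the tab is inserted. A minimal Python 3 sketch, assuming a hypothetical helper named paste_lines and made-up sample values:

```python
def paste_lines(left_lines, right_lines, delimiter="\t"):
    """Join corresponding lines with a delimiter, roughly like paste."""
    merged = []
    for left, right in zip(left_lines, right_lines):
        # Lines read from a file still end with "\n"; strip it before joining.
        merged.append(left.rstrip("\n") + delimiter + right.rstrip("\n") + "\n")
    return merged


# Sample values for illustration only.
col1 = ["Kochi\n", "Saitama\n"]
col2 = ["Ekawasaki\n", "Kumagaya\n"]
merged = paste_lines(col1, col2)
print(merged)
```

Note that zip stops at the shorter input, whereas paste keeps going and leaves the missing side empty; for two columns cut from the same file the lengths match, so this does not matter here.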

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.

knock014.py


# -*- coding: utf-8 -*-

import sys
import argparse

parser = argparse.ArgumentParser(description='Head command. Accepts an integer and a file name.')

#Number of lines
parser.add_argument(
	'-l', '--line',
	type = int,
	dest = 'line',
	default = 10,
	help = 'Equivalent to the number of lines specified by the head command'
)

#file name
parser.add_argument(
    '-f', '--filename',
    type = str,						#Specify the type of value to receive
    dest = 'filename',     			#Save destination variable name
    required = True,    			#Required item
    help = 'File name given as input'	# --Statement to display when helping
)

args = parser.parse_args()
N = args.line
filename = args.filename

#Display the first N lines (stops early if the file has fewer than N lines)
f = open(filename)
for i, line in enumerate(f):
	if i >= N:
		break
	print line.rstrip('\n')
f.close()

Running the above gives the following.

python


python knock014.py -l 3 -f hightemp.txt
# >>>Kochi Prefecture Ekawasaki 41 2013-08-12
# >>>Saitama Prefecture Kumagaya 40.9	2007-08-16
# >>>Gifu Prefecture Tajimi 40.9	2007-08-16

python knock014.py -f hightemp.txt
# >>>Kochi Prefecture Ekawasaki 41 2013-08-12
# >>>Saitama Prefecture Kumagaya 40.9	2007-08-16
# >>>Gifu Prefecture Tajimi 40.9	2007-08-16
# >>>Yamagata Prefecture Yamagata 40.8	1933-07-25
# >>>Yamanashi Prefecture Kofu 40.7	2013-08-10
# >>>Wakayama Prefecture Katsuragi 40.6	1994-08-08
# >>>Shizuoka Prefecture Tenryu 40.6	1994-08-04
# >>>Yamanashi Prefecture Katsunuma 40.5	2013-08-10
# >>>Saitama Prefecture Koshigaya 40.4	2007-08-16
# >>>Gunma Prefecture Tatebayashi 40.3	2007-08-16

Here is the confirmation with UNIX commands.

python


head -3 hightemp.txt

head hightemp.txt
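
If you want the Python version to behave like head when the input has fewer than N lines, itertools.islice is one option, since it simply stops when the input runs out. A minimal Python 3 sketch follows; the function name head and the in-memory sample are assumptions for illustration, and knock014.py itself is Python 2.

```python
import io
from itertools import islice


def head(stream, n=10):
    """Return at most the first n lines of a text stream."""
    # islice stops early on short input instead of raising StopIteration.
    return [line.rstrip("\n") for line in islice(stream, n)]


# In-memory stand-in for a file with three lines.
first_two = head(io.StringIO("one\ntwo\nthree\n"), 2)
all_lines = head(io.StringIO("one\ntwo\nthree\n"), 5)
print(first_two)
print(all_lines)
```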

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

knock015.py


# -*- coding: utf-8 -*-

import sys
import argparse

parser = argparse.ArgumentParser(description='Tail command. Accepts an integer and a file name.')

#Number of lines
parser.add_argument(
	'-l', '--line',
	type = int,
	dest = 'line',
	default = 10,
	help = 'Equivalent to the number of lines specified by the tail command'
)

#file name
parser.add_argument(
    '-f', '--filename',
    type = str,						#Specify the type of value to receive
    dest = 'filename',     			#Save destination variable name
    required = True,    			#Required item
    help = 'File name given as input'	# --Statement to display when helping
)

args = parser.parse_args()
N = args.line
filename = args.filename

#Show last N lines
f = open(filename)
lines = f.readlines()
M = len(lines)

for i, line in enumerate(lines):
	if i+N >= M:
		#print i
		print line.strip()

f.close()

Basically, only the final step differs from 14. Here is the confirmation with UNIX commands.

python


tail -3 hightemp.txt

tail hightemp.txt
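
The version above reads the whole file into a list. As an alternative sketch, collections.deque with maxlen keeps only the most recent N lines while iterating, so memory use stays proportional to N. This is Python 3, and the function name tail and the in-memory sample are assumptions for illustration.

```python
import io
from collections import deque


def tail(stream, n=10):
    """Return at most the last n lines of a text stream."""
    # A deque with maxlen discards the oldest item once it holds n items.
    return [line.rstrip("\n") for line in deque(stream, maxlen=n)]


# In-memory stand-in for a file with three lines.
last_two = tail(io.StringIO("one\ntwo\nthree\n"), 2)
everything = tail(io.StringIO("one\ntwo\nthree\n"), 5)
print(last_two)
print(everything)
```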
