This article is a continuation of my book-style series Introduction to Python with 100 Language Processing Knocks, for those who want to learn the basics of Python (and Unix commands) while working through Chapter 2 of the 100 Knocks.
If you can make it to number 12, the basics of Python are mostly covered.
After that, I think you will steadily pick up the more detailed knowledge. First, download the file specified in the problem statement by an appropriate method.
```
$ wget https://nlp100.github.io/data/popular-names.txt
```
In natural language processing there are many situations where you want to process a huge text file line by line, and the problems in this chapter are no exception.
TSV (Tab-Separated Values) and CSV (Comma-Separated Values) are often used to express a structure in which one row is one record and the columns are divided into fields. The file dealt with in this chapter is tab-delimited, so it is TSV.
(Confusingly, these formats are sometimes collectively referred to as CSV, for Character-Separated Values.)
The problems in this chapter can be solved with pandas or the standard library csv module, but I don't feel much need for them here, so I will explain the simplest method. Incidentally, the coding style of the answer examples follows PEP 8: variable and function names are snake_case, indentation is 4 spaces, and so on.
If you are not familiar with command options, pipes, redirects, and less, please read sections 1 and 3 of this Qiita article. Then check the contents of the downloaded file with `$ less popular-names.txt`.
Since each problem statement specifies the command to use, you can get by with `--help` even if you don't know the command. Still, most of the Unix commands in this chapter are common ones, so keep them in mind if you can.
In C we used file pointers, but in Python we use a convenient data type called a file object. A file object is iterable, so to read a text file line by line, write:
```python
with open('popular-names.txt') as f:
    for line in f:
        print(line, end='')
```
The `with` statement saves you from writing `f = open('popular-names.txt')` yourself and calling `f.close()` when leaving the block. The official documentation also states that it is **good practice**, so be sure to use it.
In each iteration of the `for` statement, the contents of one line are assigned to `line`.
To use multiple files at once, separate them with commas, as in `with open('test1') as f1, open('test2') as f2`.
Use `sys.stdin` if you want to read standard input line by line. This is also a file object. All the problems in this chapter can be solved the way shown above, but it is somewhat more convenient to use standard input.
```python
import sys

for line in sys.stdin:
    print(line, end='')
```
(Standard input is already `open()`ed from the start, so think of `with` as unnecessary here.)
(Some Unix commands are designed to accept both standard input and file names, but doing so in Python is a little troublesome → [Reference article](https://qiita.com/hi-asano/items/010e7e3410ea4e1486cb).)
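If you do want a script that accepts both file names and standard input like a Unix command, the standard library `fileinput` module is one way; here is a minimal sketch (not used in the answer examples below):

```python
import fileinput

# Reads the files named on the command line; if none are given,
# falls back to standard input -- like many Unix filters.
for line in fileinput.input():
    print(line, end='')
```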
Count the number of lines. Use the wc command for confirmation.
Let's save the Python script, feed it popular-names.txt via standard input, and run it.
Below is an example of the answer.
q10.py
```python
import sys

i = 0
for line in sys.stdin:
    i += 1
print(i)
```
```
$ python q10.py < popular-names.txt
2780
```
You can't use C's `i++` in Python, so use the augmented assignment operator `+=`.
(Avoid `f.read().splitlines()`, `f.readlines()`, and `list(f)` when the file is large or when you want to perform complicated processing.)
Let me also touch on a slightly more elegant method. It takes advantage of the fact that `sys.stdin` is iterable and that a Python `for` block does not form a scope.
```python
import sys

i = 0
for i, _ in enumerate(sys.stdin, start=1):
    pass
print(i)
```
The built-in function `enumerate()`, which counts the loop iterations, is useful here. It's a Python convention to receive unused values with `_`. The `pass` statement is used when you don't want to do anything but need to write something grammatically.
Use the `wc` command (*word count*) for confirmation. Run plainly it prints several counts at once, so specify the `-l`/`--lines` option.

```
$ wc -l popular-names.txt
2780 popular-names.txt
```
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
Let's use `str.replace(old, new)`. This method replaces the substring `old` in the string with `new` and returns the result. The tab character is `\t`, as in C.
Below is an example of the answer.
q11.py
```python
import sys

for line in sys.stdin:
    print(line.replace('\t', ' '), end='')
```
Since there are many lines, check the result with `python q11.py < popular-names.txt | less` or similar.
Of the three Unix commands listed, `sed -e 's/\t/ /g' popular-names.txt` is the most popular. On Twitter I sometimes see people correcting their typos in replies with this notation. `sed` stands for Stream EDitor and is a versatile command.
Personally I find the `s/\t/ /g` part a hassle, so I tend to use `tr '\t' ' ' < popular-names.txt` instead...
Still, `sed` is a command worth knowing: you can use `sed -n 10p` to extract the 10th line, `sed -n 10,20p` to extract the 10th through 20th lines, and so on. Convenient.
In the next problem you will learn about writing files. Use `open(filename, 'w')` to open a text file in write mode.
```python
with open('test', 'w') as fo:
    # fo.write('hoge')
    print('hoge', file=fo)
```
You can use the `write()` method for output, but it is easy to forget to add the newline yourself, so I think it is better to use the optional `file` argument of `print()`.
Save only the first column of each line as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
Below is an example of the answer.
q12.py
```python
import sys

with open('col1.txt', 'w') as fo1,\
        open('col2.txt', 'w') as fo2:
    for line in sys.stdin:
        cols = line.rstrip('\n').split('\t')
        print(cols[0], file=fo1)
        print(cols[1], file=fo2)
```
It would have been nice to write the two `open()` calls on one line, but with a backslash `\` the statement is considered to continue across the line break.
You could also read popular-names.txt with `open()`, but I didn't want the `with` statement to get any longer, so I use standard input.
The `line.rstrip('\n').split('\t')` part is called a method chain; the methods are executed in order from the left. In this problem the result would not change without `rstrip()`, but it prevents `cols[-1]` from including a newline character. Make a habit of this when reading text.
For the Unix command, just specify the `-f`/`--fields` option of `cut`. You could run the command twice, or do both at once with `&&`.
```
$ cut -f1 popular-names.txt > col1.txt && cut -f2 popular-names.txt > col2.txt
```
Combine the col1.txt and col2.txt created in problem 12 into a text file in which the first and second columns of the original file are placed side by side, tab-delimited. Use the paste command for confirmation.
This is easy, isn't it? Below is an example of the answer.
q13.py
```python
with open('col1.txt') as fi1,\
        open('col2.txt') as fi2:
    for col1, col2 in zip(fi1, fi2):
        col1 = col1.rstrip()
        col2 = col2.rstrip()
        print(f'{col1}\t{col2}')
```
In writing q13.py, this is where the built-in function `zip()` comes into play. When its argument is omitted, `rstrip()` removes all trailing whitespace characters, including newlines.
(Considering that there may be 3 or more input files, it is better to receive each tuple yielded by `zip()` as a single variable and `join()` it. That also brings the behavior closer to the `paste` command. If you want to go further, please read this article.)
The Unix command is just `paste col1.txt col2.txt`.
Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.
Command line arguments can be obtained with `sys.argv`, but it is somewhat more convenient to use `argparse`. For instructions, read the official tutorial (which is excellent) from the beginning up to "Short options".
File objects are not sequence types and cannot be sliced, so we need another way to stop after N lines.
Below is an example of the answer.
q14.py
```python
import argparse
import sys


def arg_lines():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--lines', default=1, type=int)
    args = parser.parse_args()
    return args.lines


def head(N):
    for i, line in enumerate(sys.stdin):
        if i < N:
            print(line, end='')
        else:
            break


if __name__ == '__main__':
    head(arg_lines())
```
```
$ python q14.py -n 10 < popular-names.txt
```
The `argparse` part is made into a separate function so that it can be reused in the next problem. `if __name__ == '__main__':` prevents the main processing from running on import.
(To be picky, it's not good to write a long stretch of processing directly under `if __name__ == '__main__':`, because all the variables there are global. Python is slow at accessing global variables, so performance suffers; in fact, code written flat at top level becomes slightly faster just by wrapping it in a function.)
`break` is a control statement used inside a `for` block; it immediately exits the `for` statement. Remember it together with `continue` (which immediately moves on to the next iteration).
The `head()` function can be written a little more elegantly:
```python
import sys
from itertools import islice


def head(N):
    for line in islice(sys.stdin, N):
        print(line, end='')
```
The Unix command is `head -n 5 popular-names.txt` or the like. If you omit the option, it runs with the default value (probably 10).
In the explanation of problem 11 I wrote that, since there are many lines, you should pipe the output to `less`, but if you only want to check the beginning, `head` is enough.
If you connect these commands with a pipe, you may get a broken pipe error at the end. To prevent it, either put `head` first, as in `head popular-names.txt | python q11.py`, or discard the error output, as in `python q11.py < popular-names.txt 2>/dev/null | head`.
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
Python file objects (the return values of `sys.stdin` and `open()`) can, in principle, only move the file pointer forward from the beginning. Counting the lines once and then reopening the file is wasteful, and calling `readlines()` and slicing from the back wastes memory...
This can be done smartly if you know the *queue*, a first-in, first-out data structure. That is, put the contents of the file line by line into a queue of length N. Elements that overflow the queue's length fall out automatically, so in the end only the last N lines remain in the queue.
Use `deque` from the collections module to implement queues in Python. In fact, the deque recipes in the official docs include an example `tail()`, which is almost the answer to this problem.
Below is an example of the answer.
q15.py
```python
from collections import deque
import sys
from q14 import arg_lines


def tail(N):
    buf = deque(sys.stdin, N)
    print(''.join(buf))


if __name__ == '__main__':
    tail(arg_lines())
```
A `deque` can be iterated with a `for` statement just like a list. Up to now we iterated over a list with `for` and `print()`ed each element, but it is faster to `join()` and `print()` all at once ([Reference](https://qiita.com/hi-asano/items/aa2976466739f280b887#%E3%81%8A%E3%81%BE%E3%81%91-%E5%95%8F%E9%A1%8C3-print)). For number 14, `print(''.join(islice(sys.stdin, N)), end='')` would also have been enough.
The Unix command is just `tail -n 5`.
From here the difficulty increases a little, but I hope you will stick with it.
In the previous article I explained that "anything you can iterate over with a for statement is called an iterable". Here is how the iterable data types (plus a few others) that have appeared so far can be classified. There is no need to memorize the finer terms, but a mental map like this can make it easier to place new data types you encounter in the future.
* Iterator
  * `zip()`, `enumerate()`
* Sequence
  * `list`, `tuple`, `str`, `range`, `collections.deque`
* Set
  * `set`, `frozenset` (immutable set type)
* Mapping
  * `dict`, `collections.Counter`, `collections.defaultdict`
Now let's talk about **iterators** (which we have quietly been using all along). Data types like lists hold all their elements in memory at once, which is a burden when they are large. And if you don't need `len()` or indexing, and only pass the data to a `for` statement or a function like `str.join()`, much of that is wasted. It is also a waste of time to generate the later elements when you don't loop all the way to the end. Iterators eliminate such drawbacks: an iterator yields only one element per loop iteration, so it handles memory efficiently. You can't slice one, but `itertools.islice()` lets you do something similar. Also, once an iterator has been fully consumed, nothing more can be done with it. Because of these restrictions, iterators are used almost exclusively in `for` statements or as arguments to functions that take iterables.
Data types that support the `in` operation and `len()` as well as `for` statements are called collections, or containers (the official documentation calls them containers, but by the abstract base class definitions, collection is the stricter term).
Indexes and slices can be used for all sequence types.
Besides `deque`, the `collections` module defines other useful data types such as `Counter` and `defaultdict`, so keep it in mind as well. They may come up in later problems.
There are times when you want to perform some operation on every element of an iterable, or to extract only the elements that satisfy a condition. Comprehensions and generator expressions let you describe such processing concisely. Let's reuse the example of 100 Knock 03, "Pi".
```python
tokens = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'.split()
# List comprehension
stripped_list = [len(token.rstrip('.,')) for token in tokens]
print(stripped_list)
```
```
[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```
First, the list comprehension. Previously we called `append()` inside a `for` statement; in a list comprehension you write what you would `append()` first, follow it with the `for` clause, and enclose the whole thing in `[]`. It may take some getting used to, but this way of writing also runs faster ([Reference](https://qiita.com/hi-asano/items/aa2976466739f280b887#%E5%95%8F%E9%A1%8C1-%E3%83%AA%E3%82%B9%E3%83%88%E7%94%9F%E6%88%90)), so use it actively.
Next, generator expressions. The return value of a generator expression is called a generator, which is a kind of iterator. An iterator can only be used by iterating over it with a `for` statement or passing it to another function; if anything, the latter is the more common usage.
```python
# Generator expression
stripped_iter = (len(token.rstrip('.,')) for token in tokens)
for token in stripped_iter:
    print(token, end=' ')
```
```
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9
```
```python
' '.join(str(len(token.rstrip('.,'))) for token in tokens)
```
```
'3 1 4 1 5 9 2 6 5 3 5 8 9 7 9'
```
As you can see, a generator expression is just the list comprehension with `[]` changed to `()`. Moreover, you can omit the `()` when passing it as the sole argument of a function.
Passing a generator expression to a function has the advantage of reducing intermediate variables. Compared to passing a list comprehension, it can reduce memory usage, and generator expressions are often faster, too.
(In rare cases a function runs faster when given a list, and this `join()` happens to be one of the exceptions...)
The problem with generator expressions is that they are hard to write when you want to do complicated processing. In that case you may want to define a function that returns an iterator. The easiest way is to use the `yield` statement; a function defined that way is called a generator function. Naturally, the object a generator function produces is also a generator.
To define a generator function, place `yield return_value` in the middle (or at the end) of the function body instead of `return`. The big difference is that `return` ends the function's processing on the spot and its local variables disappear, while `yield` suspends the function and resumes from there on the next iteration.
```python
def tokens2lengths(tokens):
    for token in tokens:
        yield len(token.rstrip('.,'))

for token in tokens2lengths(tokens):
    print(token, end=' ')
```
```
3 1 4 1 5 9 2 6 5 3 5 8 9 7 9
```
"So what's the benefit?" you may ask... You won't use generator functions in these two chapters... They are said to make certain recursive functions easier to write, but you rarely write recursive functions in the first place... Personally, I use them in cases like the following. Suppose you have your own function `process`.
```python
for elem in lis:
    if elem is not None:
        outstr = process(elem)
        print(outstr)
```
In this code, the function-call overhead becomes non-negligible as the number of elements in `lis` grows. So if you turn `process` into a generator function, it becomes slightly faster. You can also absorb the conditional expression into it, which tidies up the main loop.
```python
for outstr in iter_process(lis):
    print(outstr)
```
This digression has gotten long. Let's solve the next problem.
The documentation states that "generator" usually refers to a generator function, and calls the object a generator function produces a generator iterator. However, if you check the return type of a generator function (or a generator expression) with the `type()` function, it displays `generator`. Probably for this reason, unofficial documents often call the official "generator iterator" simply a generator.
Receive a natural number N by means such as a command line argument, and split the input file line-wise into N parts. Achieve the same processing with the split command.
This is a difficult problem. Various approaches are possible, but if we keep the constraint of never holding the whole file in memory at once, it seems the only way is to first count the total number of lines and then divide. To return the file object's reference point to the beginning, you can reopen the file, or use `f.seek(0)` (meaning "point at byte 0 from the start").
And **splitting into N parts as evenly as possible is the annoying part**. For example, to divide 14 lines into 4 parts, you want pieces of 4, 4, 3, and 3 lines. Let's think about how.
Once you have the widths, you just read and write that many lines at a time. There is a method `fi.readline()` that reads exactly one line; this may be its moment. Each piece should probably go to a separate file.
Below is an example of the answer.
q16.py
```python
import argparse
import sys


def main():
    parser = argparse.ArgumentParser(
        description='Output pieces of FILE to FILE1, FILE2, ...;')
    parser.add_argument('file')
    parser.add_argument('-n', '--number', type=int,
                        help='split FILE into n pieces')
    args = parser.parse_args()
    file_split(args.file, args.number)


def file_split(filename, N):
    with open(filename) as fi:
        n_lines = sum(1 for _ in fi)
        fi.seek(0)
        for nth, width in enumerate((n_lines+i)//N for i in range(N)):
            with open(f'{filename}.split{nth}', 'w') as fo:
                for _ in range(width):
                    fo.write(fi.readline())


if __name__ == '__main__':
    main()
```
```
$ python q16.py -n 3 popular-names.txt
$ wc -l popular-names.txt.split*
  926 popular-names.txt.split0
  927 popular-names.txt.split1
  927 popular-names.txt.split2
 2780 total
```
By now the `argparse` usage should pose no trouble. To count the lines, this time we use the built-in function `sum()`, which computes the sum of the elements of an iterable.
Now, how to divide the lines evenly. Suppose that dividing `m` items among `n` people gives quotient `q` and remainder `r`. Then if you give `q` items each to `(n-r)` of the people, and `(q+1)` items each to the remaining `r` people, the split comes out even.
The `((n_lines+i)//N for i in range(N))` part expresses this compactly. `//` divides and truncates the fractional part. For why this yields an even split, see this Qiita article.
If you don't care about the order of the lines, you can also use `tee()` and `islice()` from `itertools`. And if you don't care about memory, `zip_longest()` may be even easier.
The Unix command should be `split -n l/5 -d popular-names.txt popular-names.txt`, but it may not work depending on the `split` in your environment.
The remaining problems are easier.
Find the distinct strings in the first column (the set of different strings). Use the sort and uniq commands for confirmation.
You just add the first column of each line to a set. Only list comprehensions were explained above, but there are also set comprehensions and dictionary comprehensions.
Below is an example of the answer.
q17.py
```python
import sys

names = {line.split('\t')[0] for line in sys.stdin}
print('\n'.join(names))
```
Keep in mind that a set's iteration order changes from run to run. If you don't like that, use a dict instead (in the CPython implementation since 3.6, and officially since 3.7, dicts keep keys in insertion order).
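A dict-based variant that keeps first-appearance order might look like this (a sketch; `dict.fromkeys()` builds a dict whose keys act as an order-preserving set):

```python
import sys

# The values are all None and ignored; only the keys matter.
names = dict.fromkeys(line.split('\t')[0] for line in sys.stdin)
print('\n'.join(names))
```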
The Unix command is `cut -f1 popular-names.txt | sort | uniq`. Because `uniq` removes duplicates only in adjacent lines, the `sort` is required for this kind of thing.
I will use them in the next problem, so let me cover lambda expressions. Lambda expressions are used to define small functions. For example, `lambda a, b: a + b` is a function that returns the sum of two numbers. It can be called like a normal function, but it is mainly used for optional arguments such as the `key` of `sort()`. Beyond that, it is sometimes passed to other functions or used as the return value of your own function.
The official Sorting HOW TO is a good resource on `sort()`. Reading up to "Ascending and Descending" is enough.
Sort the lines in descending order of the numbers in the third column (note: rearrange the lines without changing the contents of each line). Use the sort command for confirmation (for this problem, the result need not exactly match the command's output).
(Suddenly it says "column", when up to now the problems said "row"...) Below is an example of the answer.
q18.py
```python
import sys

sorted_list = sorted(sys.stdin, key=lambda x: int(x.split('\t')[2]), reverse=True)
print(''.join(sorted_list))
```
Note that the numbers in the third column remain strings unless you cast them to a numeric type. Casting can be done with the built-in function `int()`.
The Unix command is `sort -k3 -nr popular-names.txt`: treat the third field as a number (`-n`) and sort in reverse, i.e. descending, order (`-r`).
Unix `sort` is excellent; it handles large files without running out of memory. It is also relatively easy to speed up (tweaking the locale, splitting the input and merging at the end, and so on).
Find the frequency of each string in the first column of the lines, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
The `collections.Counter` foreshadowing pays off here! If you read the documentation, there should be no problem. Keep in mind that `Counter` is a subclass of `dict`.
Below is an example of the answer.
q19.py
```python
from collections import Counter
import sys

col1_freq = Counter(line.split('\t')[0] for line in sys.stdin)
for elem, num in col1_freq.most_common():
    print(num, elem)
```
The Unix command is `cut -f1 popular-names.txt | sort | uniq -c | sort -nr`. When building up a pipeline like this, I think it's a good idea to check the intermediate output with `head` as you connect each pipe.
Here is what we covered:

- Unix command basics
- File reading and writing
- str.replace()
- argparse
- collections
- Iterators and generators
- Comprehensions
- Lambda expressions and sort

JSON files can be read with the json module, and you can learn about regular expressions in the official Regular Expressions HOW TO. I will write a sequel if this gets LGTMs or comments.
(4/30 postscript) The explanation of Chapter 3 has been released. → https://qiita.com/hi-asano/items/8e303425052781d95f09