On Linux there is a command called `tail` that retrieves the last n lines of a file. It's pretty convenient, so I want to be able to do the same with Python: a function called as `tail(file_name, n)` that retrieves n lines from the end of a file, implemented with several approaches.
For the last approach, I referred to the it-swarm.dev page [Efficiently find the last line of a text file](https://www.it-swarm.dev/ja/python/ Efficiently find the last line of a text file / 940298444 /).
The file to be read could be any text file, but this time I will use a CSV. The file name is `test.csv`, and its content is a per-second summary of Bitcoin prices, 86400 lines (one day).
```
date,price,size
1588258800,933239.0,3.91528007
1588258801,933103.0,3.91169431
1588258802,932838.0,2.91
1588258803,933217.0,0.5089811
(Omission)
1588345195,955028.0,0.0
1588345196,954959.0,0.05553
1588345197,954984.0,1.85356
1588345198,955389.0,10.91445135
1588345199,955224.0,3.61106
```
Although it has nothing to do with the main subject, a quick explanation of each column: the units of date, price, and size are UnixTime, JPY, and BTC respectively. The first line means that at time 1588258800, that is, at 0:00:00 on May 1st, 3.91528007 BTC was traded at 933239.0 yen.
First, use the built-in function `open()` to get a file object, read all the lines from the beginning, and output only the last n lines. If n is 0 or a negative integer the result will be strange, so strictly speaking the input should be restricted to natural numbers, but here I keep the code simple for readability.
```python
def tail(fn, n):
    # Open the file and get all the lines in a list
    with open(fn, 'r') as f:
        # Read one line; the first line is the header, so discard the result
        f.readline()
        # Read all remaining lines
        lines = f.readlines()
    # Return only the last n lines
    return lines[-n:]

# Result
file_name = 'test.csv'
tail(file_name, 3)
# ['1588345197,954984.0,1.85356\n',
#  '1588345198,955389.0,10.91445135\n',
#  '1588345199,955224.0,3.61106\n']
```
For a plain text file this would be enough, but let's make it a little easier to use for CSV files.
```python
def tail(fn, n):
    # Open the file and get all the lines in a list
    with open(fn, 'r') as f:
        f.readline()
        lines = f.readlines()
    # Return each line as a list, converting str -> float along the way
    return [list(map(float, line.strip().split(','))) for line in lines[-n:]]

# Result
tail(file_name, 3)
# [[1588345197.0, 954984.0, 1.85356],
#  [1588345198.0, 955389.0, 10.91445135],
#  [1588345199.0, 955224.0, 3.61106]]
```
The only line that has changed is the `return` line, but it packs several functions together and is hard to read, so let's break it down. For each line, `strip()` removes the trailing newline and `split(',')` turns the string into `['1588345197', '954984.0', '1.85356']`; then `map(float)` converts it to `[1588345197.0, 954984.0, 1.85356]`.
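The breakdown above can be reproduced on a single sample line (the values come from `test.csv` shown earlier):

```python
# One raw line as read by readlines(), including the trailing newline
line = '1588345197,954984.0,1.85356\n'

# strip() removes the newline, split(',') separates the fields
fields = line.strip().split(',')  # ['1588345197', '954984.0', '1.85356']

# map(float) converts each string field to a float
row = list(map(float, fields))    # [1588345197.0, 954984.0, 1.85356]
```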
The csv module automatically converts each line to a list, so the processing is a little slower, but the code can be written more concisely.
```python
import csv

def tail_csv(fn, n):
    with open(fn) as f:
        # Wrap the file object in a csv reader
        reader = csv.reader(f)
        # Discard the header
        next(reader)
        # Read all rows
        rows = [row for row in reader]
    # Convert only the last n rows to float and return them
    return [list(map(float, row)) for row in rows[-n:]]
```
pandas has a `tail` method, so this version is surprisingly easy to write.
```python
import pandas as pd

def tail_pd(fn, n):
    df = pd.read_csv(fn)
    return df.tail(n).values.tolist()
```
pandas works on numpy arrays internally, so `tolist()` converts the result to a plain list at the end; it is unnecessary if a numpy array is fine for you.
IPython has a convenient magic command called `%timeit`, so let's compare the three with the number of loops set to 100.
```
%timeit -n100 tail('test.csv', 3)
# 18.8 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -n100 tail_csv('test.csv', 3)
# 67 ms ± 822 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit -n100 tail_pd('test.csv', 3)
# 30.4 ms ± 2.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
It turned out that reading the file directly, without any module, is the fastest. pandas seems to offer the best cost performance, given the simplicity of the code and reasonable speed. The csv module converts every line from a string to a list, including all the lines that are never used, so its result is by far the worst.
All of the approaches so far read every line after all. However, I only want the last few lines, so if there were a way to read the file from the back, reading should finish in an instant.
I referred to the page [Efficiently find the last line of a text file](https://www.it-swarm.dev/ja/python/ Efficiently find the last line of a text file / 940298444 /).
The idea is to read about 100 bytes at a time from the back; once a newline is found, the string after it is the last line. The page only finds the last line, but to realize the `tail` command we need the n-th line from the back, so only that part needs adjusting.
First, as background, let me explain how to operate the file pointer. There are three functions to use: `f.tell()`, `f.read(size)`, and `f.seek(offset, whence)`.

`f.tell()` returns the position the pointer currently points to.

`f.read(size)` reads and returns `size` bytes from the current position, moving the pointer past what was read. The pointer can only advance in the positive direction this way.

`f.seek(offset, whence)` moves the pointer. `whence` specifies the reference position and takes one of the values 0, 1, 2: 0 is the beginning of the file, 1 is the current pointer position, and 2 is the end of the file. `offset` is an integer; unlike `read`, it may be negative, so for example `f.seek(-15, 1)` moves the pointer 15 bytes back from its current position.
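These three operations can be sketched on a small throwaway file (the file contents here are made up purely for illustration):

```python
import os
import tempfile

# Write a small binary file to experiment on (contents are just an example)
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wb') as f:
    f.write(b'hello\nworld\n')     # 12 bytes in total

with open(path, 'rb') as f:        # binary mode, so negative seeks work
    f.seek(0, 2)                   # whence=2: jump to the end of the file
    end = f.tell()                 # pointer is now at byte 12
    f.seek(-6, 2)                  # move 6 bytes back from the end
    last = f.read(6)               # reads b'world\n'; pointer is back at the end
```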
We will implement it based on these.
```python
# Use re.split, which supports regular expressions
import re

def tail_b(fn, n=None):
    # If n is not given, only the last line is returned on its own
    if n is None:
        n = 1
        is_list = False
    # n must be a natural number
    elif type(n) != int or n < 1:
        raise ValueError('n has to be a positive integer')
    # When n is given, n rows are returned together in a list
    else:
        is_list = True
    # Read 64 * n bytes at a time
    chunk_size = 64 * n
    # seek() behaves unexpectedly except in binary mode, so specify 'rb'
    with open(fn, 'rb') as f:
        # Read the first (header) line to find the leftmost position to stop at
        f.readline()
        # The very first newline is the left end (where reading from the back ends);
        # -1 accounts for the 1 byte of '\n'
        left_end = f.tell() - 1
        # Move 1 byte back from the end of the file (whence=2) so read(1) can read it
        f.seek(-1, 2)
        # Files often end with blank lines or spaces, so find the position of the
        # last real character (the right end), skipping past them
        while True:
            if f.read(1).strip() != b'':
                # Right end
                right_end = f.tell()
                break
            # read(1) advanced one byte, so step back two
            f.seek(-2, 1)
        # Number of bytes between the right end and the left end still unread
        unread = right_end - left_end
        # Number of lines read; once this reaches n, n lines have been read
        num_lines = 0
        # Buffer for concatenating the read byte strings
        line = b''
        while True:
            # When fewer than chunk_size unread bytes remain, shrink chunk_size
            # to the remaining fraction
            if unread < chunk_size:
                chunk_size = f.tell() - left_end
            # Move chunk_size bytes toward the top of the file from the current position
            f.seek(-chunk_size, 1)
            # Read exactly the amount just skipped
            chunk = f.read(chunk_size)
            # Prepend it to the buffer
            line = chunk + line
            # read() advanced the pointer again, so move back chunk_size once more
            f.seek(-chunk_size, 1)
            # Update the number of unread bytes
            unread -= chunk_size
            # If the chunk contains a newline
            if b'\n' in chunk:
                # Count up num_lines by the number of newlines found
                num_lines += chunk.count(b'\n')
                # Stop once n or more lines have been read, or nothing is left unread
                if num_lines >= n or not unread:
                    # The leftmost newline in the buffer; everything before and
                    # including it is a partial line we do not need
                    leftmost_blank = re.search(rb'\r?\n', line)
                    line = line[leftmost_blank.end():]
                    # Convert the byte string to a string
                    line = line.decode()
                    # Split on '\r\n' or '\n' into a list of lines
                    lines = re.split(r'\r?\n', line)
                    # Take the last n lines, convert them to float, and return
                    result = [list(map(float, line.split(','))) for line in lines[-n:]]
                    # If n was not specified, return the last line on its own
                    if not is_list:
                        return result[-1]
                    else:
                        return result
```
The explanation is given in the comments. Now let's do the main time measurement.
```
%timeit -n100 tail_b(fn, 3)
# 87.8 µs ± 3.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
The best time so far was the first approach at 18.8 ms ± 175 µs. The new execution time is about 0.5% of that, roughly 200 times faster. That is only natural: it is the difference between reading all 86400 lines from the beginning and reading just a few lines from the back.
I introduced four patterns, but there is one more way: executing the system's `tail` command via the `subprocess` module. Since that method is environment-dependent, I omitted it this time.
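For completeness, a minimal sketch of what that would look like on a Unix-like system with `tail` on the PATH (the function name `tail_sh` is mine, and this will not work on platforms without a `tail` command):

```python
import subprocess

def tail_sh(fn, n):
    # Shell out to the system tail command; environment-dependent by design
    out = subprocess.run(['tail', '-n', str(n), fn],
                         capture_output=True, text=True, check=True)
    # Return the captured stdout as a list of lines
    return out.stdout.splitlines()
```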
Of the methods introduced, the one I recommend most is the two-line pandas version. Python is a language where you can enjoy building on other people's code.
As for reading from the back of the file, I recommend it when you need speed, or when the number of lines and characters is so ridiculously large that reading the file from the beginning takes too long.
Also, there is no particular meaning to using 64 when determining `chunk_size`. Setting it to about the length of one line in the file is probably fastest, but in some files the length varies greatly from line to line, so nothing definitive can be said. If you are dealing with files that mix short lines of a few characters with long lines of 10,000 characters, you will need to change `chunk_size` dynamically. For example, if the number of lines found in one pass does not reach n, double `chunk_size` for the next pass; determining the next `chunk_size` from the number of lines found so far and their average length also seems effective.
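The doubling idea can be sketched as follows. This is my own illustrative simplification, not the `tail_b` implementation above: it returns raw byte-string lines and grows the read size each pass until enough newlines have been seen.

```python
def last_n_lines(fn, n, start_chunk=64):
    # Read growing chunks from the back until at least n lines are buffered
    with open(fn, 'rb') as f:
        f.seek(0, 2)               # jump to the end of the file
        pos = f.tell()             # bytes remaining in front of the buffer
        chunk_size = start_chunk
        buf = b''
        # Stop when the whole file is read or more than n newlines are buffered
        while pos > 0 and buf.count(b'\n') <= n:
            chunk_size = min(chunk_size, pos)  # don't read past the start
            pos -= chunk_size
            f.seek(pos)            # absolute seek to the chunk's start
            buf = f.read(chunk_size) + buf
            chunk_size *= 2        # double the read size for the next pass
    return buf.splitlines()[-n:]
```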