[PYTHON] I realized that it makes no sense to use a module without thinking just because it is convenient.

Introduction


I'm currently analyzing Apache logs for my research. I had been using a module called apache-log-parser, but I suddenly wondered whether I really needed it, so I looked into it. Which is faster for getting the IP address out of each line: a regular expression, or parsing the line with the module and reading the IP address from the result? That is what this post is about.

Target


The target is one day's worth of logs (about 45 MB, 184087 lines). This time, only the IP address of each line is printed.

Regular expressions


The first is the regular expression method.

sample_regex.py


# coding:utf-8
# Benchmark: extract the IP address from each Apache log line with a regular expression

import os
import re
import sys
import time

if __name__ == "__main__":
    start = time.time()
    # open() does not expand "~", so expand it explicitly
    log_path = os.path.expanduser("~/apache_log_analysis/log_data/" + sys.argv[1])

    # Raw string so that "\." reaches the regex engine unchanged
    re_ip_addr = re.compile(r"((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))")

    with open(log_path) as f:
        for line in f:
            ip_addr = re_ip_addr.search(line)  # None when the line contains no IPv4 address
            if ip_addr is not None:
                print(ip_addr.group())
            else:
                print("no IP address found in line")

    elapsed_time = time.time() - start
    print("elapsed_time:{0}[sec]".format(elapsed_time))
    print("exit")

The first run took 2.10073304176 [sec]. After running it several more times, most runs came in at about 0.9 [sec], which also seems to be the fastest it gets.
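To make it easier to see what the pattern actually picks up, here is a small sketch that applies the same regular expression to a few log lines. The sample lines are hypothetical (made up for illustration, not taken from the real log), and the helper name `first_ipv4` is my own. It assumes Python 3.

```python
import re

# Same pattern as in sample_regex.py, split over two raw strings for readability
IPV4_RE = re.compile(
    r"((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))"
)

def first_ipv4(line):
    """Return the first IPv4-looking substring in line, or None if there is none."""
    match = IPV4_RE.search(line)
    return match.group(1) if match else None

# Hypothetical log lines, made up for illustration
sample = '192.0.2.10 - - [10/Oct/2020:13:55:36 +0900] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
by_name = 'host.example.com - - [10/Oct/2020:13:55:37 +0900] "GET /?from=198.51.100.7 HTTP/1.1" 200 512 "-" "-"'

print(first_ipv4(sample))             # 192.0.2.10
print(first_ipv4(by_name))            # 198.51.100.7 (found in the request, not the client field)
print(first_ipv4("no address here"))  # None
```

Because `search` is not anchored to the start of the line, it returns the first address-like substring anywhere in the line. That is fine when every line starts with a client IP address, but it is worth keeping in mind if some lines start with a hostname.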

Module use


Next is the method using the module.

sample_module.py


# coding:utf-8
# Benchmark: extract the IP address from each Apache log line with apache_log_parser

import os
import sys
import time

import apache_log_parser

if __name__ == "__main__":
    start = time.time()
    # open() does not expand "~", so expand it explicitly
    log_path = os.path.expanduser("~/apache_log_analysis/log_data/" + sys.argv[1])

    # Combined log format; the module parses every field, not just %h
    parser = apache_log_parser.make_parser('%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"')

    with open(log_path) as f:
        for line in f:
            try:
                log_data = parser(line)
                print(log_data['remote_host'])
            except Exception:
                # The parser raises when a line does not match the format
                print("could not parse line")

    elapsed_time = time.time() - start
    print("elapsed_time:{0}[sec]".format(elapsed_time))
    print("exit")
The result was 78.4286789894 [sec]! It was so slow that I checked the program many times to see whether I had made a mistake, wondering what was going on.
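Both scripts time everything from opening the file to the last print, so the numbers include file I/O and terminal output as well as the extraction itself. As a rough sketch of how the extraction alone could be compared, the helper below (the name `best_of` and its parameters are my own, and it assumes Python 3 for `time.perf_counter`) runs an extractor over a list of lines several times and keeps the shortest time, in the same spirit as running each script repeatedly and looking at the fastest run.

```python
import time

def best_of(extract, lines, repeats=5):
    """Return the shortest time in seconds that extract took over all lines."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for line in lines:
            extract(line)  # result is discarded; only the call is being timed
        timings.append(time.perf_counter() - start)
    return min(timings)

# Hypothetical stand-in input; in practice this would be the lines of the real log
lines = ["192.0.2.10 - - placeholder"] * 10000

# str.split is only a placeholder extractor for this demonstration
print("best of 5: {0:.6f} [sec]".format(best_of(str.split, lines)))
```

To compare the two approaches, one would pass `re_ip_addr.search` and `parser` from the scripts above as `extract`, with the log lines read into a list beforehand.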

Conclusion


Thinking about it, this is a natural result: the module parses all the other fields as well, not just the IP address. Even so, I was surprised at how slow it was. When I looked at the module's source, it was written to be general-purpose, so the result made sense.

From now on, rather than relying too heavily on a module, I think it is better to take only the necessary parts from the module's source, or to implement them yourself if that turns out to be faster.
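As one illustration of "only the necessary parts", here is a minimal sketch that reads just the remote host field. It assumes the log uses the format string from sample_module.py, where `%h` is the first whitespace-delimited field. The function name `remote_host` and the sample line are hypothetical, and this is not how apache-log-parser itself works. It assumes Python 3.

```python
def remote_host(line):
    """Return the first whitespace-delimited field of a log line, or None if the line is blank."""
    fields = line.split(None, 1)  # split once on any run of whitespace
    return fields[0] if fields else None

# Hypothetical log line, made up for illustration
sample = '192.0.2.10 - - [10/Oct/2020:13:55:36 +0900] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'

print(remote_host(sample))  # 192.0.2.10
print(remote_host("\n"))    # None
```

Unlike the regular expression, this does not check that the field looks like an IPv4 address. If the server logs hostnames in `%h`, the hostname is returned as-is.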

I learned that modules are convenient, but you can't rely on them too much.
