Text processing in Python

Background

At work I mostly use Perl for text processing, and I had only used Python as a tool for I/O processing on a Raspberry Pi. To get a feel for how Python is designed as a programming language, I wrote a text processing script as a trial.

What I made

The script processes the following log file line by line and prints it in a tidy format.

log.txt


date Thu Apr 11 04:41:25 pm 2013
base hex  timestamps absolute
internal events logged
// version 8.0.0
Begin Triggerblock Thu Apr 11 04:41:25 pm 2013
   0.000000 Start of measurement
   0.001316 CAN 1 Status:chip status error active
   0.001399 1  1F3             Rx   d 3 00 10 00  Length = 146000 BitCount = 77 ID = 499
   0.002763 1  1E5             Rx   d 8 4C 00 21 10 00 00 00 B9  Length = 228000 BitCount = 118 ID = 485
   0.003009 1  710             Rx   d 8 00 5F 00 00 00 00 13 BE  Length = 238000 BitCount = 123 ID = 1808
   0.003175 1  C7              Rx   d 4 00 38 26 9B  Length = 158000 BitCount = 83 ID = 199
   0.003349 1  1CC             Rx   d 4 00 00 00 00  Length = 165883 BitCount = 87 ID = 460
   0.003586 1  F9              Rx   d 8 00 DA 40 33 D0 63 FF 1C  Length = 228000 BitCount = 118 ID = 249
   0.003738 1  1CF             Rx   d 3 00 00 05  Length = 144000 BitCount = 76 ID = 463
   0.003976 1  711             Rx   d 8 00 23 00 7E FF EB FC 6F  Length = 230000 BitCount = 119 ID = 1809
   0.004148 1  1D0             Rx   d 4 00 00 00 00  Length = 164000 BitCount = 86 ID = 464
   0.004382 1  C1              Rx   d 8 30 14 F6 08 32 B4 F7 70  Length = 226000 BitCount = 117 ID = 193
   0.004615 1  C5              Rx   d 8 31 27 F8 44 32 B0 F8 5C  Length = 224121 BitCount = 116 ID = 197
   0.004825 1  BE              Rx   d 6 00 00 4D 00 00 00  Length = 202242 BitCount = 105 ID = 190
   0.005051 1  D1              Rx   d 7 80 00 BF FE 00 FE 00  Length = 218121 BitCount = 113 ID = 209
   0.005292 1  C9              Rx   d 8 80 2C 5A 60 00 00 18 00  Length = 232242 BitCount = 120 ID = 201
   0.005538 1  1C8             Rx   d 8 80 00 00 00 FF FE 3F FE  Length = 238121 BitCount = 123 ID = 456
   0.005774 1  18E             Rx   d 8 00 00 00 84 78 46 08 45  Length = 228242 BitCount = 118 ID = 398

# Output only the required fields
$ python canlogfilter.py log.txt
0.001399 1 1F3 Rx 3 00 10 00
0.002763 1 1E5 Rx 8 4C 00 21 10 00 00 00 B9
0.003009 1 710 Rx 8 00 5F 00 00 00 00 13 BE
0.003175 1 0C7 Rx 4 00 38 26 9B
0.003349 1 1CC Rx 4 00 00 00 00
0.003586 1 0F9 Rx 8 00 DA 40 33 D0 63 FF 1C
0.003738 1 1CF Rx 3 00 00 05
0.003976 1 711 Rx 8 00 23 00 7E FF EB FC 6F
0.004148 1 1D0 Rx 4 00 00 00 00
0.004382 1 0C1 Rx 8 30 14 F6 08 32 B4 F7 70
0.004615 1 0C5 Rx 8 31 27 F8 44 32 B0 F8 5C
0.004825 1 0BE Rx 6 00 00 4D 00 00 00
0.005051 1 0D1 Rx 7 80 00 BF FE 00 FE 00
0.005292 1 0C9 Rx 8 80 2C 5A 60 00 00 18 00
0.005538 1 1C8 Rx 8 80 00 00 00 FF FE 3F FE
0.005774 1 18E Rx 8 00 00 00 84 78 46 08 45

# Output with the difference time added (-d option)
$ python canlogfilter.py log.txt -d
0.001399 0.001399 1 1F3 Rx 3 00 10 00
0.001364 0.002763 1 1E5 Rx 8 4C 00 21 10 00 00 00 B9
0.000246 0.003009 1 710 Rx 8 00 5F 00 00 00 00 13 BE
0.000166 0.003175 1 0C7 Rx 4 00 38 26 9B
0.000174 0.003349 1 1CC Rx 4 00 00 00 00
0.000237 0.003586 1 0F9 Rx 8 00 DA 40 33 D0 63 FF 1C
0.000152 0.003738 1 1CF Rx 3 00 00 05
0.000238 0.003976 1 711 Rx 8 00 23 00 7E FF EB FC 6F
0.000172 0.004148 1 1D0 Rx 4 00 00 00 00
0.000234 0.004382 1 0C1 Rx 8 30 14 F6 08 32 B4 F7 70
0.000233 0.004615 1 0C5 Rx 8 31 27 F8 44 32 B0 F8 5C
0.000210 0.004825 1 0BE Rx 6 00 00 4D 00 00 00
0.000226 0.005051 1 0D1 Rx 7 80 00 BF FE 00 FE 00
0.000241 0.005292 1 0C9 Rx 8 80 2C 5A 60 00 00 18 00
0.000246 0.005538 1 1C8 Rx 8 80 00 00 00 FF FE 3F FE
0.000236 0.005774 1 18E Rx 8 00 00 00 84 78 46 08 45

# Output only the records whose ID matches one of the given values (-u option)
$ python canlogfilter.py log.txt -u 710 0C9 18E
0.003009 1 710 Rx 8 00 5F 00 00 00 00 13 BE
0.005292 1 0C9 Rx 8 80 2C 5A 60 00 00 18 00
0.005774 1 18E Rx 8 00 00 00 84 78 46 08 45

# Drop the records whose ID matches one of the given values and output the rest (-o option)
$ python canlogfilter.py log.txt -o 710 0C9 18E
0.001399 1 1F3 Rx 3 00 10 00
0.002763 1 1E5 Rx 8 4C 00 21 10 00 00 00 B9
0.003175 1 0C7 Rx 4 00 38 26 9B
0.003349 1 1CC Rx 4 00 00 00 00
0.003586 1 0F9 Rx 8 00 DA 40 33 D0 63 FF 1C
0.003738 1 1CF Rx 3 00 00 05
0.003976 1 711 Rx 8 00 23 00 7E FF EB FC 6F
0.004148 1 1D0 Rx 4 00 00 00 00
0.004382 1 0C1 Rx 8 30 14 F6 08 32 B4 F7 70
0.004615 1 0C5 Rx 8 31 27 F8 44 32 B0 F8 5C
0.004825 1 0BE Rx 6 00 00 4D 00 00 00
0.005051 1 0D1 Rx 7 80 00 BF FE 00 FE 00
0.005538 1 1C8 Rx 8 80 00 00 00 FF FE 3F FE

# Combining options (-u, -d)
$ python canlogfilter.py log.txt -u 710 0C9 18E -d
0.003009 0.003009 1 710 Rx 8 00 5F 00 00 00 00 13 BE
0.002283 0.005292 1 0C9 Rx 8 80 2C 5A 60 00 00 18 00
0.000482 0.005774 1 18E Rx 8 00 00 00 84 78 46 08 45
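
The combined example shows that the difference time is computed after filtering: 0.002283 is 0.005292 - 0.003009, the gap between consecutive records that survived -u, not the gap to the previous line of the original log. A minimal Python 3 sketch of that behavior follows. It uses three timestamps from the log above, and the helper name diff_times is hypothetical, not part of canlogfilter.py.

```python
def diff_times(timestamps):
    """Yield (difference, timestamp) pairs; the first difference is measured from 0."""
    prev = 0.0
    for t in timestamps:
        yield t - prev, t
        prev = t


# Timestamps of the records with IDs 710, 0C9 and 18E, taken from the log above.
picked = [0.003009, 0.005292, 0.005774]

lines = ['%f %f' % pair for pair in diff_times(picked)]
for line in lines:
    print(line)
```

The printed pairs match the first two columns of the -u ... -d output above.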

Source code

canlogfilter.py


import re
import argparse

class Record:
    def __init__(self):
        self.crtime   = 0.00000
        self.ch       = 1
        self.hexid    = 0x000
        self.dir      = "Rx"
        self.stat     = "d"
        self.dlc      = 0
        self.data     = []
        self.length   = 0
        self.bitcount = 0
        self.decid    = 0

def main():
    parser = argparse.ArgumentParser(description = 'CanlogFilter')

    parser.add_argument('inputFile',        help = 'Input file path')
    parser.add_argument('--difftime', '-d', action = 'store_const', const = True, default = False,  help = 'Print with difftime')
    parser.add_argument('--pickup',   '-u', nargs = '*', help = 'pick up records')
    parser.add_argument('--dropoff',  '-o', nargs = '*', help = 'drop off records')
    args = parser.parse_args()

    canlog = []
    canlog = parse(args.inputFile)

    if args.pickup != None and args.dropoff != None:
        print "--pickup and --dropoff, both provide"
        return -1
    elif args.pickup != None:
        canlog = pick_log(canlog, map(lambda x:int(x, 16), args.pickup))
    elif args.dropoff != None:
        canlog = drop_log(canlog, map(lambda x:int(x, 16), args.dropoff))

    if args.difftime == True:
        printlog_with_diff_time(canlog)
    else:
        printlog(canlog)

def parse(filename):
    canlog = []
    for line in open(filename, 'r'):
        fields = line.split()
        if re.match("1|2", fields[1]):
            rec = Record()
            rec.crtime   = float(fields[0])
            rec.ch       = int(fields[1], 10)
            rec.hexid    = int(fields[2], 16)
            rec.dir      = fields[3]
            rec.stat     = fields[4]
            rec.dlc      = int(fields[5], 10)
            rec.data     = map(lambda x:int(x, 16), fields[6:rec.dlc+6])
            rec.length   = int(fields[rec.dlc+8], 10)
            rec.bitcount = int(fields[rec.dlc+11], 10)
            rec.decid    = int(fields[rec.dlc+14], 10)
            canlog.append(rec)
    return canlog

def pick_log(canlog, ids):
    ret = []
    for rec in canlog:
        if rec.hexid in ids:
            ret.append(rec)
    return ret

def drop_log(canlog, ids):
    ret = []
    for rec in canlog:
        if not rec.hexid in ids:
            ret.append(rec)
    return ret

def printlog(canlog):
    for rec in canlog:
        print '%f %d %03X %s %d' % (rec.crtime, rec.ch, rec.hexid, rec.dir, rec.dlc),
        for byte in rec.data:
            print '%02X' % byte,
        print

def printlog_with_diff_time(canlog):
    prevtime = 0
    difftime = 0
    for rec in canlog:
        difftime = rec.crtime - prevtime
        print '%f %f %d %03X %s %d' % (difftime, rec.crtime, rec.ch, rec.hexid, rec.dir, rec.dlc),
        for byte in rec.data:
            print '%02X' % byte,
        print
        prevtime = rec.crtime

if __name__ == '__main__' : main()

My impression is that the standard library was sufficient and easy to use. People who write more Python may use different tools.

Summary and TODO

Source code received in comments about the TODO

From shiracamus

It seems that reviewing the class definition is on your TODO list, so I implemented it in my own way. I hope it will be helpful for your review.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import argparse

def hexint(x):
    return int(x, 16)


class Record:

    @staticmethod
    def create(line):
        fields = line.split()
        if len(fields) < 2 or fields[1] not in ('1', '2'):
            return None

        record = Record()
        record.crtime = float(fields[0])
        record.ch = int(fields[1])
        record.hexid = hexint(fields[2])
        record.dir = fields[3]
        record.stat = fields[4]
        record.dlc = int(fields[5])
        record.data = map(hexint, fields[6:record.dlc + 6])
        record.length = int(fields[record.dlc + 8])
        record.bitcount = int(fields[record.dlc + 11])
        record.decid = int(fields[record.dlc + 14])
        return record

    def __str__(self):
        # The trailing space after {dlc} separates the header from the data bytes.
        return ('{crtime} {ch} {hexid:03X} {dir} {dlc} '.format(**vars(self))
                + ' '.join('%02X' % byte for byte in self.data))


class Canlog:

    def __init__(self, records):
        self.records = list(records)

    @staticmethod
    def create(lines):
        return Canlog(record
                      for record in map(Record.create, lines)
                      if record != None)

    def pickup(self, ids):
        return Canlog(record
                      for record in self.records
                      if record.hexid in ids)

    def dropoff(self, ids):
        return Canlog(record
                      for record in self.records
                      if record.hexid not in ids)

    def print_without_diff_time(self):
        for record in self.records:
            print record

    def print_with_diff_time(self):
        prevtime = 0
        for record in self.records:
            difftime = record.crtime - prevtime
            print difftime, record
            prevtime = record.crtime


def main():
    parser = argparse.ArgumentParser(description='CanlogFilter')
    parser.add_argument('inputFile', help='Input file path')
    parser.add_argument('--difftime', '-d', action='store_const', const=True, default=False, help='Print with difftime')
    parser.add_argument('--pickup', '-u', nargs='*', help='pick up records')
    parser.add_argument('--dropoff', '-o', nargs='*', help='drop off records')

    args = parser.parse_args()
    if args.pickup != None and args.dropoff != None:
        print "--pickup and --dropoff, both provide"
        return -1

    with open(args.inputFile) as lines:
        canlog = Canlog.create(lines)

    if args.pickup != None:
        canlog = canlog.pickup(map(hexint, args.pickup))
    elif args.dropoff != None:
        canlog = canlog.dropoff(map(hexint, args.dropoff))

    if args.difftime == True:
        canlog.print_with_diff_time()
    else:
        canlog.print_without_diff_time()


if __name__ == '__main__':
    main()

He showed an example of how to organize the classes for my TODO. Beyond that, the details of indentation and the small coding techniques were also very helpful.
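
One of those small techniques is building the output line in __str__ with format(**vars(self)). A minimal Python 3 sketch of the idea follows. The class Frame is a hypothetical, cut-down stand-in for the Record class above, and passing dlc as an extra keyword is my own simplification.

```python
class Frame:  # hypothetical stand-in for Record, with fewer attributes
    def __init__(self, crtime, ch, hexid, data):
        self.crtime = crtime
        self.ch = ch
        self.hexid = hexid
        self.data = data

    def __str__(self):
        # vars(self) is the instance attribute dict, so each {name} in the
        # template is filled from the attribute of the same name.
        head = '{crtime:f} {ch} {hexid:03X} {dlc}'.format(
            dlc=len(self.data), **vars(self))
        body = ' '.join('%02X' % byte for byte in self.data)
        return head + ' ' + body


frame = Frame(0.003175, 1, 0xC7, [0x00, 0x38, 0x26, 0x9B])
text = str(frame)
print(text)  # 0.003175 1 0C7 4 00 38 26 9B
```

With __str__ defined, printing a record needs no formatting code at the call site, which is why both print methods in the listing above can stay short.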

From knoguchi

I have been using Python for a long time. I tried rewriting the script a little using Python's features. The code is over-engineered, but I offer it for your reference. https://gist.github.com/knoguchi/4fc486a0cc39c1fd256d2fb6f619ee98

Grouped the exclusive options with argparse, so you no longer have to check them with if.
Changed the hexadecimal arguments to be converted through the type parameter of argparse.
Set the default values of Record as keyword arguments.
Record objects are created with a class method.
Since dlc is the length of data, changed it to a property.
Replaced the functions that passed lists around with generators. This reduces memory consumption when processing large logs.
Replaced the processing of fields with namedtuple. Fields can now be accessed by name, so the subscripts do not need rewriting when a field is added to the log.
Replaced the filters written as for loops with filter.
Moved the display of record contents, which was done in two places, into the __str__ method of Record.
The file is closed automatically after processing by using the with context manager.
if x != None can be written as if x. The same is true of if x == True. Use is to check explicitly for None or True.

Addendum: I commented without noticing @shiracamus's post, so much of this overlaps with it.
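
The note about if x needs some care with nargs='*': when the option is given without values, argparse stores an empty list, which is falsy but is not None. So if x and if x is not None are not always interchangeable. A minimal Python 3 sketch follows, using a cut-down parser that is only an illustration, not the article's full parser.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pickup', '-u', nargs='*')

absent = parser.parse_args([]).pickup            # option not given
empty = parser.parse_args(['-u']).pickup         # option given without values
given = parser.parse_args(['-u', '710']).pickup  # option given with one value

for value in (absent, empty, given):
    # Columns: stored value, result of "if x", result of "if x is not None"
    print(repr(value), bool(value), value is not None)
```

For this script the shorter if x is a reasonable choice, because picking up or dropping an empty list of IDs has nothing to filter on anyway.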

import re
import argparse
from collections import namedtuple


class Record:
    def __init__(self, crtime=0.00000, ch=1, hexid=0x000, dir="Rx", stat="d", data=None, length=0, bitcount=0, decid=0):
        self.crtime = crtime
        self.ch = ch
        self.hexid = hexid
        self.dir = dir
        self.stat = stat
        self.data = data or []
        self.length = length
        self.bitcount = bitcount
        self.decid = decid

    @property
    def dlc(self):
        return len(self.data)

    @classmethod
    def parse_from_file(cls, input_file):
        """
        The log format is fixed header fields, variable length data, ordered key-value pairs
        header fields: crtime, ch, hexid, dir, stat, dlc
        variable data: byte * dlc
        """
        HEADER_TYPES = (
            ('crtime', float),
            ('ch', int),
            ('hexid', lambda s: int(s, 16)),
            ('dir', str),
            ('stat', str),
            ('dlc', int),
        )
        HEADER_LENGTH = len(HEADER_TYPES)
        Header = namedtuple("Header", [field for field, _ in HEADER_TYPES])

        for line in input_file:
            fields = line.split()
            if not re.match("1|2", fields[1]):
                # ignore non-data rows
                continue

            # extract header
            header_values = fields[:HEADER_LENGTH]
            header_values = [func(value) for (field, func), value in zip(HEADER_TYPES, header_values)]
            header = Header(*header_values)

            # extract data
            data = map(lambda x: int(x, 16), fields[HEADER_LENGTH:][:header.dlc])

            # extract trailer
            length = int(fields[-7])
            bitcount = int(fields[-4])
            decid = int(fields[-1])

            yield cls(
                crtime=header.crtime,
                ch=header.ch,
                hexid=header.hexid,
                dir=header.dir,
                stat=header.stat,
                data=data,
                length=length,
                bitcount=bitcount,
                decid=decid
            )

    def __str__(self):
        return '%f %d %03X %s %d %s' % (
            self.crtime, self.ch, self.hexid, self.dir, self.dlc,
            ' '.join(["%02X" % byte for byte in self.data])
        )


def main():
    parser = argparse.ArgumentParser(description='CanlogFilter')

    parser.add_argument('inputFile', help='Input file path')
    parser.add_argument('--difftime', '-d', action='store_const', const=True, default=False, help='Print with difftime')

    group = parser.add_mutually_exclusive_group()
    group.add_argument('--pickup', '-u', nargs='*', type=lambda x: int(x, 16), help='pick up records')
    group.add_argument('--dropoff', '-o', nargs='*', type=lambda x: int(x, 16), help='drop off records')

    args = parser.parse_args()

    with open(args.inputFile) as input_file:
        canlog = Record.parse_from_file(input_file)

        if args.pickup:
            canlog = pick_log(canlog, args.pickup)
        elif args.dropoff:
            canlog = drop_log(canlog, args.dropoff)

        if args.difftime:
            printlog_with_diff_time(canlog)
        else:
            printlog(canlog)


def pick_log(canlog, ids):
    return filter(lambda rec: rec.hexid in ids, canlog)


def drop_log(canlog, ids):
    return filter(lambda rec: rec.hexid not in ids, canlog)


def printlog(canlog):
    for rec in canlog:
        print rec


def printlog_with_diff_time(canlog):
    prevtime = 0
    for rec in canlog:
        difftime = rec.crtime - prevtime
        print '%f %s' % (difftime, rec)
        prevtime = rec.crtime


if __name__ == '__main__': main()

The refactored parts are as described in the comment. The use of argparse was also corrected.
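
To see what the argparse changes buy, here is a minimal Python 3 sketch that feeds argument lists straight to parse_args. The helper hexint mirrors the one in shiracamus's listing. The parser is a cut-down illustration without the inputFile and --difftime arguments.

```python
import argparse


def hexint(text):
    """Convert a hexadecimal string such as '0C9' to an int."""
    return int(text, 16)


parser = argparse.ArgumentParser(description='CanlogFilter')
group = parser.add_mutually_exclusive_group()
group.add_argument('--pickup', '-u', nargs='*', type=hexint)
group.add_argument('--dropoff', '-o', nargs='*', type=hexint)

# type=hexint is applied to every value, so the IDs arrive as ints.
args = parser.parse_args(['-u', '710', '0C9', '18E'])
print(args.pickup)  # [1808, 201, 398]

# Giving both options makes argparse print a usage error and exit,
# so the script no longer needs its own "both provided" check.
try:
    parser.parse_args(['-u', '710', '-o', '18E'])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)  # True
```

The converted values 1808, 201 and 398 are the same numbers that appear in the ID = field of the corresponding log lines.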

One thing is common to both versions: where I had written the Perl-like for line in open(filename), they use the with ... as ... statement. To be honest, that statement was new to me. Realizing that Python has syntax I do not know yet made me want to learn to write code the Python way.
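
A minimal Python 3 sketch of what the with statement guarantees: the file is closed as soon as the block ends, even if an exception is raised inside it. The temporary file and its two lines are made up for the example.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as out:
    out.write('0.001399 1 1F3\n0.002763 1 1E5\n')

with open(path) as log:
    ids = [line.split()[2] for line in log]
    closed_inside = log.closed   # False: still open inside the block

closed_after = log.closed        # True: closed when the block ended
os.remove(path)

print(ids, closed_inside, closed_after)
```

With for line in open(filename), closing the file is left to the interpreter, whereas with makes the point of closing explicit.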

I learned that even for text processing, different people write Python in different ways. I will keep writing more.
