[PYTHON] Create a binary data parser using Kaitai Struct

Introduction

LOCAL Student Department Advent Calendar Day 11

Everyone else's posts looked so hardcore that I started to worry mine would be too weak, so I decided to throw a curveball instead. As far as I could find, there is no Japanese-language material on this topic, so anyone who looks into it in the future will almost inevitably land on this article. One request to all of you: if you find yourself thinking "What is this person talking about? Isn't this full of mistakes?", please leave a comment. I will do my best to fix it.

What is Kaitai Struct?

Overview

Official: kaitai.io

Kaitai Struct is a declarative language used to describe binary data structures.

The source code of a binary data parser can be generated automatically from a data structure description written in this language.

Supported languages (as of December 2, 2019)
  • C++ / STL
  • C#
  • Go (entry-level support)
  • Java
  • JavaScript
  • Lua
  • Perl
  • PHP
  • Python
  • Ruby

License: the Compiler and Visualizer described later are GPL v3+, and the runtime libraries for each language are MIT (the JavaScript one is Apache v2). Does this mean that the GPL propagates to source code generated with the Compiler ...? If anyone knows the details, please tell me.

Installation

Kaitai Struct Compiler (KSC)

For details on installation, see here (http://kaitai.io/#download). On a Mac, brew install kaitai-struct-compiler does it in one shot. For Windows, go to the link above and download the installer.

Debian / Ubuntu-based distributions can install packages from the official .deb repository.
# Import GPG key, if you never used any BinTray repos before
sudo apt-key adv --keyserver hkp://pool.sks-keyservers.net --recv 379CE192D401AB61

# Add stable repository
echo "deb https://dl.bintray.com/kaitai-io/debian jessie main" | sudo tee /etc/apt/sources.list.d/kaitai.list
# ... or unstable repository
echo "deb https://dl.bintray.com/kaitai-io/debian_unstable jessie main" | sudo tee /etc/apt/sources.list.d/kaitai.list

sudo apt-get update
sudo apt-get install kaitai-struct-compiler

If you are using another OS, clone the source from here (https://github.com/kaitai-io/kaitai_struct_compiler) and build it.

Kaitai Struct Visualizer (KSV)

This is a simple visualizer for .ksy files. It is written in Ruby and available as a gem package.

gem install kaitai-struct-visualizer

(Git repository)

Trying it out

For well-known file formats, .ksy files are available in the official GitHub repository (https://github.com/kaitai-io/kaitai_struct_formats). (If you want to use a .ksy file from there, check the license given under meta / license in that file.) If you write a new .ksy, consider sending a pull request (see kaitai_struct_formats/CONTRIBUTING.md).

Example) matrix

Save to file (np.array)

matrix.py


import numpy as np
import struct


def create_header(*mats: np.ndarray, magic: bytes = b'') -> bytes:
    # Layout: magic, matrix count (u2), then shape0/shape1/offset per matrix.
    header = magic
    header += struct.pack('<H', len(mats))
    length = len(header) + 8 * len(mats)
    for mat in mats:
        header += struct.pack('<HH', mat.shape[0], mat.shape[1])
        header += struct.pack('<I', length)
        length += 4 * mat.shape[0] * mat.shape[1]
    return header


mat1 = np.random.randint(-1024, 1024, [3, 3], dtype=np.int32)
mat2 = np.random.randint(-1024, 1024, [5, 9], dtype=np.int32)
mat3 = np.random.randint(-1024, 1024, [2, 2], dtype=np.int32)

with open('test.matrix', 'wb') as o:
    magic = b'MAT\x01\x02/'  # same 6 bytes as the magic in matrix.ksy and the hex dump
    o.write(create_header(mat1, mat2, mat3, magic=magic))
    for mat in [mat1, mat2, mat3]:
        for y in mat:
            for x in y:
                o.write(struct.pack('<i', x))

I'm going to use KS to load the `test.matrix` generated by the above code.

test.matrix


  Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 	
00000000: 4D 41 54 01 02 2F 03 00 03 00 03 00 20 00 00 00    MAT../..........
00000010: 05 00 09 00 44 00 00 00 02 00 02 00 F8 00 00 00    ....D.......x...
00000020: DC FE FF FF 49 01 00 00 A7 FF FF FF 17 02 00 00    \~..I...'.......
00000030: 25 FC FF FF 35 FF FF FF B5 00 00 00 CF FE FF FF    %|..5...5...O~..
00000040: E2 FF FF FF 5D 00 00 00 15 FE FF FF 30 FC FF FF    b...]....~..0|..
00000050: 4C 03 00 00 C1 FF FF FF B0 FD FF FF 31 02 00 00    L...A...0}..1...
00000060: 54 03 00 00 C4 FF FF FF 65 FF FF FF D0 FE FF FF    T...D...e...P~..
00000070: 75 01 00 00 DE FE FF FF ED 00 00 00 ED FC FF FF    u...^~..m...m|..
00000080: BE FD FF FF E5 02 00 00 EC FE FF FF 22 FE FF FF    >}..e...l~.."~..
00000090: C3 02 00 00 11 00 00 00 29 03 00 00 00 01 00 00    C.......).......
000000a0: 78 00 00 00 C4 FC FF FF 4C 02 00 00 88 00 00 00    x...D|..L.......
000000b0: 43 FF FF FF 35 FF FF FF A4 00 00 00 CF 02 00 00    C...5...$...O...
000000c0: 3A FF FF FF 33 FF FF FF BD FE FF FF F9 01 00 00    :...3...=~..y...
000000d0: 22 FF FF FF 3A 02 00 00 7C 00 00 00 15 FF FF FF    "...:...|.......
000000e0: D8 FE FF FF 42 00 00 00 82 02 00 00 24 02 00 00    X~..B.......$...
000000f0: 8A FE FF FF AF FF FF FF EF 02 00 00 96 01 00 00    .~../...o.......
00000100: 83 01 00 00 2F 02 00 00  

From the beginning, the file is laid out as follows:

  1. Magic bytes b'MAT\x01\x02/'
  2. Number of matrices (2 bytes)
  3. Shape and offset of each matrix ((8 * number of matrices) bytes)
  4. Matrix bodies
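
The offsets stored in step 3 follow directly from these sizes. As a quick check, here is a small sketch that recomputes them for the three matrices in test.matrix. The `layout` helper and the constant names are made up for this illustration, and the 4-byte element size is taken from the int32 arrays in matrix.py.

```python
MAGIC_LEN = 6   # len(b'MAT\x01\x02/')
COUNT_LEN = 2   # number of matrices (u2)
ENTRY_LEN = 8   # shape0 (u2) + shape1 (u2) + offset (u4)
ELEM_LEN = 4    # each element is a 4-byte signed integer


def layout(shapes):
    """Return the byte offset of each matrix body and the total file size."""
    offset = MAGIC_LEN + COUNT_LEN + ENTRY_LEN * len(shapes)
    offsets = []
    for rows, cols in shapes:
        offsets.append(offset)
        offset += ELEM_LEN * rows * cols
    return offsets, offset


offsets, total = layout([(3, 3), (5, 9), (2, 2)])
print([hex(o) for o in offsets], hex(total))
```

The result, 0x20, 0x44 and 0xf8 with a total size of 0x108, agrees with the offset fields and the length of the hex dump.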

Let's write this in matrix.ksy.

A KSY (Kaitai Struct YAML) file declares a single user-defined type (a literal translation of the official documentation).

User-defined type

meta

meta


meta:
  id: matrix
  endian: le

meta / id gives the name of the user-defined type being described. It is required in every .ksy file. meta / endian sets the default endianness used in the structure (le / be).
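
To see what the endian setting changes, here is a small sketch using Python's standard struct module (not Kaitai Struct itself). It decodes the two bytes at offset 0x06 of test.matrix, which hold the number of matrices, both ways.

```python
import struct

raw = bytes([0x03, 0x00])  # bytes 0x06-0x07 of test.matrix

little = struct.unpack('<H', raw)[0]  # little-endian, as with endian: le
big = struct.unpack('>H', raw)[0]     # big-endian, as with endian: be
print(little, big)
```

Only the little-endian reading gives 3, the actual number of matrices; the big-endian reading gives 768.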

seq

seq


seq:
  - id: magic
    contents: ['MAT', 1, 0x2, '/']
  - id: header_num
    type: u2
  - id: headers
    repeat: expr
    repeat-expr: header_num
    type: header

Describe the data structure in `seq`. `id` is the variable name. If the data is a constant, write the constant in `contents`. If you want to read a value, give its data type in `type` (see here for details (https://doc.kaitai.io/ksy_reference.html#primitive-data-types)). You can also use the types defined in `types`, described later; here, the `header` type is used. `repeat` can be one of `expr`, `eos`, or `until` (see here for details). If you use `expr`, put the number of repetitions in `repeat-expr`.
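
For comparison, this is roughly what the seq above asks the parser to do, written by hand in plain Python. It is only a sketch, not the code ksc generates: the `read_seq` function is a name made up here, and the three fields read per header are those of the `header` type defined under types.

```python
import io
import struct

MAGIC = b'MAT\x01\x02/'


def read_seq(stream):
    """Read the fields in the same order as the seq section."""
    magic = stream.read(len(MAGIC))
    if magic != MAGIC:  # contents: the bytes must match exactly
        raise ValueError('unexpected magic: %r' % magic)
    (header_num,) = struct.unpack('<H', stream.read(2))  # type: u2
    headers = []
    for _ in range(header_num):  # repeat: expr with repeat-expr: header_num
        shape0, shape1, offset = struct.unpack('<HHI', stream.read(8))
        headers.append((shape0, shape1, offset))
    return header_num, headers


# First 32 bytes of test.matrix, copied from the hex dump.
data = bytes.fromhex(
    '4d4154 01022f 0300'
    '03000300 20000000'
    '05000900 44000000'
    '02000200 f8000000'
)
print(read_seq(io.BytesIO(data)))
```

Each line of the hand-written reader corresponds to one entry in seq, which is the boilerplate that the declarative description saves you from writing.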

types

types


types:
  header:
    seq:
      - id: shape0
        type: u2
      - id: shape1
        type: u2
      - id: offset
        type: u4
    instances: 
      mat_body:
        pos: offset
        io: _root._io
        type: matrix

  matrix:
    seq:
      - id: dim0
        repeat: expr
        repeat-expr: _parent.shape0
        type: dim1
    types:
      dim1:
        seq:
          - id: dim1
            repeat: expr
            repeat-expr: _parent._parent.shape1
            type: s4

User-defined types can be nested in `types`. The `header` type uses `instances`, which can be used to read data that does not sit in sequence the way `seq` fields do.

header.instances


instances: 
  mat_body:
    pos: offset
    io: _root._io
    type: matrix

Usage is very similar to `seq`. The `id` here is `mat_body`. `io` is the IO stream to use. `pos` is the number of bytes from the beginning of `io`. `type` is the same as for `seq`.
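
A rough idea of what an instance does can be sketched in plain Python. This is a hand-written stand-in, not the code ksc generates; the `Header` class and the tiny test data are made up for illustration. As I understand it, an instance is parsed only when it is first accessed, and the stream position is saved and restored around the jump to `pos`.

```python
import io
import struct


class Header:
    """Hand-written stand-in for the header type (not generated code)."""

    def __init__(self, stream, root_stream):
        self._root_io = root_stream
        self.shape0, self.shape1, self.offset = struct.unpack('<HHI', stream.read(8))
        self._mat_body = None

    @property
    def mat_body(self):
        # Parsed on first access only, like an entry under instances.
        if self._mat_body is None:
            saved = self._root_io.tell()     # remember the current position
            self._root_io.seek(self.offset)  # pos: offset
            count = self.shape0 * self.shape1
            flat = struct.unpack('<%di' % count, self._root_io.read(4 * count))
            self._root_io.seek(saved)        # restore so sequential reads are unaffected
            self._mat_body = [list(flat[r * self.shape1:(r + 1) * self.shape1])
                              for r in range(self.shape0)]
        return self._mat_body


# A tiny made-up file: one 8-byte header followed by a 2x2 matrix at offset 8.
data = struct.pack('<HHI', 2, 2, 8) + struct.pack('<4i', 1, -2, 3, -4)
stream = io.BytesIO(data)
header = Header(stream, stream)
print(stream.tell(), header.mat_body, stream.tell())
```

The stream position is the same before and after reading mat_body, which is why instances do not disturb the fields read in seq.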

About variables

Some fields (here `repeat-expr`, `pos`, and `io`) can reference variables as well as constant values. You cannot reference data that has not been read yet. The data forms a tree (this is easy to see with ksv), and you can refer to the parent element with `_parent`. You can also refer to the top-level element with `_root`.
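
The tree navigation can be pictured with a few plain Python objects. The `Node` class below is made up for this illustration and only mimics the `_parent` and `_root` attributes; the field values are taken from the second matrix in test.matrix.

```python
class Node:
    """Minimal stand-in for a parsed object; mirrors _parent and _root."""

    def __init__(self, name, parent=None, **fields):
        self.name = name
        self._parent = parent
        # The root of the tree is its own _root; children inherit it.
        self._root = parent._root if parent is not None else self
        for key, value in fields.items():
            setattr(self, key, value)


root = Node('matrix_file', header_num=3)
header = Node('header', parent=root, shape0=5, shape1=9)
body = Node('matrix', parent=header)
row = Node('dim1', parent=body)

# From inside a row, the column count lives two levels up, in the header.
print(row._parent._parent.shape1, row._root.header_num)
```

A row sits two levels below the header that owns the shape fields, which is why an expression inside the row type has to write `_parent._parent`.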

Visualize

At this point, you have written the following code.

matrix.ksy


meta:
  id: matrix
  endian: le

seq:
  - id: magic
    contents: ['MAT', 1, 0x2, '/']
  - id: header_num
    type: u2
  - id: headers
    repeat: expr
    repeat-expr: header_num
    type: header

types:
  header:
    seq:
      - id: shape0
        type: u2
      - id: shape1
        type: u2
      - id: offset
        type: u4
    instances: 
      mat_body:
        pos: offset
        io: _root._io
        type: matrix

  matrix:
    seq:
      - id: dim0
        repeat: expr
        repeat-expr: _parent.shape0
        type: dim1
    types:
      dim1:
        seq:
          - id: dim1
            repeat: expr
            repeat-expr: _parent._parent.shape1
            type: s4

Let's visualize this using ksv (Kaitai Struct Visualizer). The usage is ksv <file_to_parse.bin> <format.ksy>.

shell


$ ksv test.matrix matrix.ksy

ksv


[-] [root]                              00000000: 4d 41 54 01 02 2f 03 00 03 00 03 00 20 00 00 00 | MAT../...... ...
  [.] magic = 4d 41 54 01 02 2f         00000010: 05 00 09 00 44 00 00 00 02 00 02 00 f8 00 00 00 | ....D...........
  [.] header_num = 3                    00000020: dc fe ff ff 49 01 00 00 a7 ff ff ff 17 02 00 00 | ....I...........
  [-] headers (3 = 0x3 entries)         00000030: 25 fc ff ff 35 ff ff ff b5 00 00 00 cf fe ff ff | %...5...........
    [-] 0                               00000040: e2 ff ff ff 5d 00 00 00 15 fe ff ff 30 fc ff ff | ....].......0...
      [.] shape0 = 3                    00000050: 4c 03 00 00 c1 ff ff ff b0 fd ff ff 31 02 00 00 | L...........1...
      [.] shape1 = 3                    00000060: 54 03 00 00 c4 ff ff ff 65 ff ff ff d0 fe ff ff | T.......e.......
      [.] offset = 32                   00000070: 75 01 00 00 de fe ff ff ed 00 00 00 ed fc ff ff | u...............
      [-] mat_body                      00000080: be fd ff ff e5 02 00 00 ec fe ff ff 22 fe ff ff | ............"...
        [-] dim0 (3 = 0x3 entries)      00000090: c3 02 00 00 11 00 00 00 29 03 00 00 00 01 00 00 | ........).......
          [-] 0                         000000a0: 78 00 00 00 c4 fc ff ff 4c 02 00 00 88 00 00 00 | x.......L.......
            [-] dim1 (3 = 0x3 entries)  000000b0: 43 ff ff ff 35 ff ff ff a4 00 00 00 cf 02 00 00 | C...5...........
              [.] 0 = -292              000000c0: 3a ff ff ff 33 ff ff ff bd fe ff ff f9 01 00 00 | :...3...........
              [.] 1 = 329               000000d0: 22 ff ff ff 3a 02 00 00 7c 00 00 00 15 ff ff ff | "...:...|.......
              [.] 2 = -89               000000e0: d8 fe ff ff 42 00 00 00 82 02 00 00 24 02 00 00 | ....B.......$...
          [-] 1                         000000f0: 8a fe ff ff af ff ff ff ef 02 00 00 96 01 00 00 | ................
            [-] dim1 (3 = 0x3 entries)  00000100: 83 01 00 00 2f 02 00 00                         | ..../...        
              [.] 0 = 535
              [.] 1 = -987
              [.] 2 = -203
          [-] 2
            [+] dim1
    [-] 1
      [.] shape0 = 5
      [.] shape1 = 9
      [.] offset = 68
      [-] mat_body
        [+] dim0
    [+] 2

It looks like the file is parsed correctly.

File decompression

This is the main topic. I created a compressed file format in the article here. This time, I will decompress such a file using KS. See that article for the structure of the file.

mcp.ksy


meta:
    id: mcp
    encoding: UTF-8
    endian: le

seq:
  - id: file
    type: file
    repeat: eos
types:
    file:
        seq:
          - id: filename_len
            type: u4
          - id: filebody_len
            type: u4
          - id: filename
            type: str
            size: filename_len
          - id: filebody
            size: filebody_len
            process: zlib

meta / encoding specifies the default encoding used with type: str. repeat: eos repeats until the end of the stream. process: zlib decompresses the data that was read with zlib. (Very convenient.)
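
To make the layout concrete, here is a sketch that writes and reads the same structure by hand with the standard library. The `pack_entry` and `unpack_all` helpers are names made up for this illustration, and the packer is my reconstruction from mcp.ksy rather than the original compression script. Note that filebody_len is the length of the compressed bytes, since size is applied before process.

```python
import io
import struct
import zlib


def pack_entry(filename, body):
    """Serialize one file entry in the layout mcp.ksy describes."""
    name = filename.encode('utf-8')
    compressed = zlib.compress(body)
    # filename_len (u4), filebody_len (u4), then the two variable-length fields.
    return struct.pack('<II', len(name), len(compressed)) + name + compressed


def unpack_all(data):
    """Read entries until the end of the data, like repeat: eos."""
    stream = io.BytesIO(data)
    entries = []
    while stream.tell() < len(data):
        name_len, body_len = struct.unpack('<II', stream.read(8))
        name = stream.read(name_len).decode('utf-8')
        body = zlib.decompress(stream.read(body_len))  # process: zlib
        entries.append((name, body))
    return entries


archive = pack_entry('a.txt', b'hello') + pack_entry('dir/b.bin', b'\x00' * 100)
print(unpack_all(archive))
```

The round trip returns the original names and bodies, which is the behaviour the generated parser should reproduce.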

Generate code from mcp.ksy using ksc (Kaitai Struct Compiler).

usage


Usage: kaitai-struct-compiler [options] <file>...

  <file>...                source files (.ksy)
  -t, --target <language>  target languages (graphviz, csharp, all, perl, java, go, cpp_stl, php, lua, python, ruby, javascript)
  -d, --outdir <directory>
                           output directory (filenames will be auto-generated)
  -I, --import-path <directory>:<directory>:...
                           .ksy library search path(s) for imports (see also KSPATH env variable)
  --go-package <package>   Go package (Go only, default: none)
  --java-package <package>
                           Java package (Java only, default: root package)
  --java-from-file-class <class>
                           Java class to be invoked in fromFile() helper (default: io.kaitai.struct.ByteBufferKaitaiStream)
  --dotnet-namespace <namespace>
                           .NET Namespace (.NET only, default: Kaitai)
  --php-namespace <namespace>
                           PHP Namespace (PHP only, default: root package)
  --python-package <package>
                           Python package (Python only, default: root package)
  --opaque-types <value>   opaque types allowed, default: false
  --ksc-exceptions         ksc throws exceptions instead of human-readable error messages
  --ksc-json-output        output compilation results as JSON to stdout
  --verbose <value>        verbose output
  --debug                  enable debugging helpers (mostly used by visualization tools)
  --help                   display this help and exit
  --version                output version information and exit

shell


$ ksc -t python mcp.ksy

mcp.py


# This is a generated file! Please edit source .ksy file and use kaitai-struct-compiler to rebuild

from pkg_resources import parse_version
from kaitaistruct import __version__ as ks_version, KaitaiStruct, KaitaiStream, BytesIO
import zlib


if parse_version(ks_version) < parse_version('0.7'):
    raise Exception("Incompatible Kaitai Struct Python API: 0.7 or later is required, but you have %s" % (ks_version))

class Mcp(KaitaiStruct):
    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        self._read()

    def _read(self):
        self.file = []
        i = 0
        while not self._io.is_eof():
            self.file.append(self._root.File(self._io, self, self._root))
            i += 1


    class File(KaitaiStruct):
        def __init__(self, _io, _parent=None, _root=None):
            self._io = _io
            self._parent = _parent
            self._root = _root if _root else self
            self._read()

        def _read(self):
            self.filename_len = self._io.read_u4le()
            self.filebody_len = self._io.read_u4le()
            self.filename = (self._io.read_bytes(self.filename_len)).decode(u"UTF-8")
            self._raw_filebody = self._io.read_bytes(self.filebody_len)
            self.filebody = zlib.decompress(self._raw_filebody)

This generates mcp.py, shown above. Let's use it to write a decompression script.

extract.py


from mcp import Mcp
import os
import sys

mcps = Mcp.from_file(sys.argv[1])
out = 'output/'
if len(sys.argv) >= 3:
    out = sys.argv[2]

for f in mcps.file:
    dest = os.path.join(out, f.filename)
    # Always create the destination directory, including `out` itself.
    os.makedirs(os.path.dirname(dest) or '.', exist_ok=True)
    with open(dest, 'wb') as o:
        o.write(f.filebody)

You can decompress an archive with python extract.py <target.mcp> [output_folder].

To read from a file, use KaitaiStruct.from_file(file_path). To parse a byte string directly, use KaitaiStruct.from_bytes(bytes). For IO streams, use KaitaiStruct.from_io(io). These are class methods, so call them on the generated class, as with Mcp.from_file above.

In closing

I think KS is quite convenient. It is easy to write and works with your favorite language, so the learning cost is very low. Honestly, the official reference is hard to read, but more people like me will (probably) write articles about KS from now on.

Why not try some "disassembling" of your own with KS?
