[High School Curriculum Guidelines Information I] Teaching materials for teacher training: Implementation of Huffman method in python

Introduction

In high school, the curriculum guidelines have been revised, and the subject "Information" has also changed significantly. The biggest change is to learn programming from high school. Regarding this, the Ministry of Education, Culture, Sports, Science and Technology has released teaching materials for teacher training free of charge. In this article, I would like to write supplementary information related to programming that will be learned in high school in the future.

Various articles have already been written by our ancestor engineers, so please check the links below. [Unraveling the 2020 Course of Study Subject "Information"](https://qiita.com/woadachi/items/75cc334b10e22e4c5cad "Unraveling the 2020 Course of Study" Information ") Python of the Ministry of Education is not Python [Full-time faculty member of the subject "Information" and future dreams "Current status of IT engineer / programmer"](https://qiita.com/woadachi/items/fdcb67bed4d496d5823e "Full-time faculty member of the subject" Information "and future dream" IT engineer " ・ Current status of "programmers" ")

Teaching materials

[High School Information Department "Information I" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/1416756.htm "High School Information Department "Information I" teaching materials for teacher training (main part): Ministry of Education, Culture, Sports, Science and Technology ") Chapter 2 Communication and Information Design (PDF: 6892KB) PDF

environment

In the new course of study, "Information I" mainly adds programming, and "Information II" mainly adds data science fields. Considering that, there are many explanations by code in python, so this article also uses python. In addition, it is assumed that Google Colab, which makes it easy to build an environment, is used.

Overview

This time, we will use "Information I" for teacher training materials, "Chapter 2 Communication and Information Design". In accordance with the implicit policy of Information I, when programming with python, we will confirm and supplement the elements that are not written in the teaching materials. Specifically, on pages 61-63 of the teaching material "Chapter 2 Communication and Information Design"

(9) File compression-Huffman method

If you check the place, you can see the explanation of the Huffman algorithm, but the ** essential programming code ** is not included. So I would like to implement python code.

Other references, etc.

Huffman coding with Python3 I implemented Huffman coding in Python [Huffman Code-Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%8F%E3%83%95%E3%83%9E%E3%83%B3%E7%AC%A6% E5% 8F% B7 "Huffman code-Wikipedia")

Huffman method (character string: about intelligence)

Step 1

Commentary

Arrange the characters in order of appearance count

Implementation

Let's use the python dictionary type and arrange them in order of the number of occurrences.


from collections import Counter
target_string = 'intelligence'

print('{0}Check the number of occurrences of the character' .format(target_string))
list_target_string = list(target_string)
counter_target_string = Counter(target_string)
dict_target_string = dict(counter_target_string)
print(dict_target_string)

Output result

Check the number of occurrences of the intelligence character
{'i': 2, 'n': 2, 't': 1, 'e': 3, 'l': 2, 'g': 1, 'c': 1}

Step 2

Commentary

Connect two characters with the least number of appearances and write the total number of appearances on it. Since the selected characters have already been added to the total, select the two characters with the least number of appearances from the total of the other 5 characters and the total number of characters in ②, enter the total, and all the characters are connected by the same procedure. Continue until the total is intelligence (12).

Implementation

class Node:
    def __init__(self, key_char=None, count=None, left=None, right=None):
        self.key_char = key_char
        self.left = left
        self.right = right
        self.count = count

nodes = []

for k, v in dict_target_string.items():
    node_obj = Node(key_char = k, count = v)
    nodes.append(node_obj)

trial_count = 0
while len(nodes) >= 2:
    #Sort nodes in descending order
    nodes = sorted(nodes, key=lambda x: x.count, reverse=True)

    #Number of branch creations
    trial_count += 1

    #Extract the nodes with the smallest number of occurrences
    left = nodes.pop()
    #Extract the node with the second smallest number of occurrences
    right = nodes.pop()

    #Add the number of occurrences of left and right to nodes
    count = left.count + right.count
    append_node = Node(key_char = left.key_char + right.key_char, count = count, left = left, right = right)
    nodes.append(append_node)

    print("Number of enforcement:", trial_count)
    print(left.key_char, right.key_char,  count)
    print("---------------")

Output result

Number of enforcement: 1
c g 2
---------------
Number of enforcement: 2
t cg 3
---------------
Number of enforcement: 3
l n 4
---------------
Number of enforcement: 4
i tcg 5
---------------
Number of enforcement: 5
e ln 7
---------------
Number of enforcement: 6
itcg eln 12
---------------

Step 3

Commentary

Put 0 on the left and 1 on the right of all the connected parts. Pick up the 0s and 1s in the spliced part from the top and assign the compressed bit string to each character.

Implementation

#Get the compression result
def recursive_code(node, s, temp_encode_dict):
    if len(node.key_char) == 1:
        temp_encode_dict[node.key_char] = s
        return
    recursive_code(node.left, s + "0", temp_encode_dict)
    recursive_code(node.right, s + "1", temp_encode_dict)

encode_dict = {}
tree = nodes[0]
recursive_code(tree, "", encode_dict)
print(encode_dict)

Output result

{'i': '00', 't': '010', 'c': '0110', 'g': '0111', 'e': '10', 'l': '110', 'n': '111'}

Compression rate

Commentary

If "intelligence" is expressed as a compressed bit string using the created Huffman code, it will be 33 bits. If one character of "intelligence" was represented by 3 bits, it was 36 bits, which means that it was compressed to about 92%.

Implementation

print(target_string,'Compression rate comparison')
bitlength = len(str(bin(len(dict_target_string) - 1))) - 2
before_size = target_len * bitlength
print('Size before compression')
print(target_len,'*',bitlength ,'=', before_size, 'bits')

encode_bits_string = ''
for v in list_target_string:
    encode_bits_string += encode_dict[v]
print('Bit string after compression')
print(encode_bits_string)
print('Compressed size')
print(len(encode_bits_string),'bits')

Output result

Intelligence compression ratio comparison
Size before compression
12 * 3 = 36 bits
Bit string after compression
001110101011011000011110111011010
Compressed size
33 bits

Comparison / correction of implementation results with teaching materials

The explanation of the teaching materials is as follows.

SnapCrab_NoName_2020-7-19_19-52-2_No-00.png SnapCrab_NoName_2020-7-19_19-57-8_No-00.png


Let's compare it with the output result implemented here.

Check the number of occurrences of the intelligence character
{'i': 2, 'n': 2, 't': 1, 'e': 3, 'l': 2, 'g': 1, 'c': 1}
{'i': '00', 't': '010', 'c': '0110', 'g': '0111', 'e': '10', 'l': '110', 'n': '111'}

**···that? ** ** ** The output result is different. ** ** Since the method of generating branches and the method of distributing branches to the left and right are the contents that explain only the logic ignoring the implementation, this does not result as the teaching material. The following points should be noted.

--When arranging characters by the number of occurrences, it is necessary to arrange them in a stable sort. --Select two items that appear less frequently, but select items that have the same number of alphabets as much as possible. (Priority is given to the same number of connected characters in the first alphabet as the second one) --As for how to specify the left and right of the branch, the one with the smaller number of connected alphabets is assigned to the left, the one that is not assigned to the right is assigned to the right, and if not, the one acquired first is assigned to the right and the second one. The acquired one is on the left.

Therefore, implement according to this condition.

Step 2'

Implementation


class Node:
    def __init__(self, key_char=None, count=None, left=None, right=None):
        self.key_char = key_char
        self.left = left
        self.right = right
        self.count = count

nodes = []
encode_dict = {}

for k, v in dict_target_string.items():
    node_obj = Node(key_char = k, count = v)
    nodes.append(node_obj)

trial_count = 0
while len(nodes) >= 2:
    #Sort nodes in descending order
    nodes = sorted(nodes, key=lambda x: x.count, reverse=True)

    #Number of branch creations
    trial_count += 1

    #Extract the nodes with the smallest number of occurrences
    first = nodes.pop()
    #Extract the node with the second smallest number of occurrences
    second = nodes.pop()

    #temp_nodes
    temp_nodes = []

    #Determine if the first and second alphabetic combinations are the same
    if (len(first.key_char) != len(second.key_char)):
        print('The number of alphabet combinations of first and second is different')

        #When there is still one or more nodes, get the third and subsequent nodes
        while len(nodes) >= 1:
            overthird = nodes.pop()
            if (second.count == overthird.count):
                print('The number of occurrences of second and over third match')
                if (len(first.key_char) == len(overthird.key_char)):
                    print('The number of alphabetic combinations of first and overthird match')
                    nodes.append(second)
                    second = overthird
                    break
                else:
                  temp_nodes.append(overthird)
            else:
                nodes.append(overthird)
                break
    
    #Return what was popped
    nodes += temp_nodes

    #Add the number of occurrences of first and second to nodes
    count = first.count + second.count

    print("Number of enforcement:", trial_count)

    if len(first.key_char) < len(second.key_char): 
        append_node = Node(key_char = first.key_char + second.key_char, count = count, left = first, right = second)
    else:
        append_node = Node(key_char = second.key_char + first.key_char, count = count, left = second, right = first) 

    print(append_node.left.key_char, append_node.right.key_char,  count)

    nodes.insert(0, append_node)
    print("---------------")

~~ It became a uselessly redundant process. To fucking code at once ~~

Output result

Number of enforcement: 1
g c 2
---------------
Number of enforcement: 2
l t 3
---------------
Number of enforcement: 3
i n 4
---------------
The number of alphabet combinations of first and second is different
The number of occurrences of second and over third match
The number of alphabetic combinations of first and overthird match
Number of enforcement: 4
lt gc 5
---------------
The number of alphabet combinations of first and second is different
Number of enforcement: 5
e in 7
---------------
The number of alphabet combinations of first and second is different
Number of enforcement: 6
ein ltgc 12
---------------

Correction result

Result of executing step 3 again

{'e': '00', 'i': '010', 'n': '011', 'l': '100', 't': '101', 'g': '110', 'c': '111'}

It was in agreement with the explanation of the teaching materials. By the way, if you compare the compression ratio,

Intelligence compression ratio comparison
Size before compression
12 * 3 = 36 bits
Bit string after compression
010011101001001000101100001111100
Compressed size
33 bits

Source code

https://gist.github.com/ereyester/68ef4653890f0823e81e99dc00d73a07

Recommended Posts

[High School Curriculum Guidelines Information I] Teaching materials for teacher training: Implementation of Huffman method in python
[High School Information Department Information I / Information II] Summary of teaching materials for teacher training by python
[High School Information Department Information I] Teaching materials for teacher training: Data format and visualization (python)
Classification by k-nearest neighbor method (kNN) by python ([High school information department information II] teaching materials for teacher training)
Data analysis by clustering using k-means method (python) ([High school information department information II] teaching materials for teacher training)
Binary classification by decision tree by python ([High school information department information II] teaching materials for teacher training)
Principal component analysis with python (Scikit-learn version, pandas & numpy version) ([High school information department information II] teaching materials for teacher training)
Object detection using YOLO (python) ([High School Information Department Information II] Teacher training materials)
[High School Information Department Information I / Information II] Summary of teaching materials for teacher training by python
Binary classification by decision tree by python ([High school information department information II] teaching materials for teacher training)
Classification by k-nearest neighbor method (kNN) by python ([High school information department information II] teaching materials for teacher training)
Data analysis by clustering using k-means method (python) ([High school information department information II] teaching materials for teacher training)
[High School Information Department Information I] Teaching materials for teacher training: Data format and visualization (python)
Principal component analysis with python (Scikit-learn version, pandas & numpy version) ([High school information department information II] teaching materials for teacher training)
Object detection using YOLO (python) ([High School Information Department Information II] Teacher training materials)
[High School Curriculum Guidelines Information I] Teaching materials for teacher training: Implementation of Huffman method in python
[High School Information Department] Information I / Information II Reiwa 3rd year supplementary teaching materials Exercise examples
Web teaching materials for learning Python
[High School Information Department] Information I / Information II Reiwa 3rd year supplementary teaching materials Exercise examples
List method argument information for classes and modules in Python
Implementation of quicksort in Python
[Fundamental Information Technology Engineer Examination] I wrote an algorithm for the maximum value of an array in Python.
Web teaching materials for learning Python
Implementation of life game in Python
Implementation of original sorting in Python