Basics of Quantum Information Theory: Data Compression (1)

\def\bra#1{\mathinner{\left\langle{#1}\right|}} \def\ket#1{\mathinner{\left|{#1}\right\rangle}} \def\braket#1#2{\mathinner{\left\langle{#1}\middle|#2\right\rangle}}

Introduction

In the article before last, I took up the "Holevo bound" and explained the limits of a quantum communication channel. The setting there was a "classical-quantum-classical" channel: classical information is encoded into qubits, transmitted, and decoded back into classical information by a POVM measurement on the receiving side. This time I would like to talk about transmitting a quantum state itself as the "information". In classical Shannon information theory there is a famous theorem which says that once the nature (probability distribution) of the information source is known, the minimum communication resource required is determined, and the quantum version of that theorem was proved by Schumacher. I would like to work my way up to understanding it. Since this will get a little long, I will split it into two parts. In this first part I explain, as prerequisite knowledge, "Shannon's coding theorem in a noise-free channel" from classical information theory, and in the second part I will explain the main subject, "Schumacher's coding theorem in a noise-free quantum channel". Each time, once the theory is understood, I will simulate the behavior of the corresponding coding with a Python program [^1].

[^1]: By the way, I used the word "coding" just now, but you can think of it as roughly synonymous with "data compression". "Coding" means converting information into a bit string. In doing so, we want to "communicate" and "store" it as efficiently as possible, that is, with as few resources as possible. This is the motivation for "compressing" information as far as possible, and it has been studied for many years, ever since Shannon. To be picky, "data compression" is strictly only one of the purposes of "coding", but it is such a dominant purpose that in many contexts the two terms are used with almost the same meaning. Also, this time we restrict ourselves to the "noise-free" case, but of course there is also the "noisy" case. I have not studied it yet, so once I understand it I would like to introduce it as well (schedule undecided).

The following documents were used as references.

  1. Nielsen, Chuang "Quantum Computer and Quantum Communication (3)" Ohmsha (2005)
  2. Ishizaka, Ogawa, Kawachi, Kimura, Hayashi "Introduction to Quantum Information Science" Kyoritsu Shuppan (2012)
  3. Tomita "Quantum Information Engineering" Morikita Publishing (2017)

i.i.d. Source

In a nutshell, the question I want to answer is: given that an information source has certain properties, how much can its data be compressed in principle? To answer it, we need to abstract and model the rather elusive notion of "information" so that it can be handled mathematically. Simplifying things to capture the essence may deviate slightly from real information, but if it yields practically useful results, that is fine. With this attitude, let us first abstract and model the information source.

For example, imagine English text consisting of strings of alphabetic characters. Given the English sentence "I have a dream.", the characters "I", "$\space$", "h", "a", "v", "e", ... occur one after another in time order. Looking at the many other English sentences in the world, some letters occur frequently (such as "e" and "a") and some occur rarely (such as "q" and "z") [^2]. One idea for abstraction and modeling, then, is to regard the character generated at each point in time as a random variable [^3]. It seems natural to assume that the probability distribution of this character is the same at every time, and the assumption that characters generated at different times occur independently of each other also looks reasonable [^4]. An information source modeled so that the values it produces are independent and follow the same distribution is called an "i.i.d. source (independent and identically distributed source)". We will proceed with the discussion on this basis.

[^2]: Reference: character frequency table

[^3]: "I have a dream." is a passage from Martin Luther King, Jr.'s famous speech, and I think many people have read or heard it in English class. No matter how many times I hear it, the speech is moving, and even this one passage, "I have a dream.", conveys something of what he was trying to say. From here on, however, I will explain that we should throw away all such connotations, abstract them away, and treat information as a purely mathematical object. That is the information theory Shannon founded. It is a very dry stance, but it has changed the world enormously. Here is a link to a video of the speech; he begins saying "I have a dream." from around 11 minutes 40 seconds in. [Japanese subtitles] Rev. King's speech "I have a dream" --Martin Luther King "I Have A Dream"

[^4]: If you think about real English, this is not exactly true. For example, the probability that "h" follows "t" is higher than in other contexts. Still, for many sources even such a rough assumption does not seem to be a bad one. Also, starting from the simplest assumptions and gradually moving to more complex ones is the usual way to build a theory, so let us start with this simple assumption.

Typical sequences

Before Shannon's coding theorem, we need the concepts of "typical sequence" and "atypical sequence". Suppose the i.i.d. source produces random variables $X_1, X_2, X_3, \cdots$. For example, you can regard each toss of a coin, landing heads or tails, as a random variable, and imagine repeating the toss many times to create a sequence of heads and tails. If you assign the number "0" to heads and "1" to tails, the sequence becomes a binary sequence. With an ordinary coin, heads and tails each occur with probability 1/2, but I want to keep the discussion as general as possible, so assume the coin is biased so that heads appears with probability $p$. Even so, "identically distributed" is a natural assumption, because the same coin is thrown every time. Also, the outcome does not depend on past results (it is not the case that heads comes up more easily just because heads came up last time), so "independent" is not a strange assumption either. Therefore, the output of the coin tosses can be regarded as an i.i.d. source.

If you take $n$ random variables $X_1, X_2, \cdots, X_n$ from the infinite output of this i.i.d. source, there are many possibilities for the actual values $x_1, x_2, \cdots, x_n$, but they can be divided into patterns that are likely to occur and patterns that rarely occur. A sequence that is likely to occur is called a "typical sequence", and one that rarely occurs is called an "atypical sequence". Suppose the probability of getting 0 is $p = 1/4$ and the number of samples is $n = 100$. Then you can expect 0 to appear about 25 times. Even if you repeat this 100-sample experiment many times, there will be some fluctuation, but the number of 0s should stay concentrated around 25. If, on the contrary, 0 appears only 5 times, or as many as 85 times, either something very unusual has happened or the person tossing the coin is a cheat. In this example, a sequence in which 0 appears about 25 times is a "typical sequence".
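
To make this picture concrete, here is a small simulation sketch (the values $p=1/4$ and $n=100$ are the example above; the number of trials is my own choice). It samples many length-100 sequences and shows how the count of 0s concentrates around 25.

import random
from collections import Counter

p, n, trials = 0.25, 100, 10000   # P(0) = 1/4, block length 100

counts = Counter()
for _ in range(trials):
    seq = [0 if random.random() < p else 1 for _ in range(n)]
    counts[seq.count(0)] += 1      # record how many 0s appeared in this sequence

print("most common counts of 0s :", counts.most_common(5))
near_25 = sum(v for k, v in counts.items() if 20 <= k <= 30)
print("fraction with 20-30 zeros:", near_25 / trials)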

I said that a typical sequence is one that occurs easily, but how can this be quantified with probabilities? Let $p(x_1, x_2, \cdots, x_n)$ be the probability of obtaining the sequence $x_1, x_2, \cdots, x_n$ when sampling $n$ values from the random variable sequence. Since the source is i.i.d.,

p(x_1,x_2,\cdots,x_n) = p(x_1)p(x_2) \cdots p(x_n) \approx p^{np} (1-p)^{n(1-p)}  \tag{1}

Taking the logarithm (base 2) of both sides,

-\log p(x_1,x_2,\cdots,x_n) \approx -np \log p - n(1-p) \log (1-p) = n H(X)  \tag{2}

So

p(x_1,x_2,\cdots,x_n) \approx 2^{-nH(X)}  \tag{3}

That is, the probability can be approximated using the entropy. Here, Eq. (3) was derived for the binary case in which the random variables take only the values "0" and "1", but it also holds when they take more than two values. Let me prove this.

[Proof]

Suppose each random variable $X_i \space (i=1,2,\cdots)$ takes one of the symbols $A=\{ a_1,a_2,\cdots,a_{|A|}\}$, where $|A|$ denotes the number of elements of the set $A$. Sample $n$ values from the random variable sequence and suppose the values obtained are $\{ x_1,x_2,\cdots,x_n\}$. Since we are assuming an i.i.d. source, the probability of this sequence can be expressed as the product of the individual probabilities:

p(x_1,x_2,\cdots,x_n) = p(x_1)p(x_2) \cdots p(x_n)  \tag{4}

Moreover, every $p(x_i)$ has the same functional form. Writing $N(a_{\mu})$ for the number of times the symbol $a_{\mu}$ appears in the length-$n$ sequence, the probability $p(a_{\mu})$ of the value $x_i = a_{\mu}$ can be approximated as

p(a_{\mu}) \approx \frac{N(a_{\mu})}{n}  \tag{5}

Grouping the equal factors in Eq. (4) and using this approximation, we get

p(x_1,x_2,\cdots,x_n) = \prod_{\mu=1}^{|A|} p(a_{\mu})^{N(a_{\mu})} \approx \prod_{\mu=1}^{|A|} p(a_{\mu})^{np(a_{\mu})}  \tag{6}

Taking the logarithm of both sides,

- \log p(x_1,x_2,\cdots,x_n) \approx - \sum_{\mu=1}^{|A|} np(a_\mu) \log p(a_{\mu}) = nH(X) \tag{7}

So

p(x_1,x_2,\cdots,x_n) \approx 2^{-nH(X)}  \tag{8}

Thus Eq. (3) holds in the multi-valued case as well. (End of proof)

Since the typical sequences are roughly equiprobable and together carry almost all of the probability, the number of typical sequences is roughly the reciprocal of that probability, that is, about $2^{nH(X)}$. This means that if the source output is divided into blocks of $n$ symbols and each block is encoded, a code of $nH(X)$ bits is enough to transmit almost all of the information correctly. Of course, an atypical sequence may occur; in that case we give up and assign some arbitrary code word. A transmission error then occurs at a certain rate, but if $n$ is made large enough that rate can be made as small as we like, so in the end reliable communication is achieved. This is exactly what Shannon proved. Before getting into his proof, we need a somewhat deeper understanding of typical sequences, so let me discuss that first.
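
As a quick numerical illustration of this counting argument, here is a tiny sketch using the $p=1/4$ coin example from above with $n=100$; it compares the rough number of typical sequences $2^{nH(X)}$ with the total number of sequences $2^n$ and the corresponding code lengths.

import numpy as np

p, n = 0.25, 100                            # example from the text: P(0) = 1/4, n = 100
H = -p*np.log2(p) - (1 - p)*np.log2(1 - p)  # entropy of a single toss, about 0.811 bits

print("H(X)                     =", H)
print("all sequences     : 2^n  = 2 **", n)
print("typical sequences : 2^nH ~ 2 **", round(n*H, 1))
print("bits per block    : nH(X) ~", round(n*H, 1), "instead of", n)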

Given $\epsilon > 0$, a sequence $x_1,x_2,\cdots,x_n$ satisfying

2^{-n(H(X)+\epsilon)} \leq p(x_1,x_2,\cdots,x_n) \leq 2^{-n(H(X)-\epsilon)}  \tag{9}

is called an "$\epsilon$-typical sequence". The set of all $\epsilon$-typical sequences of length $n$ is denoted $T(n,\epsilon)$. Equation (9) can be rewritten as

| \frac{1}{n} \log \frac{1}{p(x_1,x_2,\cdots,x_n)} - H(X) | \leq \epsilon  \tag{10}

With this, the vaguely described notion of a typical sequence has been quantified so that it can be handled mathematically. Next, I will explain three theorems about typical sequences in order.
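
To see how this definition works in practice, here is a minimal sketch (the helper name is_typical and the example values are my own) that checks the condition of Eq. (10) for a binary i.i.d. source with $P(0)=p$.

import numpy as np

def is_typical(x, p, eps):
    # x: list of 0/1 values, p: true probability of 0, eps: tolerance in Eq. (10)
    n = len(x)
    n0 = x.count(0)
    log_prob = n0*np.log2(p) + (n - n0)*np.log2(1 - p)   # log2 p(x1,...,xn) for an i.i.d. source
    H = -p*np.log2(p) - (1 - p)*np.log2(1 - p)           # entropy H(X)
    return abs(-log_prob/n - H) <= eps

print(is_typical([0, 1, 1, 0, 1, 1, 1, 0, 1, 1], p=0.25, eps=0.1))   # 3 zeros out of 10 -> True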

Typical sequence theorem (1)

I quote the statement as given by Nielsen and Chuang.

"Fix $ \ epsilon> 0 $. For any $ \ delta> 0 $, a sufficiently large $ n $, the probability that the string is $ \ epsilon $ is at least $ 1- \ delta $. "

Well, I suspect many readers cannot tell what this is saying at first, so let me unpack it a little. $\epsilon$ is the $\epsilon$ that determines the range of typical sequences. Suppose we fix this $\epsilon$; imagine some suitable positive number (0.1, say). Then another quantity $\delta$ appears, which may be set to any positive value. The theorem claims that no matter which $\delta > 0$ you choose, if you take $n$ large enough, the probability that a sequence of length $n$ is $\epsilon$-typical is at least $1-\delta$. For example, if you set $\delta$ to a relatively large value (such as 0.999), the claim looks almost trivial: the required probability of being $\epsilon$-typical is very small, that is, the condition is very loose, so it feels as though it would hold even without a particularly large $n$. On the other hand, what if $\delta$ is very small (0.000001, say)? Then the claim is very demanding, because it says the probability of being $\epsilon$-typical is almost 1. However, $n$ may be taken as large as we like: if $n$ is large enough, the probability of being $\epsilon$-typical can be made at least $1-\delta$ (that is, as close to 1 as we want). That is what this theorem says. In a nutshell, it is a very simple story: if you take $n$ large enough, almost every sequence you observe will be an $\epsilon$-typical sequence. Think of an experiment in which you roll a die and record the outcomes. Even if the record is not typical after 10 rolls, it will surely be typical after 100 or 1000 rolls. The theorem just states this precisely. Let's prove it.

[Proof]

Let $X_1, X_2, \cdots$ be the random variable sequence obtained from an i.i.d. source. Since the $p(X_i)$ are independent and identically distributed, the $-\log p(X_i)$ are also independent and identically distributed. This means that, for sufficiently large $n$, the average of the $n$ sampled values of $-\log p(X_i)$ approaches the expected value $E(-\log p(X))$. That is, for any $\epsilon > 0, \delta > 0$,

p(| \sum_{i=1}^{n} \frac{- \log p(X_i)}{n} - E(- \log p(X)) | \leq \epsilon) \geq 1-\delta  \tag{11}

there exists $n > 0$ for which this holds. Is that clear? However small you take $\epsilon$ and $\delta$, the inequality can be satisfied by taking $n$ large enough. In other words, for sufficiently large $n$ we have $E(-\log p(X)) \approx \sum_{i=1}^{n} \frac{-\log p(X_i)}{n}$, so this is just a restatement of the so-called "law of large numbers".

Here,

\begin{align}
& E(- \log p(X)) = H(X) \\
& \sum_{i=1}^{n} \log p(X_i) = \log(p(X_1,X_2,\cdots,X_n))  \tag{12} 
\end{align}

so Eq. (11) can be rewritten as

p(| - \frac{\log(p(X_1,X_2,\cdots,X_n))}{n} - H(X) | \leq \epsilon) \geq 1-\delta  \tag{13}

The condition inside $p(\cdot)$ on the left-hand side is exactly the definition of an $\epsilon$-typical sequence (Eq. (10)), so Eq. (13) says that the probability of obtaining an $\epsilon$-typical sequence is at least $1-\delta$, which is what we wanted to prove. (End of proof)
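
Before moving on, here is a small Monte Carlo sketch of this theorem (the source probability, $\epsilon$, and the number of trials are values I picked). It estimates, for several $n$, the probability that a randomly generated sequence is $\epsilon$-typical; the estimate should approach 1 as $n$ grows.

import random
import numpy as np

p, eps, trials = 0.25, 0.1, 2000              # P(0) = 1/4; eps and trials chosen arbitrarily
H = -p*np.log2(p) - (1 - p)*np.log2(1 - p)

for n in [10, 100, 1000]:
    hits = 0
    for _ in range(trials):
        x = [0 if random.random() < p else 1 for _ in range(n)]
        n0 = x.count(0)
        minus_log_p = -(n0*np.log2(p) + (n - n0)*np.log2(1 - p))   # -log2 p(x1,...,xn)
        if abs(minus_log_p/n - H) <= eps:                          # condition of Eq. (10)
            hits += 1
    print("n =", n, " estimated P(epsilon-typical) =", hits/trials)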

Typical sequence theorem (2)

Again I quote the statement from Nielsen and Chuang.

"Any fixed\epsilon >0\delta >0Big enoughnAgainst\epsilonNumber of typical series|T(n,\epsilon)|Satisfies the following equation.

(1-\delta) 2^{n(H(X)-\epsilon)} \leq |T(n,\epsilon)| \leq 2^{n(H(X)+\epsilon)}  \tag{14}

In other words, in the limit $\epsilon \to 0$, $\delta \to 0$, if $n$ is taken large enough we can approximate $|T(n,\epsilon)| \approx 2^{nH(X)}$. Just below Eq. (8) I said that "the number of typical sequences is about $2^{nH(X)}$"; stated mathematically precisely, that claim becomes this theorem. Let's prove it.

[Proof]

Using the lower bound in Eq. (9) on the probability of a sequence $x = (x_1,x_2,\cdots,x_n)$, together with the fact that the total probability cannot exceed 1, we get

1 \geq  \sum_{x \in T(n,\epsilon)} p(x) \geq \sum_{x \in T(n,\epsilon)} 2^{-n(H(X)+\epsilon)} = |T(n,\epsilon)| 2^{-n(H(X)+\epsilon)}  \tag{15}

So

|T(n,\epsilon)| \leq 2^{n(H(X)+\epsilon)}  \tag{16}

holds. This proves the right-hand inequality of Eq. (14). Next, the left-hand inequality.

From the typical sequence theorem (1) and the upper bound in Eq. (9),

1-\delta \leq \sum_{x \in T(n,\epsilon)} p(x) \leq \sum_{x \in T(n,\epsilon)} 2^{-n(H(X)-\epsilon)} = |T(n,\epsilon)| 2^{-n(H(X)-\epsilon)}  \tag{17}

So

(1-\delta) 2^{n(H(X)-\epsilon)} \leq |T(n,\epsilon)|  \tag{18}

holds. Combining Eqs. (16) and (18) gives Eq. (14). (End of proof)
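
For a binary source, $T(n,\epsilon)$ can also be counted exactly, because all sequences with the same number of zeros have the same probability. The following sketch (parameter values are my own choice) counts $|T(n,\epsilon)|$ and compares its logarithm with the exponents appearing in Eq. (14).

from math import comb, log2

p, eps, n = 0.25, 0.1, 100
H = -p*log2(p) - (1 - p)*log2(1 - p)

# a length-n binary sequence with k zeros has -(1/n) log2 p(x) = -(k*log2(p) + (n-k)*log2(1-p))/n,
# so |T(n,eps)| is a sum of binomial coefficients over the admissible values of k
size_T = sum(comb(n, k) for k in range(n + 1)
             if abs(-(k*log2(p) + (n - k)*log2(1 - p))/n - H) <= eps)

print("log2 |T(n,eps)| =", log2(size_T))
print("n(H-eps) =", n*(H - eps), "  n(H+eps) =", n*(H + eps))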

Typical sequence theorem (3)

Once more, I quote the statement from Nielsen and Chuang.

"$ S (n) $ is a collection of strings of length $ n $ from a source, and its size is at most $ 2 ^ {nR} $. Where $ R \ <H (X) $ Is fixed. At this time, the following equation holds for any $ \ delta > 0 $ and a sufficiently large $ n $.

\sum_{x \in S(n)} p(x) \leq \delta  \tag{19}

The number of typical sequences is about $2^{nH(X)}$. Now take a subset $S(n)$ containing only about $2^{nR}$ sequences, with $R$ smaller than $H(X)$. The theorem claims that if $n$ is made large enough, the total probability of the sequences in $S(n)$ becomes arbitrarily small, converging to 0 in the limit $n \to \infty$. Is that really true, you may want to ask, but it is. I will prove it below; please enjoy the power of powers of 2.

[Proof]

Split the sequences in $S(n)$ into typical and atypical ones. From the typical sequence theorem (1), when $n$ is sufficiently large the total probability of the atypical sequences of length $n$ is sufficiently small, so the probability contributed by the atypical sequences in $S(n)$ is also sufficiently small; it goes to 0 in the limit $n \to \infty$.

On the other hand, the number of typical sequences contained in $S(n)$ is at most the total number of sequences in $S(n)$, namely at most $2^{nR}$. Since a typical sequence occurs with probability about $2^{-nH(X)}$, the total probability of the typical sequences in $S(n)$ is at most about $2^{n(R-H(X))}$. That is, for sufficiently large $n$,

\sum_{x \in S(n)} p(x) \leq 2^{n(R-H(X))}  \tag{20}

Since $R < H(X)$, the right-hand side decays exponentially as $n$ grows, so for any $\delta > 0$,

\sum_{x \in S(n)} p(x) \leq \delta  \tag{21}

is satisfied for all sufficiently large $n$. In other words, in the limit $n \to \infty$,

\sum_{x \in S(n)} p(x) \to 0  \tag{22}

holds. (End of proof)
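
To get a feel for how quickly the bound in Eq. (20) shrinks, here is a tiny numerical sketch ($p$ and $R$ are arbitrary values of mine with $R < H(X)$).

from math import log2

p = 0.25
H = -p*log2(p) - (1 - p)*log2(1 - p)   # about 0.811 bits
R = 0.7                                # any fixed rate below H(X)

for n in [10, 100, 1000]:
    print("n =", n, " upper bound 2^(n(R-H)) =", 2.0**(n*(R - H)))   # right-hand side of Eq. (20)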

Shannon's coding theorem in a noise-free channel

Now we are ready to prove Shannon's theorem, which is today's main event. Once again, I quote the statement from Nielsen and Chuang.

"$ \ {X_i \} $ is the iid source of entropy $ H (X) $. $ R> H (X) $. At this time, the rate $ R $ for this source. There is a reliable compression method for. Conversely, if $ R <H (X) $, no compression method is reliable. "

In a nutshell, you can build a compression method that accurately transmits this information, as long as you have a transmission rate that is slightly higher than the entropy of the source. Let's prove it.

[Proof]

First, the case $R > H(X)$.

Choose an $\epsilon$ satisfying $H(X) + \epsilon \leq R$, and consider the set $T(n,\epsilon)$ of $\epsilon$-typical sequences for that $\epsilon$. From the typical sequence theorems (1) and (2) and the assumption $R > H(X)$, for any $\delta > 0$ and sufficiently large $n$,

|T(n,\epsilon)| \leq 2^{n(H(X)+\epsilon)} \leq 2^{nR}  \tag{23}

holds. In other words, the number of typical sequences is at most $2^{nR}$, and such a sequence is produced with probability at least $1-\delta$. Put loosely, if $n$ is taken sufficiently large, the output is a typical sequence with probability arbitrarily close to 1, and there are at most $2^{nR}$ such sequences [^5]. Therefore, if each block of $n$ symbols is encoded using $nR$ bits, it can be transmitted essentially without error. This is what is meant by "there is a reliable compression scheme of rate $R$".

[^5]: The mathematical phrasing with $\epsilon$ and $\delta$ is peculiar and, I think, off-putting at first, but this is what things look like when stated precisely. If you look closely, it often says something surprisingly simple. In this case: if you roll a die enough times, the recorded sequence becomes typical, and the number of such typical sequences is at most $2^{nR}$ (where $R$ is taken slightly larger than the entropy of a single roll).

Next, the case $R < H(X)$.

From the typical sequence theorem (3), when $R < H(X)$ the probability that the sequence emitted by the source falls in any set $S(n)$ of at most $2^{nR}$ sequences goes to 0 for sufficiently large $n$. Since a rate-$R$ code can distinguish at most $2^{nR}$ blocks, no compression scheme of rate $R$ can be reliable. (End of proof)

Coding simulation

Various methods such as Huffman coding and arithmetic coding have been proposed for losslessly compressing data down to a rate close to the entropy of the source, but here we will implement a very primitive code whose sole purpose is to experience Shannon's theorem. That is, since reliable data compression becomes possible when $n$ is taken sufficiently large, we confirm by simulation that the transmission error indeed decreases as $n$ is increased [^6].

[^6]: Note, therefore, that the code implemented here is not lossless, and besides being lossy we have made no effort to improve its efficiency, so please regard it as an essentially impractical method.

This time, the assumed source emits a binary sequence over $\{0,1\}$. The compression scheme is very simple. Divide the entire binary sequence to be transmitted into blocks of $n$ symbols. For each block, decide whether it is a typical sequence; if it is, assign it a unique $nR$-bit code word. If it is not, give up and assign a single code word meaning "unknown".

Implementation

Here is the entire Python code.

import random
import numpy as np
from collections import Counter

def generator(prob, N):
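    # generate a list of N random bits: 0 with probability prob, 1 otherwise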

    return [0 if random.random() < prob else 1 for i in range(N)]

def entropy_exp(s):
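    # empirical (per-symbol) entropy, in bits, of a string s of '0'/'1' characters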

    prb_0 = list(s).count('0') / len(s)
    prb_1 = 1.0 - prb_0
    if prb_0 == 1.0 or prb_1 == 1.0:
        ent_exp = 0.0
    else:
        ent_exp = - prb_0*np.log2(prb_0)- prb_1*np.log2(prb_1)

    return ent_exp
    
def coder(data, blk, ent, eps, rat):

    # binary data to symbols (= message)
    # ex) [0,1,1,0,1,0,0,1,...] -> ['0110','1001',...](for blk = 4) 
    message = [''.join(map(str,data[i:i+blk])) for i in range(0,len(data),blk)]

    # histogram
    syms = list(Counter(message).keys())
    frqs = list(Counter(message).values())
    prbs = [frqs[i]/len(message) for i in range(len(frqs))]
    hist = dict(zip(syms, prbs))

    # codebook
    index = 0
    digits = int(rat*blk)  # number of digits for each code
    codebook = {}
    for s,p in hist.items():
        ent_exp = entropy_exp(s)
        if abs(ent_exp - ent) <= eps and index < 2**digits-1:
            codebook[s] = '{:0{digits}b}'.format(index, digits=digits)
            index += 1
    codebook[False] = '{:0{digits}b}'.format(index, digits=digits)

    # coding (message -> codes)
    codes = [codebook[s] if s in codebook else codebook[False]
             for s in message]

    # decodebook
    decodebook = {}
    for k,v in codebook.items():
        decodebook[v] = k
    
    return (codes, decodebook)

def decoder(codes, decodebook, block):
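    # look up each codeword; an "unknown" code (False) is replaced by a random block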

    message = [decodebook[c] for c in codes]
    blocks = [list(s) if s != False else generator(0.5,block) for s in message]
    data_str = sum(blocks, [])

    return [int(d) for d in data_str]

def error_rate(data_in, data_out):
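    # fraction of positions where the decoded bits differ from the original bits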

    N = len(data_in)
    count = 0
    for i in range(N):
        if data_in[i] != data_out[i]:
            count += 1
            
    return count / N

if __name__ == "__main__":

    # settings
    BLK = 10          # block size
    NUM = BLK * 1000  # number of binary data sequence
    PRB = 0.8         # probability to generate '0'
    ENT = - PRB * np.log2(PRB) - (1-PRB) * np.log2(1-PRB)  # entropy (= 0.7219..)
    RAT = 0.9         # transmission rate (RAT > ENT)
    EPS = 0.15        # epsilon for typical sequence
    print("NUM = ", NUM)
    print("BLK = ", BLK)
    print("ENT = ", ENT)
    print("RAT = ", RAT)

    # generate data sequence
    data_in = generator(PRB, NUM)

    # code the data
    (codes, decodebook) = coder(data_in, BLK, ENT, EPS, RAT)

    # decode the data
    data_out = decoder(codes, decodebook, BLK)

    # evaluate error rate
    err_rate = error_rate(data_in, data_out)
    print("error rate = ", err_rate)

Let me explain what the code is doing. First, look at the main processing section.

# settings
BLK = 10          # block size
NUM = BLK * 1000  # number of binary data sequence
PRB = 0.8         # probability to generate '0'
ENT = - PRB * np.log2(PRB) - (1-PRB) * np.log2(1-PRB)  # entropy (= 0.7219..)
RAT = 0.9         # transmission rate (RAT > ENT)
EPS = 0.15        # epsilon for typical sequence

This is where the various parameters are set. BLK is the block size, that is, the number of binary symbols contained in one block. NUM is the total number of binary symbols to transmit (here we assume 1000 blocks' worth). PRB is the probability that "0" occurs (set to 0.8). ENT is the entropy per symbol (with PRB = 0.8 it is 0.7219...). RAT is the transmission rate; no compression happens unless it is larger than 0.7219... and it must be smaller than 1.0, so I set it to 0.9. EPS is the threshold $\epsilon$ used to decide whether a block is a typical sequence. It has to be reasonably small [^7], but if it is made too small the error does not become small unless $n$ is very large, so I chose 0.15 as a suitable value.

[^7]: It must be an $\epsilon$ satisfying $H(X) + \epsilon \leq R$.

# generate data sequence
data_in = generator(PRB, NUM)

First, the function generator creates a series of binary values to be transmitted. Randomly generate an integer value of "0" or "1" according to the probability set in PRB, and output a list of length NUM. See the function definition for details.

# code the data
(codes, decodebook) = coder(data_in, BLK, ENT, EPS, RAT)

The function coder encodes data_in and outputs the code sequence codes and the dictionary decodebook for decoding. Let's take a look at the contents of the function.

# binary data to symbols (= message)
# ex) [0,1,1,0,1,0,0,1,...] -> ['0110','1001',...](for blk = 4) 
message = [''.join(map(str,data[i:i+blk])) for i in range(0,len(data),blk)]

Since the data to be transmitted is a list of the integers 0 and 1, each block is converted into a character string, and the blocks are collected in a list called message. An example for block size 4 is given in the comment.

# histogram
syms = list(Counter(message).keys())
frqs = list(Counter(message).values())
prbs = [frqs[i]/len(message) for i in range(len(frqs))]
hist = dict(zip(syms, prbs))

Create a histogram from the elements contained in message and store it in a dictionary called hist.

# codebook
index = 0
digits = int(rat*blk)  # number of digits for each code
codebook = {}
for s,p in hist.items():
    ent_exp = entropy_exp(s)
    if abs(ent_exp - ent) <= eps and index < 2**digits-1:
        codebook[s] = '{:0{digits}b}'.format(index, digits=digits)
        index += 1
codebook[False] = '{:0{digits}b}'.format(index, digits=digits)

Now that we know which symbol sequences (blocks of $n$ symbols) actually appear in message, we decide which code word to assign to each one and store the result in the dictionary codebook. To decide whether a symbol sequence s appearing in the histogram is a typical sequence, the for loop uses the function entropy_exp to compute the empirical entropy of that sequence. If the difference between this entropy and the true entropy is at most $\epsilon$, the sequence is regarded as typical and assigned as its code word the binary representation, $0.9 n$ bits long, of a counter index that counts up from 0. Atypical sequences are all represented by the key False, meaning "unknown", and the remaining value of index is assigned as their common code word.

# coding (message -> codes)
codes = [codebook[s] if s in codebook else codebook[False]
         for s in message]

Now, using the obtained codebook, convert the message into a list of code words codes.

# decodebook
decodebook = {}
for k,v in codebook.items():
    decodebook[v] = k

Create a dictionary decodebook that is the inverse mapping of codebook, to be used for decoding.

return (codes, decodebook)

Returns the codes and decodebook.

Return to the main processing section.

# decode the data
data_out = decoder(codes, decodebook, BLK)

Decode codes with the decoder function, using decodebook. As you can see from the function definition, each code word is looked up in decodebook; if the result is False, a random block is generated instead, accepting that this will cause errors.

Once decoding is complete,

# evaluate error rate
err_rate = error_rate(data_in, data_out)
print("error rate = ", err_rate)

Use the error_rate function to calculate and display the error rate.

Operation check

Now let's try different block sizes ($ n $ values) to see what the error rate will be. First, when $ n = 10 $.

NUM =  10000
BLK =  10
ENT =  0.7219280948873623
RAT =  0.9
error rate =  0.3499

The error rate is very high. About 35% are errors. Next is the case of $ n = 100 $.

NUM =  100000
BLK =  100
ENT =  0.7219280948873623
RAT =  0.9
error rate =  0.03194

Compared to the previous time, it decreased by one digit to about 3.2%. Finally, if $ n = 200 $.

NUM =  200000
BLK =  200
ENT =  0.7219280948873623
RAT =  0.9
error rate =  0.00454

It decreased by one digit to about 0.45%.

So we found that if you fix the value of $\epsilon$ and increase the value of $n$, the error rate decreases steadily. Shannon's theorem asserts that in this limit the error rate goes to zero.
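
If you want to reproduce this trend in a single run, a minimal driver sketch like the following can be appended to the file (it assumes the functions generator, coder, decoder and error_rate defined above are available; the block sizes are the ones tested here).

# a minimal sketch of a block-size sweep; assumes generator, coder, decoder
# and error_rate from the listing above are defined in the same file
import numpy as np

PRB, RAT, EPS = 0.8, 0.9, 0.15
ENT = - PRB*np.log2(PRB) - (1 - PRB)*np.log2(1 - PRB)

for BLK in [10, 100, 200]:
    NUM = BLK * 1000
    data_in = generator(PRB, NUM)
    codes, decodebook = coder(data_in, BLK, ENT, EPS, RAT)
    data_out = decoder(codes, decodebook, BLK)
    print("BLK =", BLK, " error rate =", error_rate(data_in, data_out))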

In conclusion

This time, I was talking about classical information theory, so I didn't use the quantum calculation simulator qlazy. If you copy and paste the code, it will work in your Python environment, so if you are interested, please play with it.

Next time, as mentioned in the Introduction, I will explain "Schumacher's coding theorem in a noise-free quantum channel", the quantum version of Shannon's theorem. The "typical subspace" appears there as the counterpart of the typical sequence, and the theorem is derived along much the same lines as the discussion here. It is interesting that quantum information can be compressed just like classical information, and that the limiting performance is likewise given by the (quantum) entropy.

That's all.
