
Source Coding


In this chapter we take a look at the basic principles and algorithms for data compression.

1 The role of coding

In the general block diagram of a communication system, the coding block is situated between the information source and the communication channel.

Figure: General block diagram of a communication system

Its role is to prepare the raw information for transmission over the channel. It has two main jobs:

  1. Source coding:

    • Convert source messages to channel symbols, i.e. the actual symbols which the channel knows to transmit.

      For example, express the messages in binary form (zeros and ones), for sending over a binary channel.

    • Minimize the number of symbols needed to be transmitted (i.e., data compression). We don’t want to transmit more symbols than necessary to recover the messages at the receiving end.

    • Adapt the probabilities of the symbols in order to maximize the mutual information. We will discuss this more in Chapter IV.

  2. Error control coding

    • Protect the information against channel errors

    • Also known as “channel coding”

Basically, source coding refers to all the procedures required to express the source messages as channel symbols in the most efficient way possible, while error control coding refers to all the algorithms used to protect the data against errors.

The coding block has a corresponding decoding block on the receiving end. Its job is to “undo” all the coding operations:

Is it possible to do these two jobs separately, one after another, in two consecutive operations? Yes, as the source-channel separation theorem establishes.

We give below only an informal statement of the theorem:

In this chapter, we consider only the source coding algorithms, without any error control. Basically, we assume that data transmission is done over an ideal channel with no noise, and therefore the transmitted symbols are perfectly recovered at the receiver.

In this context, our main concern is how to minimize the number of symbols needed to represent the messages, while making sure that the receiver can decode the messages correctly. The advantages of data compression are self-evident:

2 Definitions

Let’s define what coding means from a mathematical perspective.

Consider an input information source with the set of messages:

$$S = \left\lbrace s_1, s_2, \ldots, s_N \right\rbrace$$

and suppose we would like to express the messages as a sequence of the following code symbols:

$$X = \left\lbrace x_1, x_2, \ldots, x_M \right\rbrace$$

The set $X$ is known as the alphabet of the code.

For example, for a binary code we have $X = \lbrace 0, 1 \rbrace$ and a possible sequence of symbols is $c = 00101101$.

The codewords are the sequences of symbols used by the code.

The codeword length, which we denote as $l_i$, is the number of symbols in a given codeword $c_i$.

Encoding a given message or sequences of messages means replacing each message with its codeword.

Decoding means deducing back the original sequence of messages, given a sequence of symbols.
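As a small illustration, the following Python sketch encodes and decodes with a hypothetical codeword table (the messages and codewords are assumed for the example; the simple greedy decoder works because no codeword is a prefix of another, a property discussed later in this chapter):

```python
# Encoding/decoding with a hypothetical codeword table (messages and codewords
# are assumed for the example, not taken from the text).
code = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}

def encode(messages):
    """Replace each message with its codeword and concatenate."""
    return "".join(code[m] for m in messages)

def decode(bits):
    """Emit a message whenever the accumulated bits match a codeword.
    This greedy scan is correct because no codeword is a prefix of another."""
    inverse = {w: m for m, w in code.items()}
    messages, buffer = [], ""
    for b in bits:
        buffer += b
        if buffer in inverse:
            messages.append(inverse[buffer])
            buffer = ""
    return messages

bits = encode(["s1", "s3", "s2"])
print(bits)           # -> 011010
print(decode(bits))   # -> ['s1', 's3', 's2']
```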

As an example, the ASCII code is a widely-used code for encoding characters, consisting of 128 codewords (7 bits, in practice stored on 1 byte):

ASCII code (partial)

Nowadays, ASCII is usually replaced by Unicode, most commonly in its UTF-8 encoding: a variable-length code which uses 1 to 4 bytes per character and covers over a million code points, allowing for many more letters used in languages around the globe.

2.1 The graph of a code

The codewords of a code can be represented as a binary tree. We call this representation the graph of the code.

TODO: Example at blackboard

2.2 Average code length

There are many ways to define the codewords for messages, but some are better than others.

The most basic quantity indicating efficiency of a code is the average code length.

For every codeword $c_i$, we multiply the probability of the corresponding message, $p(s_i)$, by its length $l_i$:

$$\overline{l} = \sum_i p(s_i) l_i$$

A code with smaller average length is better, because it represents sequences of messages with fewer symbols, on average. However, there is a certain lower limit to how small the average length can be (for example, it cannot be 0, for self-evident reasons).
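For illustration, the average length is a one-line computation (the probabilities and lengths below are assumed):

```python
# Average code length for an assumed source and code.
probs   = [0.5, 0.25, 0.125, 0.125]    # p(s_i)
lengths = [1, 2, 3, 3]                  # l_i, e.g. for codewords 0, 10, 110, 111

avg_length = sum(p * l for p, l in zip(probs, lengths))
print(avg_length)   # -> 1.75
```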

This raises the following interesting question:

Given a source $S$, how small can the average length be?

This is a fundamental question, to which we will provide an answer later in this chapter.

2.3 Instantaneous codes

We introduce another set of useful definitions regarding the codeword structure.

A code can be:

Examples: at the blackboard

We can summarize the relations between these three code types as follows:

$\textrm{instantaneous} \subset \textrm{uniquely decodable} \subset \textrm{non-singular}$

Graph-based decoding of instantaneous codes

Using the graph of the code, we can use a very simple procedure for decoding any instantaneous code:

TODO: Illustrate at whiteboard

This procedure shows the advantage of instantaneous codes over other codes which might be uniquely decodable, but are not instantaneous. Instantaneous codes allow for simple decoding. There is never any doubt about the next message in the sequence.

This decoding scheme also shows why these codes are named “instantaneous”: a codeword can be decoded as soon as it is fully received, immediately, without any delay, because the decoding takes place on-the-fly while the bits are received one-by-one.

As a counter-example, consider the following uniquely decodable, but non-instantaneous code: $\left\lbrace 0, 01, 011, 1110 \right\rbrace$. When you read the first 0, you cannot decode it yet, because you need to wait for the next bits to understand how to segment the sequence. This implies that the decoding has some delay.
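The defining property of an instantaneous code, that no codeword is a prefix of another, is easy to check programmatically; a sketch (the first code is an assumed prefix-free example, the second is the counter-example above):

```python
# Check whether a code is instantaneous, i.e. prefix-free (the first code below
# is an assumed prefix-free example; the second is not prefix-free).
def is_prefix_free(codewords):
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print(is_prefix_free(["0", "10", "110", "111"]))    # -> True
print(is_prefix_free(["0", "01", "011", "1110"]))   # -> False ("0" is a prefix of "01")
```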

3 The Kraft inequality theorem

When can an instantaneous code exist? Given a DMS $S$, are we sure we can find an instantaneous code for it, and if yes, under which conditions?

The answer to these questions is provided by the Kraft inequality.
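Numerically, checking the Kraft condition is a one-liner; a sketch for a binary alphabet ($D = 2$; the candidate lengths are assumed for illustration):

```python
# Check the Kraft inequality sum_i D^(-l_i) <= 1 for candidate codeword lengths
# (binary alphabet, D = 2; the lengths below are assumed for illustration).
def kraft_sum(lengths, D=2):
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # -> 1.0   (satisfied, with equality)
print(kraft_sum([1, 1, 2]))      # -> 1.25  (violated: no instantaneous code exists)
```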

Comments on the Kraft inequality:

We have seen that instantaneous codes must obey the Kraft inequality. But what about uniquely decodable codes? The answer is given by the next theorem.

Consequences of the McMillan theorem:

3.1 Finding an instantaneous code for given lengths

How can we find codewords with given code lengths $\{l_i\}$?

In general, we may use the following procedure:

Note that this procedure only gives us the codewords, not the mapping to a particular set of messages.

In practice, there might be more elaborate ways to find the codewords and map them to the messages of the source, with additional benefits.
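As an illustration, one possible construction (an assumption on my part — the text leaves the actual procedure to the blackboard) sorts the lengths and assigns consecutive binary values, which always yields a prefix-free code when the Kraft inequality holds:

```python
# Build a binary instantaneous code for given lengths satisfying Kraft
# (a canonical-code sketch; this is one possible construction, assumed here).
def codewords_from_lengths(lengths):
    assert sum(2 ** (-l) for l in lengths) <= 1, "Kraft inequality violated"
    words, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= (l - prev_len)              # extend previous value to length l
        words.append(format(value, "0{}b".format(l)))
        value += 1                            # next codeword starts one step later
        prev_len = l
    return words

print(codewords_from_lengths([2, 1, 3, 3]))   # -> ['0', '10', '110', '111']
```

Note that, as in the text, this gives only the codewords, not their mapping to particular messages.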

4 Minimum average length

We will now discuss one of the most important aspects of this chapter: finding a code with minimum average length.

Given a DMS $S$, suppose we want to find an instantaneous code for it, in such a way as to minimize the average length of the code:

$$\overline{l} = \sum_i p(s_i) l_i$$

How can we find such an optimal code?

Given that it should be instantaneous, the codeword lengths must obey the Kraft inequality.

In mathematical terms, we formulate the problem as a constrained optimization problem:

$$\begin{aligned} \textbf{minimize } &\sum_i p(s_i) l_i \\ \textrm{subject to } &\sum_i D^{-l_i} \leq 1 \end{aligned}$$

This means that we want to find the unknowns $l_i$ which minimize a certain quantity ($\sum_i p(s_i) l_i$), while satisfying a certain constraint ($\sum_i D^{-l_i} \leq 1$).

4.1 The method of Lagrange multipliers

To solve this problem, we use a standard mathematical tool known as the method of Lagrange multipliers.

Let’s apply this mathematical tool to our problem. In our case, the functions and the variables involved are the following:

The resulting optimal codeword lengths are:

$$l_i = -\log(p(s_i))$$

The fundamental interpretation of this result is that, in order to minimize the average length of a code, a message with higher probability must have a shorter codeword, while a message with lower probability must have a longer codeword. This makes sense: we can afford to use longer codewords for messages which appear rarely in a sequence.

4.2 Entropy is the limit

Another fundamental consequence of the preceding theorem gives us a functional definition of the entropy, i.e. it tells us what the entropy really says about an information source.

Assume we have a binary code (we choose it binary in order to match the choice of $\log_2()$ in the definition of entropy).

Since the optimal codeword lengths are $l_i = -\log_2(p(s_i))$, the minimal average length obtainable is equal to the entropy of the source:

$$\min \overline{l} = \sum_i p(s_i) {l_i}_{min} = -\sum_i p(s_i) \log_2(p(s_i)) = H(S)$$

This result tells us what the entropy really means, besides the algebraic definition provided in chapter I. The entropy gives us the minimum number of bits required to represent the data generated by an information source in binary form. Although we rigorously proved this result only for a discrete memoryless source, it holds for sources with memory as well.

Some immediate consequences:

Put differently, the average length of a uniquely decodable code must be at least as large as the source entropy. One can never represent messages, on average, with a code having average length less than the entropy:

$$H(S) \leq \overline{l}$$
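Numerically, for a source whose probabilities are powers of 2 (a dyadic source, chosen so the optimal lengths are integers), the bound holds with equality:

```python
import math

# Entropy vs. average length for an assumed dyadic source, where the optimal
# codeword lengths l_i = -log2(p(s_i)) are integers.
probs   = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

H   = -sum(p * math.log2(p) for p in probs)
avg = sum(p * l for p, l in zip(probs, lengths))
print(H, avg)   # -> 1.75 1.75 (the bound H(S) <= average length holds with equality)
```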

Efficiency and redundancy of a code

Considering a code for an information source, comparing the average code length with the entropy indicates the efficiency of the encoding, defined (for a binary code) as $\eta = \frac{H(S)}{\overline{l}}$; the redundancy of the code is $\rho = 1 - \eta$. These quantities indicate how close the average length is to the optimal value.

Coding with the wrong code

An interesting question arises in practice. Consider a source with probabilities $q(s_i)$. We design a code for this source, but we use it to encode messages from a different source, with probabilities $p(s_i)$.

This happens in practice when we estimate the probabilities from a sample data file, and then use the code for other, similar files. Those files might not have exactly the same frequencies of messages as the sample file used to design the code.

How does this affect us? We lose some efficiency: the codeword lengths $l_i \approx -\log_2(q(s_i))$ are not optimal for the source with probabilities $p(s_i)$, which leads to an increased average length.

If the codewords were optimal, the codeword lengths would have been $\approx -\log_2(p(s_i))$, leading to an average length equal to the entropy $H(S)$:

$$\overline{l_{optimal}} = -\sum_i p(s_i) \log_2 p(s_i)$$

But the actual average length we obtain is:

$$\overline{l_{actual}} = \sum_i p(s_i) l_i = -\sum_i p(s_i) \log_2 q(s_i)$$

The difference between average lengths is what we lose due to the mismatch:

$$\overline{l_{actual}} - \overline{l_{optimal}} = \sum_i p(s_i) \log_2\left(\frac{p(s_i)}{q(s_i)}\right) = D_{KL}(p || q)$$

The difference is exactly the Kullback-Leibler distance between the two distributions $p$ and $q$.

Therefore we have a functional definition of the KL distance: the KL distance between two distributions $p$ and $q$ is the number of extra bits spent when we encode messages from $p$ using a code optimized for a different distribution $q$.

Due to the properties of the KL distance, the difference is always $\geq 0$, which shows that such a mismatch always increases the average length. The more different the two distributions are, the larger the increase in average length becomes.
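A quick numeric check of this identity (the two distributions are assumed for illustration):

```python
import math

# Extra bits caused by coding source p with a code designed for q
# (the two distributions are assumed for illustration).
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

l_optimal = -sum(pi * math.log2(pi) for pi in p)              # = H(S)
l_actual  = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))  # cross-entropy
kl        = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

print(l_actual - l_optimal, kl)   # -> 0.25 0.25 (the difference equals D_KL(p||q))
```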

5 Coding procedures

5.1 Optimal and non-optimal codes

A code for which $\eta = 1$ is known as an optimal code. Such a code attains the minimum average length:

$$\overline{l} = H(S)$$

However, this is only possible if the probabilities are powers of 2, such that $l_i = -\log_2(p(s_i))$ is an integer, because the codeword lengths must always be natural numbers.

An optimal code can always be found for a source where all $p(s_i)$ are powers of 2, using for example the graph-based procedure explained earlier.

However, in the general case $l_i = -\log(p(s_i))$ might not be an integer. This happens when $p(s_i)$ is not a power of 2. What can we do in this case?

A simple solution to this problem is to round every logarithm up to the next natural number:

$$l_i = \lceil -\log(p(s_i)) \rceil$$

For example:

$$-\log(p(s_i)) = 2.15 \Rightarrow l_i = 3$$
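In code, this rounding step looks as follows (the probabilities are assumed for illustration); note that the resulting lengths automatically satisfy the Kraft inequality, since $2^{-\lceil -\log_2 p \rceil} \leq p$:

```python
import math

# Shannon code lengths l_i = ceil(-log2 p(s_i)) for an assumed non-dyadic source.
probs = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(-math.log2(p)) for p in probs]

print(lengths)                                 # -> [2, 2, 3, 4]
print(sum(2 ** (-l) for l in lengths) <= 1)    # -> True: Kraft holds by construction
```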

This leads to the Shannon coding algorithm described below.

5.2 Shannon coding

The Shannon coding procedure is a simple algorithm for finding an instantaneous code for a DMS.

The code obtained with this procedure is known as a Shannon code.

Shannon coding is one of the simplest algorithms, and better schemes are available. However, its simplicity allows us to prove fundamental results in coding theory, as we shall see below.

One such fundamental result shows that the average length of a Shannon code is actually close to the entropy of the source.

Although a Shannon code is not optimal, in the sense that it does not attain the minimum achievable average length, it is also not very bad: the average length of a Shannon code is at most 1 bit longer than the minimum possible value, which is quite efficient. However, better procedures do exist.

5.3 Shannon’s first theorem

Shannon’s first theorem (coding theorem for noiseless channels) shows us that, in theory, we can approach the entropy as close as we want, using extended sources.

The theorem shows that, in principle, we can always obtain $\overline{l} \to H(S)$, if we consider longer and longer sequences of messages from the source, modeled as larger and larger extensions of the source.

However, this result is mostly theoretical, because in practice the size of $S^n$ gets too large for large $n$. Other (better) algorithms than Shannon coding are used in practice to approach $H(S)$.
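We can observe this convergence numerically by applying the Shannon lengths $\lceil -\log_2 p \rceil$ to blocks of $n$ messages; a sketch with an assumed two-message source:

```python
import math
from itertools import product

# Per-message average length of a Shannon code applied to the n-th extension
# S^n of a two-message source (the probabilities 0.9 / 0.1 are an assumption).
p = {"a": 0.9, "b": 0.1}

def shannon_length_per_message(n):
    avg = 0.0
    for block in product(p, repeat=n):            # all sequences of n messages
        pb = math.prod(p[s] for s in block)       # probability of the block
        avg += pb * math.ceil(-math.log2(pb))     # Shannon length of the block
    return avg / n

H = -sum(pi * math.log2(pi) for pi in p.values())   # about 0.469 bits/message
for n in (1, 2, 4, 8):
    print(n, shannon_length_per_message(n))         # approaches H as n grows
```

The theorem guarantees $H(S) \leq \overline{l}/n < H(S) + 1/n$, so the per-message length is squeezed toward the entropy.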

5.4 Shannon-Fano coding (binary)

The Shannon-Fano (binary) coding procedure is presented below.

It’s easier to understand the Shannon-Fano coding by looking at an example at the whiteboard, or at a solved exercise.

Shannon-Fano coding does not always produce the shortest code lengths, but in general it is better than Shannon coding in this respect.

As a remark, there is a clear connection between Shannon-Fano coding and the optimal decision tree examples from the first chapter (with yes-no answers). Splitting the messages into two groups is similar to splitting the decision tree into two branches.

5.5 Huffman coding (binary)

Huffman coding produces a code with the smallest average length (better than Shannon-Fano). It can be proven (though we won’t give the proof here) that no other algorithm which assigns a separate codeword for each message can produce a code with smaller average length than Huffman coding. Some better algorithms do exist, but they work differently: they do not assign a codeword to every single message, instead they code a whole sequence at once.

Assigning 0 and 1 in the backwards pass can be done in any order. This leads to different codewords, but with the same codeword lengths, so the resulting codes are equivalent.

When inserting a sum into an existing list, it may be equal to another value(s), so we may have several options where to insert: above, below or in-between equal values. This leads to codes with different individual lengths, but same average length. In general, if there are several options available, it is better to insert the sum as high as possible in the list, because this leads to a more balanced tree, and therefore the codewords tend to have more equal lengths. This is helpful in practice in things like buffering. Inserting as low as possible tends to produce codewords with more widely varying lengths, which increases the demands on the buffers.
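A compact sketch of binary Huffman coding using a priority queue (message names and probabilities are illustrative; as noted above, ties may be broken in different orders, giving different but equivalent codes):

```python
import heapq, itertools

# A sketch of binary Huffman coding: repeatedly merge the two least probable
# groups, prepending 0 to one group's codewords and 1 to the other's.
def huffman(probs):
    """probs: dict message -> probability; returns dict message -> binary codeword."""
    counter = itertools.count()          # tie-breaker so the heap never compares dicts
    heap = [(p, next(counter), {m: ""}) for m, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # the two least probable groups...
        p1, _, c1 = heapq.heappop(heap)
        merged = {m: "0" + w for m, w in c0.items()}
        merged.update({m: "1" + w for m, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))  # ...are merged
    return heap[0][2]

probs = {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.1}
code = huffman(probs)
avg = sum(probs[m] * len(w) for m, w in code.items())
print(code)
print(round(avg, 3))   # -> 1.9
```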

5.6 Huffman coding (M symbols)

All coding procedures can be extended to the general case of $M$ symbols $\left\lbrace x_1, x_2, \ldots, x_M \right\rbrace$. We illustrate here the generalization for Huffman coding.

Procedure:

Example : blackboard

5.7 Probability of symbols

Given a code $C$ with symbols $x_i$, we can compute the probability of occurrence of each symbol in the code.

The average number of occurrences of a symbol $x_i$ is:

$$\overline{l}_{x_i} = \sum_i p(s_i) l_{x_i}(s_i)$$

where $l_{x_i}(s_i)$ is the number of symbols $x_i$ in the codeword of $s_i$ (e.g. the number of 0’s or 1’s in a binary codeword).

If we divide this by the average length of the code, $\overline{l}$, we obtain the probability of the symbol $x_i$:

$$p(x_i) = \frac{\overline{l}_{x_i}}{\overline{l}}$$

These are the probabilities of the input symbols for the transmission channel, which play an important role in Chapter IV (transmission channels).
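A small sketch computing these symbol probabilities for an assumed source and binary code:

```python
# Probability of each code symbol for an assumed source and binary code.
probs = {"s1": 0.5, "s2": 0.25, "s3": 0.125, "s4": 0.125}
code  = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}

avg_len = sum(probs[m] * len(w) for m, w in code.items())   # average code length
for x in "01":
    avg_count = sum(probs[m] * w.count(x) for m, w in code.items())
    print(x, avg_count / avg_len)   # -> 0.5 for each symbol in this example
```

For this particular code the two symbols come out equiprobable, which is the situation carrying the most information per transmitted symbol.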

6 Source coding as data compression

Many times, the input messages are already written in a binary code, such as the ASCII code for characters.

In this case, source coding can be understood as a form of data compression, by replacing the original binary codewords with different ones, which are shorter on average, thus using fewer bits in total to represent the same information.

This process is similar to removing predictable or unnecessary bits from a sequence, thus reducing the redundancy of the original data. The resulting compressed data appears random and patternless.

To obtain the original data back, the decoder must undo the coding procedure. If there were no errors during the transmission, the original data is recovered exactly. This makes it a lossless compression method.

As a final note, we mention that other coding methods do exist beyond Huffman coding, which provide better results (arithmetic coding) or other advantages (adaptive schemes).