
Discrete Information Sources


1 A mathematical definition of information

In order to analyze information generation, encoding and transmission with mathematical tools, we need a solid and clear definition of information.

So, what is information?

Let’s start first with a simple example. Imagine someone rolls a dice and tells you the resulting value:

“The value is 5”

Does this message carry information? How, why, how much?

When thinking about the information brought by such a phrase, let’s consider the following principles:

All this suggests that information is related to the probability of some event happening, in an inverse manner: the lower the probability, the higher the information.

1.1 Definition of information

Throughout this class, we refer to a probabilistic event as a “message”.

Consider a message (event) $s_i$ which happens with probability $p(s_i)$. The information attached to $s_i$ is rigorously defined as:

$$i(s_i) = -\log_2(p(s_i))$$
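To make the definition concrete, here is a minimal Python sketch (the probability $1/6$ is the fair-dice value from the example above):

```python
import math

def information(p):
    """Information, in bits, of an event with probability p."""
    return -math.log2(p)

# "The value is 5" for a fair dice: probability 1/6
print(information(1 / 6))   # ≈ 2.585 bits
# A certain event (p = 1) carries no information
print(information(1.0))     # 0.0
```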

Consequences of this definition:

  1. Information is never negative: $i(s_i) \geq 0$, since $0 \leq p(s_i) \leq 1$.

  2. A certain event ($p(s_i) = 1$) carries no information: $-\log(1) = 0$.

  3. The information of two independent events adds up:

$$\begin{align*} i(s_i \cap s_j) &= -\log_2(p(s_i \cap s_j)) \\ &= -\log_2(p(s_i) \cdot p(s_j)) \\ &= -\log_2(p(s_i)) - \log_2(p(s_j)) \\ &= i(s_i) + i(s_j) \end{align*}$$

Information is a purely abstract mathematical quantity: it is related neither to the meaning of the message nor to its particular form. It does not matter whether the message is written as a sentence in English, in some other language, or with different signs. The information attached to an event simply states that “this event happened”, and depends only on the probability of that event.

Encoding the information as a particular sequence of symbols (letters, bits etc.) is a different matter, and shall be discussed in later chapters.

1.2 Choice of logarithm

Using the logarithm function in the definition is crucial, since it is responsible for most of these properties. In particular, the fact that the logarithm transforms a product into a sum allows us to sum the information of independent events.

The choice of logarithm base is merely a convention. Any base of logarithm can be used in the definition, not just base 2, and all the consequences still hold.

The choice of base is not critical and doesn’t change anything fundamental, since logarithm bases can always be converted to/from one another:

$$\log_b(x) = \frac{\log_a(x)}{\log_a(b)}$$

This means that information values defined using different logarithm bases differ only by a scaling factor:

$$i_b(s_i) = \frac{i_a(s_i)}{\log_a(b)}$$

Following the convention of the scientific literature, we shall use the base 2 logarithm $\log_2()$ from now on.
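The base-conversion rule can be checked numerically with a short Python sketch (the probability 0.25 is chosen here just for illustration):

```python
import math

def information_base(p, base=2):
    """Information of an event with probability p, using the given logarithm base."""
    return -math.log(p) / math.log(base)

i_bits = information_base(0.25, base=2)       # 2.0 bits
i_nats = information_base(0.25, base=math.e)  # natural-logarithm units
# The two values differ only by the scaling factor log_e(2)
print(i_bits, i_nats / math.log(2))
```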

1.3 Information Sources

A probabilistic event is always part of a set of multiple events, containing all the possible outcomes which can happen at a given time. Continuing the previous example, the value we obtained was 5, but we could have obtained any integer between 1 and 6. There are 6 messages, each having its own probability, all known beforehand. At a given time, only one of the events can happen, and it carries the information that it happened (out of all the possible events).

We define an information source as the set of all events, together with their probabilities. The set of all messages forms the “alphabet” of the source. When an event takes place, we say that “the information source generates a message”.

We are very rarely interested in a single message. Instead, we are interested in analyzing large amounts of messages. An information source creates a sequence of messages by generating messages one after another, randomly, according to the known probabilities (e.g. like throwing a coin or a dice several times in a row).

Depending on how the messages are generated, we distinguish between two types of information sources:

  1. Memoryless sources: each new message is generated independently of the previous messages

  2. Sources with memory: when generating a new message, the probabilities depend on one or more of the previous messages

2 Discrete memoryless sources

The set of probabilities is the distribution of the source, also known as a probability mass function.

We represent a DMS as below, by giving it a name ($S$), listing the messages ($s_1, s_2, s_3$) and the probability distribution:

$$S: \begin{pmatrix} s_1 & s_2 & s_3 \\ \frac{1}{2} & \frac{1}{4} & \frac{1}{4} \end{pmatrix}$$

A DMS is a discrete, complete and memoryless information source. Below we give the definition of these terms.

A good example of a DMS is a coin, or a dice.

One message generated by a DMS is also called a random variable in probability theory.

A DMS produces a sequence of messages by randomly selecting a message every time, with the same fixed probabilities, producing a sequence like:

$$s_3 s_2 s_4 s_1 s_2 s_1 s_3 \dots$$

For example, throwing a dice several times in a row, you can get a sequence

$$4, 2, 3, 2, 1, 6, 1, 5, 4, 5 \dots$$

In a very long sequence, with length $N \to \infty$, each message $s_i$ appears approximately $p(s_i) \cdot N$ times. This approximation gets more precise as $N$ gets larger.

2.1 Entropy of a DMS

We usually don’t care about the information of a single message. We are interested in long sequences of messages (think millions of messages, like millions of bits of data). To analyze this in an easy manner, we need the average information of a message from a DMS.

In general, the average value of a set of quantities $x_k$ is defined as

$$\overline{x} = \sum_{k} p(x_k) \cdot x_k$$

where $x_k$ are the individual values and $p(x_k)$ are their probabilities (or weights).

For a DMS, because the messages are independent, we can compute the entropy as the weighted average of the information of each message:

$$H(S) = \sum_i p(s_i) \cdot i(s_i) = -\sum_i p(s_i) \cdot \log_2(p(s_i))$$

Since information of a message is measured in bits, entropy is measured in bits (or bits / message, to indicate it is an average value).
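The weighted-average computation can be sketched directly in Python (the distribution $1/2, 1/4, 1/4$ is the example source from above):

```python
import math

def entropy(probs):
    """Entropy of a DMS, in bits/message: the average information of a message."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The example source S with probabilities 1/2, 1/4, 1/4
print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits/message
```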

Entropies defined using different logarithm bases differ only by a scaling factor. If $H_a(S)$ is the entropy computed using the logarithm base $a$, and $H_b(S)$ is the entropy computed using the logarithm base $b$, then these values are related through simple scaling:

$$H_b(S) = \frac{H_a(S)}{\log_a(b)}$$

However, as mentioned earlier, we use the convention based on logarithm base 2. In this case we simply denote the entropy as $H(S)$.

2.2 Interpretation of the entropy

The entropy of an information source S is a fundamental quantity. It allows us to compare two different sources, which model different scenarios from real life.

All the following interpretations of entropy are true:

We shall see in Chapter III that the entropy $H(S)$ says something very important about the number of bits required to represent data in binary form:

Thus, $H(S)$ is crucial when we discuss how to represent data efficiently.

2.3 Properties of entropy

We prove the following three properties for the entropy of a DMS:

2.4 Entropy of a binary source

Consider a general DMS with two messages:

$$S: \begin{pmatrix} s_1 & s_2 \\ p & 1-p \end{pmatrix}$$

Its entropy is:

$$H(S) = -p \cdot \log(p) - (1-p) \cdot \log(1-p)$$

The entropy value as a function of $p$ is represented below:

Figure 1: Entropy of a binary source

As an illustration of property no. 2 from above, we can see that the maximum entropy of a DMS with two messages is reached when the two messages have the same probability, $p = 0.5$:

$$p(s_1) = p = 0.5, \quad p(s_2) = 1 - p = 0.5$$

and its value is:

$$H_{max}(S) = 1$$
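The curve from the figure can be reproduced with a short Python sketch of the binary entropy function:

```python
import math

def binary_entropy(p):
    """Entropy of a DMS with two messages, with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0   # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0, the maximum value
print(binary_entropy(0.1))   # ≈ 0.469: a more predictable source has lower entropy
```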

2.5 Study case: games of guessing numbers

Let’s analyze the following guessing games with the tools introduced until now.

2.6 Efficiency, redundancy, flow

Using $H(S)$, we define several other useful characteristics of a DMS.

The efficiency of a DMS indicates how close the entropy is to its maximum possible value:

$$\eta = \frac{H(S)}{H_{max}} = \frac{H(S)}{\log(n)}$$

where $n$ is the number of messages of the source.

The redundancy of a source is the remaining gap. We can define an absolute redundancy and a relative redundancy.

Absolute redundancy of a DMS:

$$R = H_{max} - H(S)$$

Relative redundancy of a DMS:

$$\rho = \frac{H_{max} - H(S)}{H_{max}} = 1 - \eta$$

Suppose that each message $s_i$ takes some time $t_i$ to be transmitted via some communication channel. The information flow of a DMS $S$ is the average information transmitted per unit of time:

$$H_\tau(S) = \frac{H(S)}{\overline{t}},$$

where $\overline{t}$ is the average duration of transmitting a message:

$$\overline{t} = \sum_{i} p_i t_i$$

The information flow is measured in bps (bits per second), and is important for data communication.
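All these quantities can be computed together in a short sketch; the distribution and the transmission times below are assumptions chosen only for illustration:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]   # example distribution (assumed)
times = [1.0, 2.0, 2.0]     # transmission time of each message, in seconds (assumed)

H = entropy(probs)
H_max = math.log2(len(probs))                     # maximum entropy: log(n)
eta = H / H_max                                   # efficiency
R = H_max - H                                     # absolute redundancy
rho = 1 - eta                                     # relative redundancy
t_avg = sum(p * t for p, t in zip(probs, times))  # average message duration
flow = H / t_avg                                  # information flow, in bps
print(eta, R, rho, flow)
```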

2.7 The Kullback-Leibler distance

Suppose we have the following two DMS:

$$P: \begin{pmatrix} s_1 & s_2 & s_3 & s_4 \\ 0.14 & 0.29 & 0.4 & 0.17 \end{pmatrix} \quad Q: \begin{pmatrix} s_1 & s_2 & s_3 & s_4 \\ 0.13 & 0.33 & 0.43 & 0.11 \end{pmatrix}$$

The probability values of $P$ and $Q$ are close, so the two sources are similar. But exactly how similar? Can we quantify the “closeness” of the two sources?

In many applications we need a way to quantify how similar or how different two probability distributions are. The Kullback-Leibler distance (also known as the “Kullback-Leibler divergence” or “relative entropy”) is a way to quantify numerically how different one distribution is from another.

The Kullback–Leibler (KL) distance of two distributions P and Q is:

$$D_{KL}(P,Q) = \sum_i p(s_i) \log\left(\frac{p(s_i)}{q(s_i)}\right)$$

The two distributions must have the same number of messages.

The KL distance provides a meaningful way to measure the distance (difference) between two distributions. In many ways it provides the same intuitions as a geometrical distance:

  1. $D_{KL}(P, Q)$ is always $\geq 0$, and is equal to 0 only when P and Q are the same

  2. The higher $D_{KL}(P, Q)$ is, the more different the two distributions are

However, one important property is not satisfied, and for this reason the KL distance is not a proper distance function in the mathematical sense. The KL distance is not symmetric:

$$D_{KL}(P, Q) \neq D_{KL}(Q, P)$$

Despite this, it is widely used in applications.
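The two sources $P$ and $Q$ from above can be compared with a short Python sketch:

```python
import math

def kl_distance(p, q):
    """Kullback-Leibler distance between two distributions, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.14, 0.29, 0.40, 0.17]
Q = [0.13, 0.33, 0.43, 0.11]

print(kl_distance(P, Q))   # a small positive value: the sources are similar
print(kl_distance(Q, P))   # a different value: the KL distance is not symmetric
print(kl_distance(P, P))   # 0.0: a distribution is at distance 0 from itself
```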

2.8 Extended DMS

The n-th order extension of a DMS $S$, represented as $S^n$, is a DMS which has as messages $\sigma_i$ all the combinations of $n$ messages of $S$:

$$\sigma_i = \underbrace{s_j s_k ... s_l}_{n}$$

If $S$ has $k$ messages, $S^n$ has $k^n$ messages.

Since $S$ is a DMS, consecutive messages are independent of each other, and therefore their probabilities are multiplied:

$$p(\sigma_i) = p(s_j) \cdot p(s_k) \cdot ... \cdot p(s_l)$$

An example is provided below:

$$S: \begin{pmatrix} s_1 & s_2 \\ \frac{1}{4} & \frac{3}{4} \end{pmatrix}$$

$$S^2: \begin{pmatrix} \sigma_1 = s_1 s_1 & \sigma_2 = s_1 s_2 & \sigma_3 = s_2 s_1 & \sigma_4 = s_2 s_2 \\ \frac{1}{16} & \frac{3}{16} & \frac{3}{16} & \frac{9}{16} \end{pmatrix}$$

$$S^3: \left( \begin{matrix} s_1 s_1 s_1 & s_1 s_1 s_2 & s_1 s_2 s_1 & s_1 s_2 s_2 & s_2 s_1 s_1 & s_2 s_1 s_2 & s_2 s_2 s_1 & s_2 s_2 s_2 \\ ... & ... & ... & ... & ... & ... & ... & ... \end{matrix} \right)$$
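The construction of an extension can be sketched in Python with `itertools.product` (the source with probabilities $1/4, 3/4$ is the example above):

```python
from itertools import product
from math import prod

def extend(messages, probs, n):
    """Build the n-th order extension of a DMS: all n-tuples of messages,
    with probabilities multiplied (consecutive messages are independent)."""
    combos = list(product(range(len(messages)), repeat=n))
    ext_messages = ["".join(messages[i] for i in c) for c in combos]
    ext_probs = [prod(probs[i] for i in c) for c in combos]
    return ext_messages, ext_probs

msgs, ps = extend(["s1", "s2"], [1/4, 3/4], 2)
print(msgs)   # ['s1s1', 's1s2', 's2s1', 's2s2']
print(ps)     # [0.0625, 0.1875, 0.1875, 0.5625] = 1/16, 3/16, 3/16, 9/16
```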

Extended DMS are useful because they provide a way to group messages inside a sequence of messages.

Suppose we have a long sequence of binary messages:

$$0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 0$$

What kind of source generated this sequence?

  1. We can view it as a sequence of 16 messages generated from a source $S_1$ with two messages, $s_1 = 0$ and $s_2 = 1$

  2. We can group two bits, and view it as a sequence of 8 messages generated from a source $S_2$ with messages 00, 01, 10, 11

  3. We can group 8 bits into bytes, and view it as a sequence of 2 messages from a DMS $S_8$ whose alphabet is the 256 possible byte values

  4. ... and so on

There must be a connection between these DMS, no matter how we group the bits, since we’re talking about the same binary sequence. The connection is that they are all just n-th order extensions of the initial binary DMS.

2.9 Entropy of an extended DMS

We now prove an important theorem about extended DMS.

This theorem has a nice interpretation: when we have a long sequence of messages, we may group together blocks of $n$ messages as we like; this does not change the total information of the sequence. This makes sense because we’re talking about the same sequence, even if we group bits into half-bytes, bytes, 32-bit words etc.

For example, imagine when we have a sequence of 1000 bits:

$$0,1,1,0,0,1,0, \dots, 0,1$$

We can think of this sequence as being 1000 bits from a binary source $S$ which generates 0’s and 1’s. The total information in the sequence is $\approx 1000 \cdot H(S)$.

However, if we group 8 bits into 1 byte, we can think of the same sequence as being 125 bytes from a source which produces bytes. This source is simply the extension $S^8$ of the original source $S$, since its messages are just groupings of 8 messages from the original source. Its entropy is therefore $8 \cdot H(S)$. Hence the total information in the sequence is the same: $\approx 125 \cdot 8 \cdot H(S)$.

The theorem shows that both interpretations are correct. Grouping 8 messages of a source results in compound messages which are 8 times fewer, but carry 8 times more average information, so the total information is the same.
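The theorem can also be checked numerically; this sketch reuses the example probabilities $1/4, 3/4$ and a third-order extension:

```python
import math
from itertools import product
from math import prod

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [1/4, 3/4]
n = 3
# Probabilities of the n-th extension: products over all n-tuples
ext_probs = [prod(c) for c in product(probs, repeat=n)]

# H(S^n) equals n * H(S)
print(entropy(ext_probs), n * entropy(probs))
```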

2.10 DMS as models for language generation

We use information sources as mathematical models for real-life data generation and analysis. A straightforward example is in text analysis, since text is basically a sequence of graphical symbols (letters and punctuation signs), similar to a sequence of messages from an information source.

Is a DMS a good model for text? Let’s take the following example, for the English language. The probability distribution of letters in English (26 letters, ignoring capitalization) is given below [1]:

Figure 2: Probability distribution of letters in English

Let’s imagine a DMS which has the 26 letters as messages, with these probabilities. Generating a sequence of letters from this DMS produces the following:

Figure 3: Text generated from a DMS with English letter probabilities

This doesn’t look like English. What’s wrong?

A DMS is memoryless, which means that every message is generated irrespective of the previous ones. This is not a good model for written text. In a real language, the frequency of a letter depends a lot on the previous letters. For example, a is a common letter in English (probability $8.2\%$), but if the previous letter is also a, the probability is close to zero, because the group aa is extremely rare. Similarly, h has a much higher probability if the previous letter is t than if the previous letter is x.

The DMS does not capture the dependencies between letters, because the memoryless property is too restrictive. We need to consider sources with memory.

3 Sources with memory

The last $m$ messages define the state of the source, which is denoted as $S_i$. We say that the source “is in the state $S_i$”. Sources with memory are also known as Markov sources.

A source with $n$ messages and memory $m$ has a number of states equal to $n^m$.

A source with memory generates messages randomly, but the message probabilities are different depending on which state the source is in. We use the notation:

$$p(s_i | S_k)$$

to refer to the probability of message $s_i$ being generated when the source is in state $S_k$. This is called the conditional probability of message $s_i$ given state $S_k$.

3.1 Transition matrix

When a new message is provided, the source transitions to a new state:

$$\dots \underbrace{s_i s_j s_k}_{\text{old state}} s_l$$
$$\dots s_i \underbrace{s_j s_k s_l}_{\text{new state}}$$

Therefore we can view the conditional probabilities of messages $s_i$ as transition probabilities from some state $S_u$ to another state $S_v$. We can write:

$$p(s_i | S_u) = p(S_v | S_u)$$

which means that generating some message $s_i$ while in state $S_u$ is the same as transitioning into state $S_v$ from state $S_u$.

The transition probabilities are organized in a transition matrix $[T]$ of size $N \times N$, where $N$ is the total number of states.

$$[T] = \begin{bmatrix} p_{11} & p_{12} & ... & p_{1N} \\ p_{21} & p_{22} & ... & p_{2N} \\ ... & ... & ... & ... \\ p_{N1} & p_{N2} & ... & p_{NN} \end{bmatrix}$$

The element $p_{ij}$ from $[T]$, located on row $i$ and column $j$, is the transition probability from state $S_i$ to state $S_j$:

$$p_{ij} = p(S_j | S_i)$$

Note that the sum of the probabilities on each row is 1, because the source must move to some state after generating a message.

3.2 Graphical representation

The transition matrix which defines a source with memory can be represented graphically, as a directed graph where the vertices are the states and the edges are the transitions. Every edge (transition) has a certain probability.

The graphical representation of the previous example source is:

Figure 4: A source with memory $m=1$ and $n=4$ messages

For example, generating message $s_3$ while in state $S_2$ means we move from state $S_2$ to state $S_3$ (defined as last message $s_2$), so we have an arrow from $S_2$ to $S_3$ with probability 0.15.

Note that the sum of the probabilities of the transitions exiting from a state is 1, since the source must move to some state after generating a message. It is possible that the source stays in the same state, which is represented as a self-loop.

3.3 Entropy of sources with memory

How to compute the entropy of a source with memory?

In each state $S_k$ there is a different distribution, so each state can be viewed as a kind of DMS. We can compute an entropy $H(S_k)$ for every state $S_k$, using the same formula as for a DMS:

$$H(S_k) = - \sum_i p(s_i | S_k) \cdot \log(p(s_i | S_k))$$

However, $H(S_k)$ is the entropy of a single state. While operating, the source moves from state to state, and it can spend more time in one state than in another.

The global entropy of a source with memory is the average entropy of the states:

$$\begin{aligned} H(S) &= \sum_k p_k H(S_k) \\ &= - \sum_k p_k \sum_i p(s_i | S_k) \cdot \log(p(s_i | S_k)) \end{aligned}$$

The probabilities $p_k$ are known as the stationary probabilities, and they represent the probability that the source is in state $S_k$ at a given moment. Considering that the source operates for a very long time and generates a very long sequence of messages, you can think of $p_k$ as the fraction of time when the source was in state $S_k$.

Stationary probabilities

How to find the stationary probabilities $p_k$? To find this, we first need to answer the following question:

If we know the state $S_k$ at time $n$, what will be the state at time $n+1$?

Let $p_i^{(n)}$ denote the probability that the source is in state $S_i$ at time $n$. The source generates a message. In what state will the source end up at time $n+1$?

The probabilities of the states at time $n+1$, $p_i^{(n+1)}$, are found by multiplying with $[T]$:

$$[p_1^{(n)}, p_2^{(n)}, ... , p_N^{(n)}] \cdot [T] = [p_1^{(n+1)}, p_2^{(n+1)}, ... , p_N^{(n+1)}]$$

After one additional message, at time $(n+2)$? Multiply once more with $[T]$:

$$[p_1^{(n)}, p_2^{(n)}, ... , p_N^{(n)}] \cdot [T] \cdot [T] = [p_1^{(n+2)}, p_2^{(n+2)}, ... , p_N^{(n+2)}]$$

For every new moment of time, we do one more multiplication with $[T]$. In general, if we start from time 0, after $n$ messages, the probabilities that the source ends up in a certain state are:

$$[p_1^{(0)}, p_2^{(0)}, ... , p_N^{(0)}] \cdot [T]^{n} = [p_1^{(n)}, p_2^{(n)}, ... , p_N^{(n)}]$$
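This evolution is easy to sketch with NumPy; the 2-state transition matrix below is a made-up example, not one from the text:

```python
import numpy as np

# Hypothetical transition matrix of a 2-state source; each row sums to 1
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])

p0 = np.array([1.0, 0.0])   # assume the source starts in state S1

# State probabilities after n messages: p0 @ T^n
for n in (1, 5, 50):
    print(n, p0 @ np.linalg.matrix_power(T, n))
# The values converge to the same probabilities regardless of the start
```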

However, in general we don’t know the initial state or the initial probabilities.

Ergodicity

To work around the problem of the unknown initial state, we make use of a property called “ergodicity”.

If an ergodic source runs for a very long time $M \to \infty$, it will go through all transitions and all states many times, and, eventually, the fraction of time it finds itself in a certain state $S_k$ stabilizes. This happens irrespective of what was the starting state. Intuitively, the initial state is not important anymore if the source will anyway travel through all states and transitions many times, as $M \to \infty$.

We formalize this as the following property of an ergodic source with memory:

Finding the stationary probabilities

The ergodicity property helps us find the values of the stationary probabilities. When $n$ is very large, the probabilities after $n$ messages and after $n+1$ messages are the same, and therefore the following equation holds:

$$[p_1, p_2, ... p_N] \cdot [T] = [p_1, p_2, ... p_N]$$

Note that we dropped the time exponent indicator $(n)$ or $(n+1)$, since the values have converged to fixed values and the time doesn’t matter anymore.

This is an equation system in matrix form, with $N$ unknowns and $N$ equations.

However, the system is rank-deficient, i.e. one row is actually a linear combination of the others. Because of this, one row of the system should be removed, and replaced with a new equation which reflects the fact that the sum of probabilities is 1:

$$p_1 + p_2 + ... + p_N = 1$$

With this new equation, we obtain a complete system, which has a unique solution.
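The procedure translates directly into a small NumPy sketch; the transition matrix is a made-up 2-state example:

```python
import numpy as np

# Hypothetical transition matrix (rows sum to 1)
T = np.array([[0.9, 0.1],
              [0.4, 0.6]])
N = T.shape[0]

# Solve p [T] = p together with p_1 + ... + p_N = 1.
# The first equation rewrites as (T^t - I) p^t = 0; one of its rows is
# redundant, so we replace it with the sum constraint.
A = T.T - np.eye(N)
A[-1, :] = 1.0              # last row becomes: p_1 + ... + p_N = 1
b = np.zeros(N)
b[-1] = 1.0

p = np.linalg.solve(A, b)   # stationary probabilities
print(p)
```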

3.4 Example: Modelling English

This example is taken from “Elements of Information Theory” by Cover & Thomas.

Let us consider a sequence of information sources modelling the English language, going progressively from the most rudimentary model (memoryless) to the most advanced (a source with memory of large order).

Let’s look at a sample sequence of letters generated from these sources.

  1. Text generated by memoryless source, where all letters have equal probabilities:

  2. A memoryless source, but where the probabilities of the letters are the ones in English:

  3. A source with memory of order $m=1$, i.e. the frequencies of letter pairs are as in English:

  4. A source with memory $m=2$, i.e. the frequencies of letter triplets are as in English:

  5. A source with memory $m=3$, i.e. the frequencies of 4-plets are as in English:

The generated text, while still random, gradually looks more and more like actual English as the memory order increases.

Sources with more memory capture the statistical dependencies between the letters better, and because of this the generated text looks more and more like English.

3.5 Working with information sources

When using information sources to model a real-life process, you will encounter some typical use-cases.

  1. How to train an information source (e.g. find the probabilities)

  2. Generate sequences from a source

  3. Compute the probability of an existing sequence

Training an information source model

By training a model we mean finding the correct value of the parameters. In our case, the parameters are the probabilities of messages.

For simplicity, we consider the case of text analysis in a certain language (e.g. English).

First, we need a large sample of text, representative of the language.

How to find the probabilities of a DMS?

Go through your sample text, count the occurrences of every distinct character, then divide each count by the total number of characters. You obtain the probability of each individual character.
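A minimal training sketch in Python, using a toy string as the “sample text”:

```python
from collections import Counter

def train_dms(text):
    """Estimate DMS probabilities by counting character occurrences."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

probs = train_dms("abracadabra")
print(probs["a"])            # 5 occurrences out of 11 characters
print(sum(probs.values()))   # the probabilities sum to 1
```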

How to find the probabilities of a source with memory of order $m$?

A direct approach to find the transition matrix $[T]$ is as follows:

One drawback is the huge amount of memory required to count all the possible combinations. The memory requirement increases at least exponentially with the memory order $m$. For example, if there are 26 letters, a source with memory of order $m = 3$ has $26^3 = 17576$ states, while a source with $m=4$ has $26^4 = 456976$ states. Considering capital letters and punctuation signs, the numbers are much larger. This makes it very cumbersome to work with sources of large memory, at least with such a brute-force approach.

Generating messages from a source

Supposing we have a source, we generate sequences by generating messages one after another.

For a DMS, we generate each message independently of all the previous ones.

For a source with memory $m$, described by a transition matrix $[T]$, we generate a message according to the transition probabilities from the row of $[T]$ corresponding to the current state. Then we update the current state and repeat the process.

For a source with memory, we must also specify how to generate the first $m$ messages, i.e. before we have the $m$ previous messages which define a full state.
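A generation sketch for an order-1 source, with a made-up two-letter transition matrix, and with the first message simply fixed by convention:

```python
import random

# Hypothetical order-1 source over {'a', 'b'}; the state is the last letter.
# T[state][message] = conditional probability of the message in that state.
T = {"a": {"a": 0.1, "b": 0.9},
     "b": {"a": 0.7, "b": 0.3}}

def generate(T, length, first="a"):
    """Generate a sequence of messages; the first message is fixed here."""
    seq = [first]
    for _ in range(length - 1):
        row = T[seq[-1]]   # distribution for the current state
        nxt = random.choices(list(row), weights=list(row.values()))[0]
        seq.append(nxt)
    return "".join(seq)

print(generate(T, 20))
```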

Compute the probability of an existing sequence

Suppose we have a sequence Seq of messages, e.g.

“wirtschaftwissenschaftler”

If we have an information source, how do we compute the probability that the sequence Seq was generated from the source?

Multiplying many probabilities quickly results in a very small number, which makes it problematic on a digital device due to numerical errors.

To avoid this, instead of probabilities we can work with the log-probabilities, i.e. $\log(p)$, which allows for much more manageable values. Instead of multiplying probabilities, we sum the log-probabilities.

$$\begin{gather*} P = p(w) \cdot p(i) \cdot p(r) \cdot \dots \cdot p(r) \\ \log(P) = \log(p(w)) + \log(p(i)) + \log(p(r)) + \dots + \log(p(r)) \end{gather*}$$
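A quick Python sketch of scoring a sequence under a DMS model, with made-up letter probabilities for illustration:

```python
import math

def log_probability(seq, probs):
    """Log2-probability of a sequence under a memoryless (DMS) model."""
    return sum(math.log2(probs[ch]) for ch in seq)

# Hypothetical letter probabilities (assumption, for illustration only)
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

logp = log_probability("abcab", probs)
print(logp)        # a manageable negative number
print(2 ** logp)   # the actual probability is already tiny
```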

3.6 Sample applications

Predict text

Guess language of a sentence

Footnotes
  1. Image taken from the book “Elements of Information Theory” by Cover & Thomas.