Computing entropy of files

8. Computing entropy of files#

8.1. Introduction#

In this application we treat a file like a sequence of bytes generated by a Discrete Memoryless Source (DMS).

We estimate the entropy of the file by computing the empirical frequency of each byte, and then using the formula for the entropy of a DMS.

\[ H(X) = - \sum_{i=0}^{255} p_i \log_2(p_i) \]

where \(p_i\) is the frequency of the byte \(i\) in the file.

8.2. Walkthrough#

We estimate the frequency of every byte by counting.

First, let’s open a file and read its content, incrementing a counter for each byte. We’re going to use the read_bytes() metod available in Pathlib package.

import pathlib

filename = "data/texten.txt"

# Prepare counters
counts = [0] * 256

for byte in pathlib.Path(filename).read_bytes():
    counts[byte] += 1

Now, let’s print some statistics about the counters:

print(f"Total bytes = {sum(counts)}")

print(f"{'Code':>10} {'Char':>10} {'Count':>10}")
for i, count in enumerate(counts):
    if chr(i).isalpha():
        print(f"{i:>10d} {i:>10c} {count:>10d}")
Total bytes = 16407
      Code       Char      Count
        65          A        147
        66          B         88
        67          C         57
        68          D         63
        69          E        110
        70          F         29
        71          G         58
        72          H         45
        73          I        228
        74          J          3
        75          K          5
        76          L         54
        77          M         44
        78          N         59
        79          O        190
        80          P         19
        81          Q          0
        82          R        111
        83          S         70
        84          T        128
        85          U         12
        86          V         13
        87          W         48
        88          X          1
        89          Y         22
        90          Z          2
        97          a        697
        98          b        127
        99          c        202
       100          d        360
       101          e       1164
       102          f        211
       103          g        181
       104          h        632
       105          i        615
       106          j          7
       107          k         75
       108          l        355
       109          m        262
       110          n        624
       111          o        863
       112          p        134
       113          q          7
       114          r        613
       115          s        653
       116          t        827
       117          u        363
       118          v        117
       119          w        189
       120          x          8
       121          y        237
       122          z          3
       170          ª          0
       181          µ          0
       186          º          0
       192          À          0
       193          Á          0
       194          Â          0
       195          Ã          0
       196          Ä          0
       197          Å          0
       198          Æ          0
       199          Ç          0
       200          È          0
       201          É          0
       202          Ê          0
       203          Ë          0
       204          Ì          0
       205          Í          0
       206          Î          0
       207          Ï          0
       208          Ð          0
       209          Ñ          0
       210          Ò          0
       211          Ó          0
       212          Ô          0
       213          Õ          0
       214          Ö          0
       216          Ø          0
       217          Ù          0
       218          Ú          0
       219          Û          0
       220          Ü          0
       221          Ý          0
       222          Þ          0
       223          ß          0
       224          à          0
       225          á          0
       226          â          0
       227          ã          0
       228          ä          0
       229          å          0
       230          æ          0
       231          ç          0
       232          è          0
       233          é          0
       234          ê          0
       235          ë          0
       236          ì          0
       237          í          0
       238          î          0
       239          ï          0
       240          ð          0
       241          ñ          0
       242          ò          0
       243          ó          0
       244          ô          0
       245          õ          0
       246          ö          0
       248          ø          0
       249          ù          0
       250          ú          0
       251          û          0
       252          ü          0
       253          ý          0
       254          þ          0
       255          ÿ          0

Let’s print a histogram of the letter counters.

import matplotlib.pyplot as plt

chars_hist = [chr(i) for i in range(65, 122)]
count_hist = [counts[i] for i in range(65, 122)]

plt.bar(chars_hist, count_hist)
plt.show()
../../_images/498c9af709e5648ce592cb0518481ab06abb3124369b8ae44b235b40ebb7220e.png

We obtain probabilities by dividing each counter to the total:

probs = [count / sum(counts) for count in counts]

print(f"{'Code':>10} {'Char':>10} {'Count':>10} {'Prob':>10}")
for i, (count, prob) in enumerate(zip(counts, probs)):
    if chr(i).isalpha():
        print(f"{i:>10d}{i:>10c}{count:>10d}{prob:>20.6f}")
      Code       Char      Count       Prob
        65         A       147            0.008960
        66         B        88            0.005364
        67         C        57            0.003474
        68         D        63            0.003840
        69         E       110            0.006704
        70         F        29            0.001768
        71         G        58            0.003535
        72         H        45            0.002743
        73         I       228            0.013897
        74         J         3            0.000183
        75         K         5            0.000305
        76         L        54            0.003291
        77         M        44            0.002682
        78         N        59            0.003596
        79         O       190            0.011580
        80         P        19            0.001158
        81         Q         0            0.000000
        82         R       111            0.006765
        83         S        70            0.004266
        84         T       128            0.007802
        85         U        12            0.000731
        86         V        13            0.000792
        87         W        48            0.002926
        88         X         1            0.000061
        89         Y        22            0.001341
        90         Z         2            0.000122
        97         a       697            0.042482
        98         b       127            0.007741
        99         c       202            0.012312
       100         d       360            0.021942
       101         e      1164            0.070945
       102         f       211            0.012860
       103         g       181            0.011032
       104         h       632            0.038520
       105         i       615            0.037484
       106         j         7            0.000427
       107         k        75            0.004571
       108         l       355            0.021637
       109         m       262            0.015969
       110         n       624            0.038033
       111         o       863            0.052600
       112         p       134            0.008167
       113         q         7            0.000427
       114         r       613            0.037362
       115         s       653            0.039800
       116         t       827            0.050405
       117         u       363            0.022125
       118         v       117            0.007131
       119         w       189            0.011519
       120         x         8            0.000488
       121         y       237            0.014445
       122         z         3            0.000183
       170         ª         0            0.000000
       181         µ         0            0.000000
       186         º         0            0.000000
       192         À         0            0.000000
       193         Á         0            0.000000
       194         Â         0            0.000000
       195         Ã         0            0.000000
       196         Ä         0            0.000000
       197         Å         0            0.000000
       198         Æ         0            0.000000
       199         Ç         0            0.000000
       200         È         0            0.000000
       201         É         0            0.000000
       202         Ê         0            0.000000
       203         Ë         0            0.000000
       204         Ì         0            0.000000
       205         Í         0            0.000000
       206         Î         0            0.000000
       207         Ï         0            0.000000
       208         Ð         0            0.000000
       209         Ñ         0            0.000000
       210         Ò         0            0.000000
       211         Ó         0            0.000000
       212         Ô         0            0.000000
       213         Õ         0            0.000000
       214         Ö         0            0.000000
       216         Ø         0            0.000000
       217         Ù         0            0.000000
       218         Ú         0            0.000000
       219         Û         0            0.000000
       220         Ü         0            0.000000
       221         Ý         0            0.000000
       222         Þ         0            0.000000
       223         ß         0            0.000000
       224         à         0            0.000000
       225         á         0            0.000000
       226         â         0            0.000000
       227         ã         0            0.000000
       228         ä         0            0.000000
       229         å         0            0.000000
       230         æ         0            0.000000
       231         ç         0            0.000000
       232         è         0            0.000000
       233         é         0            0.000000
       234         ê         0            0.000000
       235         ë         0            0.000000
       236         ì         0            0.000000
       237         í         0            0.000000
       238         î         0            0.000000
       239         ï         0            0.000000
       240         ð         0            0.000000
       241         ñ         0            0.000000
       242         ò         0            0.000000
       243         ó         0            0.000000
       244         ô         0            0.000000
       245         õ         0            0.000000
       246         ö         0            0.000000
       248         ø         0            0.000000
       249         ù         0            0.000000
       250         ú         0            0.000000
       251         û         0            0.000000
       252         ü         0            0.000000
       253         ý         0            0.000000
       254         þ         0            0.000000
       255         ÿ         0            0.000000

Finally, let’s compute the entropy. We skip all frequencies equal to zero, since they don’t contribute to the entropy.

from math import log2

entropy = sum([-p * log2(p) for p in probs if p > 0])
print(f"Entropy = {entropy:.2f} bits per byte")
Entropy = 4.53 bits per byte

8.3. Function#

We will encapsulate the code in a function, so that we can reuse it conveniently

def entropy_of_file(filename):
    counts = [0] * 256
    for byte in pathlib.Path(filename).read_bytes():
        counts[byte] += 1
    probs = [count / sum(counts) for count in counts]
    return sum([-p * log2(p) for p in probs if p > 0])

Now let’s compute the entropy of various files:

entropy_of_file("data/texten.txt")
4.534092589335551
entropy_of_file("data/textro.txt")
4.569350422640342
entropy_of_file("data/Ceahlau.jpg")
7.96649855933222
entropy_of_file("data/ChromeSetup.exe")
7.898282999757679