9. Detect Language with Memory Sources#

9.1. Introduction#

The goal is to recognize the language of a given text, from a known list of languages (Romanian and English).

We will build an information source model for each language, treating it as a source with memory of order \(m=1\). This means we will estimate the probability of every letter given the previous letter. These transition probabilities are characteristic of each language, and they provide a simple way to distinguish between them.

9.2. Walkthrough#

Given a large text file, we would like to build the transition matrix which contains the probability of each letter given the previous letter.

We consider only the letters from the (English) alphabet, and we ignore the case of the letters. This means there are 26 letters in all (from a to z).

Since memory is of order 1, the state is the previous letter. This means we have \(N=26\) states, and we need to estimate a transition matrix of size \(26 \times 26\).

We estimate the transition matrix by counting every pair of consecutive letters, then normalizing the counts to obtain probabilities.

9.2.1. Count pairs of letters#

import numpy as np

filename = "data/texten.txt"

# Read all text from the file
with open(filename, 'r') as file:
    text = file.read()

# Convert all text to lowercase:
text = text.lower()

# Prepare 26x26 matrix full of **ones**
# Why ones and not zeros? We'll discuss this later
T = np.ones((26, 26))

# Go through all the text
for i in range(len(text) - 1):

    # Previous character and current character
    prev = text[i]
    curr = text[i + 1]

    # We only consider letters
    if prev.isalpha() and curr.isalpha():

        # Convert characters to numbers = offset from 'a'
        row = ord(prev) - ord('a')
        col = ord(curr) - ord('a')

        # Increment counter at position (row, col)
        T[row, col] += 1

Let’s look at the matrix we have now. The display will show the values row by row.

T
array([[  1.,  44.,  23.,  26.,   3.,   6.,  47.,   1.,  31.,   1.,  14.,
         54.,  21., 191.,   1.,  14.,   1.,  79.,  64., 105.,  15.,  42.,
          9.,   1.,  25.,   1.],
       [ 40.,   3.,   1.,   2.,  47.,   1.,   1.,   2.,   9.,   1.,   1.,
         15.,   1.,   1.,  17.,   1.,   1.,  39.,   3.,   1.,  33.,   1.,
          1.,   1.,  16.,   1.],
       [ 38.,   1.,   3.,   1.,  53.,   1.,   1.,  47.,  21.,   1.,  10.,
          8.,   1.,   1.,  46.,   1.,   1.,   4.,   1.,  20.,   9.,   1.,
          1.,   1.,   7.,   1.],
       [ 19.,   1.,   1.,   2.,  72.,   1.,   5.,   2.,  24.,   1.,   1.,
          4.,   1.,   3.,  37.,   1.,   1.,   9.,  17.,   1.,  17.,   2.,
          2.,   1.,   7.,   1.],
       [ 53.,   3.,  19.,  47.,  36.,   9.,   5.,   3.,  14.,   2.,   2.,
         61.,  26., 109.,   4.,   5.,   4., 216.,  87.,  37.,   7.,  29.,
          8.,  10.,  10.,   1.],
       [ 18.,   1.,   1.,   1.,  20.,  15.,   1.,   1.,  22.,   1.,   1.,
          5.,   1.,   1.,  52.,   1.,   1.,  10.,   1.,   3.,   5.,   1.,
          1.,   1.,   2.,   1.],
       [ 10.,   1.,   1.,   1.,  35.,   1.,   1.,  38.,   8.,   1.,   1.,
          2.,   2.,  12.,  77.,   1.,   1.,  14.,   7.,   1.,   8.,   1.,
          1.,   1.,   1.,   1.],
       [120.,   1.,   1.,   1., 251.,   1.,   1.,   1., 106.,   1.,   1.,
          1.,   2.,   1.,  76.,   1.,   1.,   7.,   1.,  33.,   5.,   1.,
          1.,   1.,   6.,   1.],
       [ 48.,  10.,  48.,  16.,  36.,  27.,  61.,   1.,   2.,   1.,   5.,
         29.,  40., 129.,  68.,   7.,   2.,  38., 114.,  87.,   1.,  13.,
          1.,   1.,   1.,   3.],
       [  2.,   1.,   1.,   1.,   3.,   1.,   1.,   1.,   1.,   1.,   1.,
          1.,   1.,   1.,   3.,   1.,   1.,   1.,   1.,   1.,   6.,   1.,
          1.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,  29.,   1.,   1.,   1.,  10.,   1.,   1.,
          1.,   1.,  19.,   1.,   1.,   1.,   1.,   3.,   1.,   1.,   1.,
          1.,   1.,   1.,   1.],
       [ 24.,   1.,   2.,  25.,  44.,   8.,   2.,   1.,  37.,   1.,   1.,
         97.,   3.,   1.,  57.,   2.,   1.,   4.,   7.,   4.,   4.,   3.,
          1.,   1.,  13.,   1.],
       [ 55.,   5.,   1.,   1.,  74.,   1.,   1.,   1.,  11.,   1.,   1.,
          1.,   8.,   3.,  33.,   7.,   1.,   1.,  11.,   1.,  13.,   1.,
          1.,   1.,  42.,   1.],
       [ 24.,   3.,  22., 124.,  59.,   3.,  48.,   4.,  30.,   1.,   9.,
          5.,   1.,  10.,  84.,   1.,   1.,   1.,  37.,  79.,   2.,   1.,
          1.,   1.,   7.,   1.],
       [  5.,   9.,   4.,  47.,   4.,  78.,   2.,   2.,   5.,   2.,   7.,
         22.,  43., 102.,  34.,  11.,   1., 115.,  24.,  75., 183.,  16.,
         53.,   1.,   4.,   2.],
       [ 11.,   1.,   1.,   1.,  22.,   1.,   1.,   2.,  11.,   1.,   1.,
         14.,   1.,   1.,  21.,  12.,   1.,  40.,   3.,   4.,   9.,   1.,
          1.,   1.,   4.,   1.],
       [  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,
          1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   8.,   1.,
          1.,   1.,   1.,   1.],
       [ 72.,   3.,  14.,  12., 121.,   2.,   5.,   2.,  69.,   1.,   6.,
          8.,   9.,   4.,  73.,   2.,   1.,  10.,  49.,  24.,  14.,  14.,
          1.,   1.,  13.,   1.],
       [ 16.,   2.,   9.,   6., 104.,   3.,   1.,  32.,  54.,   1.,   2.,
          4.,   4.,   2.,  35.,  16.,   2.,   1.,  36.,  90.,  25.,   1.,
          6.,   1.,   1.,   1.],
       [ 21.,   1.,   4.,   1.,  86.,   1.,   2., 314.,  86.,   1.,   1.,
         10.,   1.,   1.,  74.,   1.,   1.,  30.,  19.,  17.,   9.,   1.,
          5.,   1.,  13.,   1.],
       [  5.,   3.,  16.,   9.,   9.,   4.,  19.,   1.,   8.,   1.,  11.,
         30.,   5.,  27.,   1.,  14.,   1.,  55.,  47.,  52.,   1.,   1.,
          1.,   1.,   1.,   1.],
       [  6.,   1.,   1.,   1., 100.,   1.,   1.,   1.,  21.,   1.,   1.,
          1.,   1.,   1.,   6.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,
          1.,   1.,   2.,   1.],
       [ 15.,   1.,   1.,   1.,  34.,   2.,   1.,  55.,  58.,   1.,   1.,
          1.,   1.,  11.,  22.,   1.,   1.,   5.,   7.,   1.,   1.,   1.,
          1.,   1.,   1.,   1.],
       [  2.,   1.,   1.,   1.,   3.,   1.,   1.,   1.,   4.,   1.,   1.,
          1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   4.,   1.,   1.,
          1.,   1.,   1.,   1.],
       [  2.,   1.,   1.,   1.,  11.,   1.,   1.,   1.,   2.,   1.,   1.,
          1.,   1.,   1.,  95.,   6.,   1.,   2.,   5.,   2.,   1.,   1.,
          2.,   1.,   1.,   1.],
       [  1.,   1.,   1.,   1.,   4.,   1.,   1.,   1.,   1.,   1.,   1.,
          1.,   1.,   1.,   3.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,
          1.,   1.,   1.,   1.]])

Let’s see the counts when the first letter is a (i.e. row 0 in the matrix):

T[0,:]
array([  1.,  44.,  23.,  26.,   3.,   6.,  47.,   1.,  31.,   1.,  14.,
        54.,  21., 191.,   1.,  14.,   1.,  79.,  64., 105.,  15.,  42.,
         9.,   1.,  25.,   1.])

Let’s see the counts where the second letter is h (i.e. column number (h - a) in the matrix):

T[:,ord('h') - ord('a')]
array([  1.,   2.,  47.,   2.,   3.,   1.,  38.,   1.,   1.,   1.,   1.,
         1.,   1.,   4.,   2.,   2.,   1.,   2.,  32., 314.,   1.,   1.,
        55.,   1.,   1.,   1.])

Let’s locate the pair of letters with the highest count.

maxpos = np.argmax(T)
maxrow, maxcol = np.unravel_index(maxpos, T.shape)
maxrow, maxcol
(np.int64(19), np.int64(7))

This means the 20th row and 8th column in the matrix (Python starts numbering from 0), which correspond to the pair t (20th letter) and h (8th letter).

The most frequent pair of letters is th.
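Beyond the single maximum, we can also peek at the next most frequent pairs. A small sketch (the helper name `top_bigrams` is just for illustration), using `np.argsort` on the flattened count matrix:

```python
import numpy as np

def top_bigrams(T, k=3):
    """Return the k most frequent letter pairs from a 26x26 count matrix."""
    # Indices of the flattened matrix, sorted by descending count
    flat = np.argsort(T, axis=None)[::-1][:k]
    pairs = []
    for idx in flat:
        row, col = np.unravel_index(idx, T.shape)
        pairs.append(chr(row + ord('a')) + chr(col + ord('a')))
    return pairs

# Tiny synthetic example: a matrix where "th" has the highest count
M = np.ones((26, 26))
M[ord('t') - ord('a'), ord('h') - ord('a')] = 100
M[ord('h') - ord('a'), ord('e') - ord('a')] = 50
print(top_bigrams(M, k=2))  # → ['th', 'he']
```

Running this on the English count matrix above would show that `th` and `he` indeed dominate.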

9.2.2. Normalize the counts#

We need to turn counts into probabilities.

To do this, we divide each row in the matrix by the sum of that row, so that the row sums to 1.

T = np.array([row  / sum(row) for row in T])

Let’s look at the first row and check that its sum is 1:

print(T[0,:])
print(f"Sum of row 0 is {sum(T[0,:])}")
[0.00121951 0.05365854 0.02804878 0.03170732 0.00365854 0.00731707
 0.05731707 0.00121951 0.03780488 0.00121951 0.01707317 0.06585366
 0.02560976 0.23292683 0.00121951 0.01707317 0.00121951 0.09634146
 0.07804878 0.12804878 0.01829268 0.05121951 0.01097561 0.00121951
 0.0304878  0.00121951]
Sum of row 0 is 1.0
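The list comprehension above does the job; for reference, NumPy can perform the same row-wise normalization in one vectorized step via broadcasting:

```python
import numpy as np

counts = np.ones((26, 26))
counts[0, 1] = 9          # pretend we saw the pair "ab" nine extra times

# Divide each row by its sum; keepdims=True keeps the shape (26, 1),
# so broadcasting applies each row's divisor to that row only.
P = counts / counts.sum(axis=1, keepdims=True)

print(np.allclose(P.sum(axis=1), 1.0))  # → True: every row sums to 1
```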

9.2.3. Compute the probability of a sequence given the model#

In order to compute the probability of a sequence of letters, we multiply the transition probabilities for each pair of consecutive letters. We shall ignore the probability of the first letter (which doesn’t have a previous letter), for simplicity.

For example, for the sequence "hello", we need to multiply the following probabilities:

\[\begin{split} \begin{align*} P &= P_{he} \cdot P_{el} \cdot P_{ll} \cdot P_{lo} \\ &= P(e|h) \cdot P(l|e) \cdot P(l|l) \cdot P(o|l) \\ \end{align*} \end{split}\]

Because the probabilities are small, the result will quickly become too small to be represented by a floating point number. To avoid this, we use log-probabilities, and we add them instead of multiplying them:

\[\begin{align*} \log P = \log(P_{he}) + \log(P_{el}) + \log(P_{ll}) + \log(P_{lo}) \end{align*}\]

This works because all we want is to compare the score obtained with one language’s matrix to the score obtained with the other, and the logarithm is a monotonically increasing function, so we can compare the log-probabilities directly instead of the probabilities themselves.

text = "hello"
logprob = 0
for i in range(len(text) - 1):
    prev = text[i]
    curr = text[i + 1]
    if prev.isalpha() and curr.isalpha():
        row = ord(prev) - ord('a')
        col = ord(curr) - ord('a')
        logprob = logprob + np.log(T[row, col])
logprob
np.float64(-6.560869971477094)
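This is also the place to answer the question left open earlier: why initialize the count matrix with ones rather than zeros? This is add-one (Laplace) smoothing. A pair that never occurs in the training text would otherwise get probability 0, and \(\log 0 = -\infty\), which would wipe out the entire score of any test sentence containing that pair. A small sketch of the effect:

```python
import numpy as np

# Counts initialized to zeros: an unseen pair ends up with probability 0
zero_counts = np.zeros(4)
zero_counts[0] = 10
p_zero = zero_counts / zero_counts.sum()
with np.errstate(divide='ignore'):
    print(np.log(p_zero[1]))   # -inf: one unseen pair ruins the whole sum

# Counts initialized to ones (add-one / Laplace smoothing)
one_counts = np.ones(4)
one_counts[0] = 10
p_one = one_counts / one_counts.sum()
print(np.log(p_one[1]))        # finite, just a very small log-probability
```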

9.3. Experiments#

We will encapsulate the code in functions, so that we can reuse it conveniently.

def compute_transition_matrix(filename):
    with open(filename, 'r') as file:
        text = file.read()
    text = text.lower()
    T = np.ones((26, 26))                   ### 1's not 0's
    for i in range(len(text) - 1):
        prev = text[i]
        curr = text[i + 1]
        if prev.isalpha() and curr.isalpha():
            row = ord(prev) - ord('a')
            col = ord(curr) - ord('a')
            T[row, col] += 1
    # Normalize once, after all the pairs have been counted
    T = np.array([row / sum(row) for row in T])
    return T

def compute_logprob(text, T):
    text = text.lower()
    logprob = 0
    for i in range(len(text) - 1):
        prev = text[i]
        curr = text[i + 1]
        if prev.isalpha() and curr.isalpha():
            row = ord(prev) - ord('a')
            col = ord(curr) - ord('a')
            logprob += np.log(T[row, col])
    return logprob

def get_language(text, T_en, T_ro):
    logprob_en = compute_logprob(text, T_en)
    logprob_ro = compute_logprob(text, T_ro)
    return "English" if logprob_en > logprob_ro else "Romanian"

Let’s compute the transition matrix for English and Romanian.

Ten = compute_transition_matrix("data/texten.txt")
Tro = compute_transition_matrix("data/textro.txt")

Now let’s guess the language:

sentence = "Salut! Ce mai faci?"
get_language(sentence, Ten, Tro)
'Romanian'
sentence = "Scooby-Doo, where are you?"
get_language(sentence, Ten, Tro)
'English'
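`get_language` compares exactly two models; with more languages it is natural to keep the matrices in a dictionary and pick the one with the highest log-probability. A sketch under that assumption (the scoring function is repeated here so the example is self-contained, and the demo matrices are synthetic, not real language models):

```python
import numpy as np

def compute_logprob(text, T):
    # Same scoring as above: sum of log transition probabilities
    text = text.lower()
    logprob = 0.0
    for prev, curr in zip(text, text[1:]):
        if prev.isalpha() and curr.isalpha():
            logprob += np.log(T[ord(prev) - ord('a'), ord(curr) - ord('a')])
    return logprob

def guess_language(text, models):
    """models: dict mapping a language name to its transition matrix."""
    scores = {name: compute_logprob(text, T) for name, T in models.items()}
    return max(scores, key=scores.get)

# Synthetic demo: model A strongly favors the pair "th", model B is uniform
A = np.ones((26, 26)); A[19, 7] = 100
B = np.ones((26, 26))
A = A / A.sum(axis=1, keepdims=True)
B = B / B.sum(axis=1, keepdims=True)
print(guess_language("ththth", {"A": A, "B": B}))  # → A
```

With the matrices computed earlier, this would be called as `guess_language(sentence, {"English": Ten, "Romanian": Tro})`.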