# calculus.infotheory package

## calculus.infotheory.entropy module

filename

sppas.src.calculus.infotheory.entropy.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Entropy estimator.

class calculus.infotheory.entropy.sppasEntropy(symbols, n=1)[source]

Bases: `object`

Entropy estimation.

Entropy is a measure of unpredictability of information content. Entropy is one of several ways to measure diversity.

To examine the entropy of a long series, it can also be computed on successive windows to measure local evenness or uncertainty. From the definition, one can expect areas with a lot of variability to yield a higher entropy and areas with little variability to yield a lower entropy.
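As a rough illustration of the windowed approach, the sketch below computes a unigram Shannon entropy over successive windows in plain Python. It does not call sppasEntropy; the helper names `shannon_entropy` and `window_entropy`, the window size and the step are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical unigram distribution."""
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_entropy(symbols, size, step=1):
    """Entropy of each window of `size` symbols, moving forward by `step`."""
    return [
        shannon_entropy(symbols[i:i + size])
        for i in range(0, len(symbols) - size + 1, step)
    ]


# Hypothetical series: a constant stretch followed by a varied stretch.
series = ["a"] * 8 + ["a", "b", "c", "d"] * 2
values = window_entropy(series, size=4, step=4)
print(values)
```

The two constant windows have zero entropy, while the two windows holding four distinct symbols reach 2 bits each.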

__init__(symbols, n=1)[source]

Create a sppasEntropy instance with a list of symbols.

Parameters
• symbols – (list) a vector of symbols of any type.

• n – (int) n value for n-gram estimation. n ranges 1..MAX_NGRAM

eval()[source]

Estimate the Shannon entropy of a vector of symbols.

Shannon’s entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable).

Returns

(float) entropy value
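For reference, the standard Shannon entropy of a distribution $p$ over the observed n-grams is given below, with a small worked case. Estimating $p(x)$ by relative frequency is an assumption made for this illustration; the documentation does not detail how the class estimates the probabilities.

$$
H = -\sum_{x} p(x) \log_2 p(x)
$$

For the unigram sequence `[0, 1, 0, 1, 1, 1, 0]`, $p(0) = 3/7$ and $p(1) = 4/7$, so $H = -\left(\tfrac{3}{7}\log_2\tfrac{3}{7} + \tfrac{4}{7}\log_2\tfrac{4}{7}\right) \approx 0.985$ bits.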

set_ngram(n)[source]

Set the n value of n-grams.

Parameters

n – (int) n value for n-gram estimation. n ranges 1..8

set_symbols(symbols)[source]

Set the list of symbols.

Parameters

symbols – (list) a vector of symbols of any type.

## calculus.infotheory.kullbackleibler module

filename

sppas.src.calculus.infotheory.kullbackleibler.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

KLD estimator.

class calculus.infotheory.kullbackleibler.sppasKullbackLeibler(model=None, observations=None)

Bases: `object`

Kullback-Leibler distance estimator.

In probability theory and information theory, the Kullback–Leibler divergence (also called relative entropy) is a measure of the difference between two probability distributions P and Q. It is not symmetric in P and Q.

Specifically, the Kullback–Leibler divergence of Q from P, denoted $D_{KL}(P \| Q)$, is a measure of the information gained when one revises one’s beliefs from the prior probability distribution Q to the posterior probability distribution P.

However, the sppasKullbackLeibler class estimates the KL distance, i.e. the symmetric Kullback-Leibler divergence.

This sppasKullbackLeibler class implements the distance estimation between a model and the content of a moving window on data, as described in:

Brigitte Bigi, Renato De Mori, Marc El-Bèze, Thierry Spriet (1997). Combined models for topic spotting and topic-dependent language modeling. IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (ASRU), Edited by S. Furui, B. H. Huang and Wu Chu, IEEE Signal Processing Society Publ, NY, pages 535-542.

This KL distance can also be used to estimate the distance between documents for text categorization, as proposed in:

Brigitte Bigi (2003). Using Kullback-Leibler Distance for Text Categorization. Lecture Notes in Computer Science, Advances in Information Retrieval, ISSN 0302-9743, Fabrizio Sebastiani (Editor), Springer-Verlag (Publisher), pages 305–319, Pisa (Italy).

In this class…

A model is a dictionary with:

• key is an n-gram,

• value is a probability.

The window of observed symbols is represented as a list of n-grams.
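A symmetric KL distance between a model and a window of observations can be sketched in plain Python as follows. The form used here, the sum of (P(x) − Q(x)) · log(P(x) / Q(x)), is the sum of the two directed divergences. Whether sppasKullbackLeibler uses exactly this form, this logarithm base, or this way of applying epsilon is an assumption, and the names `observed_distribution` and `kl_distance` are hypothetical.

```python
import math
from collections import Counter

EPSILON = 1e-06  # assumed back-off probability for unseen items


def observed_distribution(observations):
    """Relative frequency of each observed item."""
    total = len(observations)
    return {item: count / total for item, count in Counter(observations).items()}


def kl_distance(model, observations, epsilon=EPSILON):
    """Symmetric KL distance: sum of (P - Q) * log(P / Q) over all items."""
    observed = observed_distribution(observations)
    distance = 0.0
    for item in set(model) | set(observed):
        p = model.get(item, epsilon)      # model probability, or back-off
        q = observed.get(item, epsilon)   # observed probability, or back-off
        distance += (p - q) * math.log(p / q)
    return distance


model = {"a": 0.5, "b": 0.5}
same = kl_distance(model, ["a", "b", "a", "b"])    # window matches the model
skewed = kl_distance(model, ["a", "a", "a", "b"])  # window is skewed towards "a"
print(same, skewed)
```

Each term is non-negative, so the distance is zero only when the two distributions agree on every item. For simplicity the sketch does not renormalize the distributions after the back-off.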

DEFAULT_EPSILON = 1e-06
__init__(model=None, observations=None)[source]

Create a sppasKullbackLeibler instance from a model and a list of observations.

Parameters
• model – (dict) a dictionary with key=item, value=probability

• observations – (list) list of observed items

eval_kld()[source]

Estimate the KL distance between a model and observations.

Returns

float value

get_epsilon()[source]

Return the epsilon value.

get_model()[source]

Return the model.

set_epsilon(eps)[source]

Set the linear back-off value for unknown observations.

The optimal value for this coefficient is the product of the size of both model and observations to estimate the KL. This value must be significantly lower than the minimum value in the model.

Parameters

eps – (float) Epsilon value.

If eps is set to 0, a default value will be assigned.

set_model(model)[source]

Set the model.

Parameters

model – (dict) Probability distribution of the model.

set_model_from_data(data)[source]

Set the model from a given set of observations.

Parameters

data – (list) List of observed items.

set_observations(observations)[source]

Set the list of observed items.

Parameters

observations – (list) The list of observed items.

## calculus.infotheory.perplexity module

filename

sppas.src.calculus.infotheory.perplexity.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Perplexity estimator.

class calculus.infotheory.perplexity.sppasPerplexity(model, ngram=1)[source]

Bases: `object`

Perplexity estimator.

Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. The perplexity of a discrete probability distribution $p$ is defined as $2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}$, where $H(p)$ is the entropy of the distribution and $x$ ranges over events.

Perplexity is commonly used to compare models on the same list of symbols; this list of symbols “represents” the problem at hand. The higher the perplexity, the better the model.

A model is represented as a distribution of probabilities: the key is the symbol and the value is its probability.

```python
>>> model = dict()
>>> model["peer"] = 0.1
>>> model["pineapple"] = 0.2
>>> model["tomato"] = 0.3
>>> model["apple"] = 0.4
>>> pp = sppasPerplexity(model)
```

The observation on which the perplexity is estimated is represented as a list:

```python
>>> observed = ['apple', 'pineapple', 'apple', 'peer']
>>> pp.eval_pp(observed)
3.61531387398
```

A higher adequacy between the model and the observed sequence implies a higher perplexity value:

```python
>>> observed = ['apple', 'pineapple', 'apple', 'tomato']
>>> pp.eval_pp(observed)
4.12106658324
```

It is possible that an observed item is not in the model. In that case, an epsilon probability is assigned to the missing symbols and the perplexity value is lower (because of a higher entropy):

```python
>>> observed = ['apple', 'grapefruit', 'apple', 'peer']
>>> pp.eval_pp(observed)
2.62034217479
```
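The values in these examples can be reproduced with a short plain-Python sketch. It assumes that the exponent is accumulated as p · log2(1/p) once per observed occurrence and that a missing symbol receives the default epsilon of 1e-06. This is inferred from the numbers shown, not taken from the library source, and `perplexity_sketch` is a hypothetical name.

```python
import math
from collections import Counter

EPSILON = 1e-06  # assumed probability for symbols missing from the model


def perplexity_sketch(model, observed, epsilon=EPSILON):
    """Return 2 ** sum(p * log2(1 / p)), with one term per observed occurrence."""
    exponent = 0.0
    for symbol, occurrences in Counter(observed).items():
        p = model.get(symbol, epsilon)
        exponent += occurrences * p * math.log2(1.0 / p)
    return 2.0 ** exponent


model = {"peer": 0.1, "pineapple": 0.2, "tomato": 0.3, "apple": 0.4}
pp_first = perplexity_sketch(model, ["apple", "pineapple", "apple", "peer"])
pp_second = perplexity_sketch(model, ["apple", "pineapple", "apple", "tomato"])
pp_missing = perplexity_sketch(model, ["apple", "grapefruit", "apple", "peer"])
print(pp_first, pp_second, pp_missing)
```

Under these assumptions the three calls give approximately 3.6153, 4.1211 and 2.6203, in line with the documented values.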

DEFAULT_EPSILON = 1e-06
__init__(model, ngram=1)[source]

Create a sppasPerplexity instance with a model.

Parameters
• model – (dict) a dictionary with key=item, value=probability

• ngram – (int) the n value, in the range 1..8

eval_pp(symbols)[source]

Estimate the perplexity of a list of symbols.

Returns

float value

set_epsilon(eps=0.0)[source]

Set a value for epsilon.

This value must be significantly lower than the minimum value in the model.

Parameters

eps – (float) new epsilon value.

If eps is set to 0, a default value will be assigned.

set_model(model)[source]

Set the probability distribution of the model.

Notice that the epsilon value is re-assigned.

Parameters

model – (dict) Dictionary with symbols as keys and probabilities as values.

set_ngram(n)[source]

Set the n value of n-grams.

Parameters

n – (int) Value ranging from 1 to MAX_NGRAM

## calculus.infotheory.utilit module

filename

sppas.src.calculus.infotheory.utilit.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Utilities for the information theory package.

calculus.infotheory.utilit.find_ngrams(symbols, ngram)[source]

Return a list of n-grams from a list of symbols.

Parameters
• symbols – (list)

• ngram – (int) n value for the ngrams

Returns

list of tuples

Example:

```python
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(find_ngrams(symbols, 2))
[(0, 1), (1, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
```
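One common way to obtain such n-grams in plain Python is to zip the list with shifted copies of itself. The sketch below is an illustration of that idea, not the library's implementation, and `ngrams_sketch` is a hypothetical name.

```python
def ngrams_sketch(symbols, n):
    """Return the list of n-grams (as tuples) found in a list of symbols."""
    # Zip the list with copies of itself shifted by 1, 2, ..., n-1 positions;
    # zip stops at the shortest copy, so incomplete n-grams are dropped.
    return list(zip(*(symbols[i:] for i in range(n))))


bigrams = ngrams_sketch([0, 1, 0, 1, 1, 1, 0], 2)
print(bigrams)
```

A list of length L yields L - n + 1 n-grams, so the seven symbols above give six bigrams.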

calculus.infotheory.utilit.log2(x)[source]

Parameters

x – (int, float) value

Returns

(float)

calculus.infotheory.utilit.symbols_to_items(symbols, ngram)[source]

Convert a list of symbols into a dictionary of items.

Example:

```python
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(symbols_to_items(symbols, 2))
{(0, 1): 2, (1, 0): 2, (1, 1): 2}
```

Returns

dictionary with key=tuple of symbols, value=number of occurrences
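The counting step can be sketched with `collections.Counter` applied to the n-grams of the list. This is an illustrative equivalent under that assumption, not the library's code, and `items_sketch` is a hypothetical name.

```python
from collections import Counter


def items_sketch(symbols, n):
    """Map each n-gram (tuple of symbols) to its number of occurrences."""
    ngrams = zip(*(symbols[i:] for i in range(n)))
    return dict(Counter(ngrams))


items = items_sketch([0, 1, 0, 1, 1, 1, 0], 2)
print(items)
```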

## Module contents

filename

sppas.src.calculus.infotheory.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Package for information theory calculus.

Information theory is a scientific field that started with Claude Shannon’s 1948 paper “A Mathematical Theory of Communication”, published in the Bell System Technical Journal. The paper introduces several major concepts, including:

1. every communication channel has a speed limit, measured in binary digits per second;

2. the architecture and design of communication systems;

3. source coding, i.e. the efficiency of the data representation (removing redundancy in the information to make the message smaller).

calculus.infotheory.find_ngrams(symbols, ngram)[source]

Return a list of n-grams from a list of symbols.

Parameters
• symbols – (list)

• ngram – (int) n value for the ngrams

Returns

list of tuples

Example:

```python
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(find_ngrams(symbols, 2))
[(0, 1), (1, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
```

class calculus.infotheory.sppasEntropy(symbols, n=1)[source]

Bases: `object`

Entropy estimation.

Entropy is a measure of unpredictability of information content. Entropy is one of several ways to measure diversity.

To examine the entropy of a long series, it can also be computed on successive windows to measure local evenness or uncertainty. From the definition, one can expect areas with a lot of variability to yield a higher entropy and areas with little variability to yield a lower entropy.

__init__(symbols, n=1)[source]

Create a sppasEntropy instance with a list of symbols.

Parameters
• symbols – (list) a vector of symbols of any type.

• n – (int) n value for n-gram estimation. n ranges 1..MAX_NGRAM

eval()[source]

Estimate the Shannon entropy of a vector of symbols.

Shannon’s entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable).

Returns

(float) entropy value

set_ngram(n)[source]

Set the n value of n-grams.

Parameters

n – (int) n value for n-gram estimation. n ranges 1..8

set_symbols(symbols)[source]

Set the list of symbols.

Parameters

symbols – (list) a vector of symbols of any type.

class calculus.infotheory.sppasKullbackLeibler(model=None, observations=None)

Bases: `object`

Kullback-Leibler distance estimator.

In probability theory and information theory, the Kullback–Leibler divergence (also called relative entropy) is a measure of the difference between two probability distributions P and Q. It is not symmetric in P and Q.

Specifically, the Kullback–Leibler divergence of Q from P, denoted $D_{KL}(P \| Q)$, is a measure of the information gained when one revises one’s beliefs from the prior probability distribution Q to the posterior probability distribution P.

However, the sppasKullbackLeibler class estimates the KL distance, i.e. the symmetric Kullback-Leibler divergence.

This sppasKullbackLeibler class implements the distance estimation between a model and the content of a moving window on data, as described in:

Brigitte Bigi, Renato De Mori, Marc El-Bèze, Thierry Spriet (1997). Combined models for topic spotting and topic-dependent language modeling. IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (ASRU), Edited by S. Furui, B. H. Huang and Wu Chu, IEEE Signal Processing Society Publ, NY, pages 535-542.

This KL distance can also be used to estimate the distance between documents for text categorization, as proposed in:

Brigitte Bigi (2003). Using Kullback-Leibler Distance for Text Categorization. Lecture Notes in Computer Science, Advances in Information Retrieval, ISSN 0302-9743, Fabrizio Sebastiani (Editor), Springer-Verlag (Publisher), pages 305–319, Pisa (Italy).

In this class…

A model is a dictionary with:

• key is an n-gram,

• value is a probability.

The window of observed symbols is represented as a list of n-grams.

DEFAULT_EPSILON = 1e-06
__init__(model=None, observations=None)[source]

Create a sppasKullbackLeibler instance from a model and a list of observations.

Parameters
• model – (dict) a dictionary with key=item, value=probability

• observations – (list) list of observed items

eval_kld()[source]

Estimate the KL distance between a model and observations.

Returns

float value

get_epsilon()[source]

Return the epsilon value.

get_model()[source]

Return the model.

set_epsilon(eps)[source]

Set the linear back-off value for unknown observations.

The optimal value for this coefficient is the product of the size of both model and observations to estimate the KL. This value must be significantly lower than the minimum value in the model.

Parameters

eps – (float) Epsilon value.

If eps is set to 0, a default value will be assigned.

set_model(model)[source]

Set the model.

Parameters

model – (dict) Probability distribution of the model.

set_model_from_data(data)[source]

Set the model from a given set of observations.

Parameters

data – (list) List of observed items.

set_observations(observations)[source]

Set the list of observed items.

Parameters

observations – (list) The list of observed items.

class calculus.infotheory.sppasPerplexity(model, ngram=1)[source]

Bases: `object`

Perplexity estimator.

Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. The perplexity of a discrete probability distribution $p$ is defined as $2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}$, where $H(p)$ is the entropy of the distribution and $x$ ranges over events.

Perplexity is commonly used to compare models on the same list of symbols; this list of symbols “represents” the problem at hand. The higher the perplexity, the better the model.

A model is represented as a distribution of probabilities: the key is the symbol and the value is its probability.

```python
>>> model = dict()
>>> model["peer"] = 0.1
>>> model["pineapple"] = 0.2
>>> model["tomato"] = 0.3
>>> model["apple"] = 0.4
>>> pp = sppasPerplexity(model)
```

The observation on which the perplexity is estimated is represented as a list:

```python
>>> observed = ['apple', 'pineapple', 'apple', 'peer']
>>> pp.eval_pp(observed)
3.61531387398
```

A higher adequacy between the model and the observed sequence implies a higher perplexity value:

```python
>>> observed = ['apple', 'pineapple', 'apple', 'tomato']
>>> pp.eval_pp(observed)
4.12106658324
```

It is possible that an observed item is not in the model. In that case, an epsilon probability is assigned to the missing symbols and the perplexity value is lower (because of a higher entropy):

```python
>>> observed = ['apple', 'grapefruit', 'apple', 'peer']
>>> pp.eval_pp(observed)
2.62034217479
```

DEFAULT_EPSILON = 1e-06
__init__(model, ngram=1)[source]

Create a sppasPerplexity instance with a model.

Parameters
• model – (dict) a dictionary with key=item, value=probability

• ngram – (int) the n value, in the range 1..8

eval_pp(symbols)[source]

Estimate the perplexity of a list of symbols.

Returns

float value

set_epsilon(eps=0.0)[source]

Set a value for epsilon.

This value must be significantly lower than the minimum value in the model.

Parameters

eps – (float) new epsilon value.

If eps is set to 0, a default value will be assigned.

set_model(model)[source]

Set the probability distribution of the model.

Notice that the epsilon value is re-assigned.

Parameters

model – (dict) Dictionary with symbols as keys and probabilities as values.

set_ngram(n)[source]

Set the n value of n-grams.

Parameters

n – (int) Value ranging from 1 to MAX_NGRAM

calculus.infotheory.symbols_to_items(symbols, ngram)[source]

Convert a list of symbols into a dictionary of items.

Example:

```python
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(symbols_to_items(symbols, 2))
{(0, 1): 2, (1, 0): 2, (1, 1): 2}
```

Returns

dictionary with key=tuple of symbols, value=number of occurrences