calculus.infotheory package¶
Submodules¶
calculus.infotheory.entropy module¶
- filename
sppas.src.calculus.infotheory.entropy.py
- author
Brigitte Bigi
- contact
- summary
Entropy estimator.
- class calculus.infotheory.entropy.sppasEntropy(symbols, n=1)[source]¶
Bases:
object
Entropy estimation.
Entropy is a measure of the unpredictability of an information content. It is also one of several ways to measure diversity.
When working on a long series of symbols, the entropy can also be computed on moving windows in order to measure local evenness or uncertainty: from the definition, areas with a lot of variance result in a higher entropy, and areas with lower variance result in a lower entropy.
- __init__(symbols, n=1)[source]¶
Create a sppasEntropy instance with a list of symbols.
- Parameters
symbols – (list) a vector of symbols of any type.
n – (int) n value for n-gram estimation. n ranges 1..MAX_NGRAM
- eval()[source]¶
Estimate the Shannon entropy of a vector of symbols.
Shannon’s entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable).
- Returns
(float) entropy value
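For illustration, a minimal plain-Python sketch of the unigram case (n=1): the entropy is estimated from the relative frequencies of the symbols. The class also handles n-grams, so its results may differ for n > 1.
>>> from math import log2
>>> from collections import Counter
>>> symbols = ['a', 'b', 'a', 'c', 'a', 'b']
>>> counts = Counter(symbols)
>>> total = len(symbols)
>>> round(-sum((c / total) * log2(c / total) for c in counts.values()), 3)
1.459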
calculus.infotheory.kullbackleibler module¶
- filename
sppas.src.calculus.infotheory.kullbackleibler.py
- author
Brigitte Bigi
- contact
- summary
KLD estimator.
- class calculus.infotheory.kullbackleibler.sppasKullbackLeibler(model=None, observations=None)[source]¶
Bases:
object
Kullback-Leibler distance estimator.
In probability theory and information theory, the Kullback–Leibler divergence (also called relative entropy) is a measure of the difference between two probability distributions P and Q. It is not symmetric in P and Q.
Specifically, the Kullback–Leibler divergence of Q from P, denoted DKL(P‖Q), is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P.
However, the sppasKullbackLeibler class estimates the KL distance, i.e. the symmetric Kullback-Leibler divergence.
This sppasKullbackLeibler class implements the distance estimation between a model and the content of a moving window on data, as described in:
Brigitte Bigi, Renato De Mori, Marc El-Bèze, Thierry Spriet (1997). Combined models for topic spotting and topic-dependent language modeling. In IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (ASRU), edited by S. Furui, B. H. Huang and Wu Chu, IEEE Signal Processing Society Publ., NY, pages 535-542.
This KL distance can also be used to estimate the distance between documents for text categorization, as proposed in:
Brigitte Bigi (2003). Using Kullback-Leibler Distance for Text Categorization. Lecture Notes in Computer Science, Advances in Information Retrieval, ISSN 0302-9743, Fabrizio Sebastiani (Editor), Springer-Verlag (Publisher), pages 305–319, Pisa (Italy).
In this class:
A model is a dictionary in which:
- the key is an n-gram,
- the value is its probability.
The window of observed symbols is represented as a list of n-grams (see the sketch below).
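To make these data structures concrete, here is a minimal plain-Python sketch, assuming a unigram model, of a symmetric Kullback-Leibler distance between a model and the distribution observed in a window, with an epsilon back-off for items missing on either side; the actual estimation performed by sppasKullbackLeibler may differ in its details.
>>> from math import log2
>>> from collections import Counter
>>> model = {('a',): 0.5, ('b',): 0.3, ('c',): 0.2}            # key = n-gram, value = probability
>>> window = [('a',), ('a',), ('b',), ('a',), ('c',), ('b',)]  # observed n-grams
>>> observed = {k: v / len(window) for k, v in Counter(window).items()}
>>> def sym_kl(p, q, eps=1e-6):
...     keys = set(p) | set(q)
...     kl_pq = sum(p.get(k, eps) * log2(p.get(k, eps) / q.get(k, eps)) for k in keys)
...     kl_qp = sum(q.get(k, eps) * log2(q.get(k, eps) / p.get(k, eps)) for k in keys)
...     return kl_pq + kl_qp
>>> round(sym_kl(model, observed), 3)
0.014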
- DEFAULT_EPSILON = 1e-06¶
- __init__(model=None, observations=None)[source]¶
Create a sppasKullbackLeibler instance from a model and/or a list of observed symbols.
- Parameters
model – (dict) a dictionary with key=item, value=probability
observations – (list) observed items
- set_epsilon(eps)[source]¶
Fix the linear back-off value for unknown observations.
The optimal value for this coefficient is the product of the sizes of both the model and the observations used to estimate the KL distance. It must be significantly lower than the minimum probability in the model.
- Parameters
eps – (float) Epsilon value.
If eps is set to 0, a default value will be assigned.
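For illustration only, a hypothetical way of choosing this coefficient, assuming the import path implied by the filename above; the default assigned by the class when eps is 0 may be chosen differently.
>>> from sppas.src.calculus.infotheory.kullbackleibler import sppasKullbackLeibler
>>> model = {('a',): 0.5, ('b',): 0.3, ('c',): 0.2}
>>> kl = sppasKullbackLeibler(model=model)
>>> kl.set_epsilon(min(model.values()) / 1000.)  # well below the smallest model probability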
- set_model(model)[source]¶
Set the model.
- Parameters
model – (dict) Probability distribution of the model.
calculus.infotheory.perplexity module¶
- filename
sppas.src.calculus.infotheory.perplexity.py
- author
Brigitte Bigi
- contact
- summary
Perplexity estimator.
- class calculus.infotheory.perplexity.sppasPerplexity(model, ngram=1)[source]¶
Bases:
object
Perplexity estimator.
Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. The perplexity of a discrete probability distribution p is defined as 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}, where H(p) is the entropy of the distribution and x ranges over the events.
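To make the definition concrete, here is a small worked example that applies the formula directly to a toy distribution (plain Python, not a call into the class):
>>> from math import log2
>>> p = {"apple": 0.4, "tomato": 0.3, "pineapple": 0.2, "peer": 0.1}
>>> entropy = -sum(px * log2(px) for px in p.values())
>>> round(2 ** entropy, 3)
3.596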
Perplexity is commonly used to compare models on the same list of symbols; this list of symbols represents the problem at hand. The higher the perplexity, the better the model.
A model is represented as a distribution of probabilities: the key represents the symbol and the value is its probability.
>>> model = dict()
>>> model["peer"] = 0.1
>>> model["pineapple"] = 0.2
>>> model["tomato"] = 0.3
>>> model["apple"] = 0.4
>>> pp = sppasPerplexity(model)
The observation on which the perplexity must be estimated is represented as a list:
>>> observed = ['apple', 'pineapple', 'apple', 'peer']
>>> pp.eval_perplexity(observed)
3.61531387398
A higher adequacy between the model and the observed sequence implies a higher perplexity value:
>>> observed = ['apple', 'pineapple', 'apple', 'tomato']
>>> pp.eval_perplexity(observed)
4.12106658324
An observed item may be missing from the model. In that case, an epsilon probability is assigned to the missing symbol and the perplexity value is lower (because of a higher entropy):
>>> observed = ['apple', 'grapefruit', 'apple', 'peer']
>>> pp.eval_perplexity(observed)
2.62034217479
- DEFAULT_EPSILON = 1e-06¶
- __init__(model, ngram=1)[source]¶
Create a sppasPerplexity instance from a probability model.
- Parameters
model – (dict) a dictionary with key=item, value=probability
ngram – (int) the n value, in the range 1..8
- set_epsilon(eps=0.0)[source]¶
Set a value for epsilon.
This value must be significantly lower than the minimum value in the model.
- Parameters
eps – (float) new epsilon value.
If eps is set to 0, a default value will be assigned.
calculus.infotheory.utilit module¶
- filename
sppas.src.calculus.infotheory.utilit.py
- author
Brigitte Bigi
- contact
- summary
Utilities for the information theory package.
- calculus.infotheory.utilit.find_ngrams(symbols, ngram)[source]¶
Return a list of n-grams from a list of symbols.
- Parameters
symbols – (list)
ngram – (int) n value for the ngrams
- Returns
list of tuples
Example:
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(find_ngrams(symbols, 2))
[(0, 1), (1, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
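For illustration, one common way of building such n-gram tuples is a zip over shifted copies of the list; the module's own implementation may differ.
>>> def find_ngrams_sketch(symbols, n):
...     return list(zip(*[symbols[i:] for i in range(n)]))
>>> find_ngrams_sketch([0, 1, 0, 1, 1, 1, 0], 2)
[(0, 1), (1, 0), (0, 1), (1, 1), (1, 1), (1, 0)]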
- calculus.infotheory.utilit.log2(x)[source]¶
Estimate log in base 2.
- Parameters
x – (int, float) value
- Returns
(float)
- calculus.infotheory.utilit.symbols_to_items(symbols, ngram)[source]¶
Convert a list of symbols into a dictionary of items.
Example:
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(symbols_to_items(symbols, 2))
{(0, 1): 2, (1, 0): 2, (1, 1): 2}
- Returns
dictionary with key=tuple of symbols, value=number of occurrences
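For illustration, the same result can be obtained by counting n-gram tuples with collections.Counter; the module's own implementation may differ.
>>> from collections import Counter
>>> def symbols_to_items_sketch(symbols, ngram):
...     return dict(Counter(zip(*[symbols[i:] for i in range(ngram)])))
>>> symbols_to_items_sketch([0, 1, 0, 1, 1, 1, 0], 2)
{(0, 1): 2, (1, 0): 2, (1, 1): 2}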
Module contents¶
- filename
sppas.src.calculus.infotheory.py
- author
Brigitte Bigi
- contact
- summary
Package for information theory calculus.
Information Theory is a scientific field that started with Claude Shannon’s 1948 paper “A Mathematical Theory of Communication”, published in the Bell System Technical Journal. There are several major concepts in this paper, including:
1. every communication channel has a speed limit, measured in binary digits per second;
2. the architecture and design of communication systems;
3. source coding, i.e. the efficiency of the data representation (remove redundancy in the information to make the message smaller).
- calculus.infotheory.find_ngrams(symbols, ngram)[source]¶
Return a list of n-grams from a list of symbols.
- Parameters
symbols – (list)
ngram – (int) n value for the ngrams
- Returns
list of tuples
Example:
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(find_ngrams(symbols, 2))
[(0, 1), (1, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
- class calculus.infotheory.sppasEntropy(symbols, n=1)[source]¶
Bases:
object
Entropy estimation.
Entropy is a measure of the unpredictability of an information content. It is also one of several ways to measure diversity.
When working on a long series of symbols, the entropy can also be computed on moving windows in order to measure local evenness or uncertainty: from the definition, areas with a lot of variance result in a higher entropy, and areas with lower variance result in a lower entropy.
- __init__(symbols, n=1)[source]¶
Create a sppasEntropy instance with a list of symbols.
- Parameters
symbols – (list) a vector of symbols of any type.
n – (int) n value for n-gram estimation. n ranges 1..MAX_NGRAM
- eval()[source]¶
Estimate the Shannon entropy of a vector of symbols.
Shannon’s entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable).
- Returns
(float) entropy value
- class calculus.infotheory.sppasKullbackLeibler(model=None, observations=None)[source]¶
Bases:
object
Kullback-Leibler distance estimator.
In probability theory and information theory, the Kullback–Leibler divergence (also called relative entropy) is a measure of the difference between two probability distributions P and Q. It is not symmetric in P and Q.
Specifically, the Kullback–Leibler divergence of Q from P, denoted DKL(P‖Q), is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P.
However, the sppasKullbackLeibler class estimates the KL distance, i.e. the symmetric Kullback-Leibler divergence.
This sppasKullbackLeibler class implements the distance estimation between a model and the content of a moving window on data, as described in:
Brigitte Bigi, Renato De Mori, Marc El-Bèze, Thierry Spriet (1997). Combined models for topic spotting and topic-dependent language modeling. In IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (ASRU), edited by S. Furui, B. H. Huang and Wu Chu, IEEE Signal Processing Society Publ., NY, pages 535-542.
This KL distance can also be used to estimate the distance between documents for text categorization, as proposed in:
Brigitte Bigi (2003). Using Kullback-Leibler Distance for Text Categorization. Lecture Notes in Computer Science, Advances in Information Retrieval, ISSN 0302-9743, Fabrizio Sebastiani (Editor), Springer-Verlag (Publisher), pages 305–319, Pisa (Italy).
In this class:
A model is a dictionary in which:
- the key is an n-gram,
- the value is its probability.
The window of observed symbols is represented as a list of n-grams.
- DEFAULT_EPSILON = 1e-06¶
- __init__(model=None, observations=None)[source]¶
Create a sppasKullbackLeibler instance from a model and/or a list of observed symbols.
- Parameters
model – (dict) a dictionary with key=item, value=probability
observations – (list) observed items
- set_epsilon(eps)[source]¶
Fix the linear back-off value for unknown observations.
The optimal value for this coefficient is the product of the sizes of both the model and the observations used to estimate the KL distance. It must be significantly lower than the minimum probability in the model.
- Parameters
eps – (float) Epsilon value.
If eps is set to 0, a default value will be assigned.
- set_model(model)[source]¶
Set the model.
- Parameters
model – (dict) Probability distribution of the model.
- class calculus.infotheory.sppasPerplexity(model, ngram=1)[source]¶
Bases:
object
Perplexity estimator.
Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. The perplexity of a discrete probability distribution p is defined as 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)}, where H(p) is the entropy of the distribution and x ranges over the events.
Perplexity is commonly used to compare models on the same list of symbols; this list of symbols represents the problem at hand. The higher the perplexity, the better the model.
A model is represented as a distribution of probabilities: the key represents the symbol and the value is its probability.
>>> model = dict()
>>> model["peer"] = 0.1
>>> model["pineapple"] = 0.2
>>> model["tomato"] = 0.3
>>> model["apple"] = 0.4
>>> pp = sppasPerplexity(model)
The observation on which the perplexity must be estimated is represented as a list:
>>> observed = ['apple', 'pineapple', 'apple', 'peer']
>>> pp.eval_perplexity(observed)
3.61531387398
A higher adequacy between the model and the observed sequence implies a higher perplexity value:
>>> observed = ['apple', 'pineapple', 'apple', 'tomato']
>>> pp.eval_perplexity(observed)
4.12106658324
An observed item may be missing from the model. In that case, an epsilon probability is assigned to the missing symbol and the perplexity value is lower (because of a higher entropy):
>>> observed = ['apple', 'grapefruit', 'apple', 'peer']
>>> pp.eval_perplexity(observed)
2.62034217479
- DEFAULT_EPSILON = 1e-06¶
- __init__(model, ngram=1)[source]¶
Create a sppasPerplexity instance from a probability model.
- Parameters
model – (dict) a dictionary with key=item, value=probability
ngram – (int) the n value, in the range 1..8
- set_epsilon(eps=0.0)[source]¶
Set a value for epsilon.
This value must be significantly lower than the minimum value in the model.
- Parameters
eps – (float) new epsilon value.
If eps is set to 0, a default value will be assigned.
- calculus.infotheory.symbols_to_items(symbols, ngram)[source]¶
Convert a list of symbols into a dictionary of items.
Example:
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(symbols_to_items(symbols, 2))
{(0, 1): 2, (1, 0): 2, (1, 1): 2}
- Returns
dictionary with key=tuple of symbols, value=number of occurrences