annotations.Align.models.slm package¶
Submodules¶
annotations.Align.models.slm.arpaio module¶
- filename
sppas.src.annotations.Align.models.slm.arpaio.py
- author
Brigitte Bigi
- contact
- summary
I/O for ARPA models.
- class annotations.Align.models.slm.arpaio.sppasArpaIO[source]¶
Bases:
object
ARPA statistical language models reader/writer.
This class is able to load statistical language models from ARPA-ASCII files and to save them to that format.
- load(filename)[source]¶
Load a model from an ARPA file.
- Parameters
filename – (str) Name of the file of the model.
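As a rough illustration of the ARPA-ASCII layout that sppasArpaIO reads, the sketch below parses only the \data\ header of a model, which lists the number of n-grams of each order. The read_arpa_counts helper is hypothetical, independent of SPPAS, and only covers this header, not the probability sections.

```python
import re

def read_arpa_counts(text):
    """Parse the \\data\\ header of an ARPA-ASCII model.

    Return a dict {order: number of n-grams of that order}.
    """
    counts = {}
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if line == "\\data\\":
            in_data = True
            continue
        if in_data:
            m = re.match(r"ngram (\d+)=(\d+)", line)
            if m:
                counts[int(m.group(1))] = int(m.group(2))
            elif line:
                # First non-empty, non-"ngram" line (e.g. \1-grams:) ends the header.
                break
    return counts

arpa = """\\data\\
ngram 1=4
ngram 2=6

\\1-grams:
-1.0 <s>
"""
print(read_arpa_counts(arpa))   # {1: 4, 2: 6}
```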
annotations.Align.models.slm.ngramsmodel module¶
- filename
sppas.src.annotations.Align.models.slm.ngramsmodel.py
- author
Brigitte Bigi
- contact
- summary
Data structure for n-grams models and training.
- class annotations.Align.models.slm.ngramsmodel.sppasNgramCounter(n=1, wordslist=None)[source]¶
Bases:
object
N-gram representation.
- __init__(n=1, wordslist=None)[source]¶
Create a sppasNgramCounter instance.
- Parameters
n – (int) n-gram order, between 1 and MAX_ORDER.
wordslist – (sppasVocabulary) a list of accepted tokens.
- append_sentence(sentence)[source]¶
Append a sentence in a dictionary of data counts.
- Parameters
sentence – (str) A sentence with tokens separated by whitespace.
- count(*datafiles)[source]¶
Count ngrams of order n from data files.
- Parameters
datafiles – (*args) is a set of file names, with UTF-8 encoding.
If the file contains more than one tier, only the first one is used.
- get_count(sequence)[source]¶
Get the count of a specific sequence.
- Parameters
sequence – (str) tokens separated by whitespace.
- Returns
(int)
- get_ncount()[source]¶
Get the number of observed n-grams.
Start symbols are excluded when counting unigrams.
- Returns
(int)
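Independently of SPPAS, the counting scheme behind append_sentence and get_count can be sketched in a few lines. The NgramCounter class below and its start/end padding symbols are assumptions for illustration, not the sppasNgramCounter implementation.

```python
from collections import defaultdict

class NgramCounter:
    """Minimal sketch of an n-gram counter (hypothetical, not sppasNgramCounter)."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(int)

    def append_sentence(self, sentence):
        # Pad with start/end symbols, as n-gram trainers conventionally do.
        tokens = ["<s>"] * (self.n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - self.n + 1):
            self.counts[" ".join(tokens[i:i + self.n])] += 1

    def get_count(self, sequence):
        # defaultdict returns 0 for unseen sequences.
        return self.counts[sequence]

c = NgramCounter(2)
c.append_sentence("a b a")
print(c.get_count("a b"))   # 1
print(c.get_count("b a"))   # 1
```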
- class annotations.Align.models.slm.ngramsmodel.sppasNgramsModel(norder=1)[source]¶
Bases:
object
Statistical language model trainer.
A model is made of:
n-gram counts: a list of sppasNgramCounter instances.
n-gram probabilities.
How to estimate n-gram probabilities?
A slight bit of theory… The following is copied (cribbed!) from the following SRILM web page: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
- a_z
An N-gram where a is the first word, z is the last word, and “_” represents 0 or more words in between.
- c(a_z)
The count of N-gram a_z in the training data.
- p(a_z)
The estimated conditional probability of the nth word z given the first n-1 words (a_) of an N-gram.
- a_
The n-1 word prefix of the N-gram a_z.
- _z
The n-1 word suffix of the N-gram a_z.
N-gram models try to estimate the probability of a word z in the context of the previous n-1 words (a_). One way to estimate p(a_z) is to look at the number of times word z has followed the previous n-1 words (a_):
p(a_z) = c(a_z)/c(a_)
This is known as the maximum likelihood (ML) estimate. Notice that it assigns zero probability to N-grams that have not been observed in the training data.
To avoid the zero probabilities, we take some probability mass from the observed N-grams and distribute it to unobserved N-grams. Such redistribution is known as smoothing or discounting. Most existing smoothing algorithms can be described by the following equation:
p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)
If the N-gram a_z has been observed in the training data, we use the distribution f(a_z). Typically f(a_z) is discounted to be less than the ML estimate so we have some leftover probability for the z words unseen in the context (a_). Different algorithms mainly differ on how they discount the ML estimate to get f(a_z).
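The ML estimate p(a_z) = c(a_z)/c(a_) can be checked with a small sketch over bigram counts; ml_estimates is a hypothetical helper, not part of SPPAS. Note that any unseen bigram is simply absent, i.e. has zero probability, which is exactly what smoothing corrects.

```python
from collections import defaultdict

def ml_estimates(bigram_counts):
    """Maximum-likelihood p(z|a) = c(a z) / c(a _) from raw bigram counts."""
    prefix_totals = defaultdict(int)
    for bigram, c in bigram_counts.items():
        prefix_totals[bigram.split()[0]] += c
    return {bg: c / prefix_totals[bg.split()[0]]
            for bg, c in bigram_counts.items()}

counts = {"a b": 3, "a c": 1, "b a": 2}
print(ml_estimates(counts))   # {'a b': 0.75, 'a c': 0.25, 'b a': 1.0}
```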
- Example
>>> # create a 3-gram model
>>> model = sppasNgramsModel(3)
>>> # count n-grams from data
>>> model.count(*corpusfiles)
>>> # estimates probas
>>> probas = model.probabilities(method="logml")
Methods to estimate the probabilities:
raw: return counts instead of probabilities
lograw: idem with log values
ml: return maximum likelihood (un-smoothed probabilities)
logml: idem with log values
- __init__(norder=1)[source]¶
Create a sppasNgramsModel instance.
- Parameters
norder – (int) n-gram order, between 1 and MAX_ORDER.
- append_sentences(sentences)[source]¶
Append a list of sentences in data counts.
- Parameters
sentences – (list) sentences with tokens separated by whitespace.
- count(*datafiles)[source]¶
Count ngrams from data files.
- Parameters
datafiles – (*args) is a set of file names, with UTF-8 encoding.
If the file contains more than one tier, only the first one is used.
- probabilities(method='lograw')[source]¶
Return a list of probabilities.
- Parameters
method – (str) method to estimate probabilities
- Returns
list of n-gram probabilities.
- Example
>>> probas = probabilities("logml")
>>> for t in probas[0]:
...     print(t)
('</s>', -1.066946789630613, None)
('<s>', -99.0, None)
(u'a', -0.3679767852945944, None)
(u'b', -0.5440680443502756, None)
(u'c', -0.9420080530223132, None)
(u'd', -1.066946789630613, None)
- set_end_symbol(symbol)[source]¶
Set the end sentence symbol.
- Parameters
symbol – (str) String to represent the end of a sentence.
- set_min_count(value)[source]¶
Fix a minimum count value, applied only to the maximum order.
Any observed n-gram with a count under the value is removed.
- Parameters
value – (int) Threshold for minimum count
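The pruning effect described above can be sketched as a plain dictionary filter; prune is a hypothetical helper, not the sppasNgramsModel implementation.

```python
def prune(counts, min_count):
    """Drop any n-gram observed fewer than min_count times."""
    return {ngram: c for ngram, c in counts.items() if c >= min_count}

print(prune({"a b": 5, "b c": 1}, 2))   # {'a b': 5}
```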
annotations.Align.models.slm.statlangmodel module¶
- filename
sppas.src.annotations.Align.models.slm.statlangmodel.py
- author
Brigitte Bigi
- contact
- summary
Statistical language model representation and use.
- class annotations.Align.models.slm.statlangmodel.sppasSLM[source]¶
Bases:
object
Statistical language model representation.
- interpolate(other)[source]¶
Interpolate the model with another one.
An N-Gram language model can be constructed from a linear interpolation of several models. In this case, the overall likelihood P(w|h) of a word w occurring after the history h is computed as the arithmetic average of P(w|h) for each of the models.
The default interpolation method is linear interpolation. In addition, log-linear interpolation of models is possible.
- Parameters
other – (sppasSLM)
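Assuming the two models expose their probabilities P(w|h) for a shared history h as plain dictionaries (an assumption for illustration, not the sppasSLM API), the equal-weight linear interpolation described above is just an arithmetic average:

```python
def interpolate(p1, p2):
    """Equal-weight linear interpolation of two word distributions."""
    words = set(p1) | set(p2)
    # A word missing from one model contributes probability 0 for that model.
    return {w: (p1.get(w, 0.0) + p2.get(w, 0.0)) / 2.0 for w in words}

m1 = {"a": 0.75, "b": 0.25}
m2 = {"a": 0.25, "b": 0.75}
mix = interpolate(m1, m2)
print(mix["a"], mix["b"])   # 0.5 0.5
```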
- load_from_arpa(filename)[source]¶
Load the model from an ARPA-ASCII file.
- Parameters
filename – (str) Filename from which to read the model.
Module contents¶
- filename
sppas.src.annotations.Align.models.slm.__init__.py
- author
Brigitte Bigi
- contact
- summary
slm is a package to manage Statistical Language Models.
- class annotations.Align.models.slm.sppasArpaIO[source]¶
Bases:
object
ARPA statistical language models reader/writer.
This class is able to load statistical language models from ARPA-ASCII files and to save them to that format.
- load(filename)[source]¶
Load a model from an ARPA file.
- Parameters
filename – (str) Name of the file of the model.
- class annotations.Align.models.slm.sppasNgramCounter(n=1, wordslist=None)[source]¶
Bases:
object
N-gram representation.
- __init__(n=1, wordslist=None)[source]¶
Create a sppasNgramCounter instance.
- Parameters
n – (int) n-gram order, between 1 and MAX_ORDER.
wordslist – (sppasVocabulary) a list of accepted tokens.
- append_sentence(sentence)[source]¶
Append a sentence in a dictionary of data counts.
- Parameters
sentence – (str) A sentence with tokens separated by whitespace.
- count(*datafiles)[source]¶
Count ngrams of order n from data files.
- Parameters
datafiles – (*args) is a set of file names, with UTF-8 encoding.
If the file contains more than one tier, only the first one is used.
- get_count(sequence)[source]¶
Get the count of a specific sequence.
- Parameters
sequence – (str) tokens separated by whitespace.
- Returns
(int)
- get_ncount()[source]¶
Get the number of observed n-grams.
Start symbols are excluded when counting unigrams.
- Returns
(int)
- class annotations.Align.models.slm.sppasNgramsModel(norder=1)[source]¶
Bases:
object
Statistical language model trainer.
A model is made of:
n-gram counts: a list of sppasNgramCounter instances.
n-gram probabilities.
How to estimate n-gram probabilities?
A slight bit of theory… The following is copied (cribbed!) from the following SRILM web page: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
- a_z
An N-gram where a is the first word, z is the last word, and “_” represents 0 or more words in between.
- c(a_z)
The count of N-gram a_z in the training data.
- p(a_z)
The estimated conditional probability of the nth word z given the first n-1 words (a_) of an N-gram.
- a_
The n-1 word prefix of the N-gram a_z.
- _z
The n-1 word suffix of the N-gram a_z.
N-gram models try to estimate the probability of a word z in the context of the previous n-1 words (a_). One way to estimate p(a_z) is to look at the number of times word z has followed the previous n-1 words (a_):
p(a_z) = c(a_z)/c(a_)
This is known as the maximum likelihood (ML) estimate. Notice that it assigns zero probability to N-grams that have not been observed in the training data.
To avoid the zero probabilities, we take some probability mass from the observed N-grams and distribute it to unobserved N-grams. Such redistribution is known as smoothing or discounting. Most existing smoothing algorithms can be described by the following equation:
p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)
If the N-gram a_z has been observed in the training data, we use the distribution f(a_z). Typically f(a_z) is discounted to be less than the ML estimate so we have some leftover probability for the z words unseen in the context (a_). Different algorithms mainly differ on how they discount the ML estimate to get f(a_z).
- Example
>>> # create a 3-gram model
>>> model = sppasNgramsModel(3)
>>> # count n-grams from data
>>> model.count(*corpusfiles)
>>> # estimates probas
>>> probas = model.probabilities(method="logml")
Methods to estimate the probabilities:
raw: return counts instead of probabilities
lograw: idem with log values
ml: return maximum likelihood (un-smoothed probabilities)
logml: idem with log values
- __init__(norder=1)[source]¶
Create a sppasNgramsModel instance.
- Parameters
norder – (int) n-gram order, between 1 and MAX_ORDER.
- append_sentences(sentences)[source]¶
Append a list of sentences in data counts.
- Parameters
sentences – (list) sentences with tokens separated by whitespace.
- count(*datafiles)[source]¶
Count ngrams from data files.
- Parameters
datafiles – (*args) is a set of file names, with UTF-8 encoding.
If the file contains more than one tier, only the first one is used.
- probabilities(method='lograw')[source]¶
Return a list of probabilities.
- Parameters
method – (str) method to estimate probabilities
- Returns
list of n-gram probabilities.
- Example
>>> probas = probabilities("logml")
>>> for t in probas[0]:
...     print(t)
('</s>', -1.066946789630613, None)
('<s>', -99.0, None)
(u'a', -0.3679767852945944, None)
(u'b', -0.5440680443502756, None)
(u'c', -0.9420080530223132, None)
(u'd', -1.066946789630613, None)
- set_end_symbol(symbol)[source]¶
Set the end sentence symbol.
- Parameters
symbol – (str) String to represent the end of a sentence.
- set_min_count(value)[source]¶
Fix a minimum count value, applied only to the maximum order.
Any observed n-gram with a count under the value is removed.
- Parameters
value – (int) Threshold for minimum count
- class annotations.Align.models.slm.sppasSLM[source]¶
Bases:
object
Statistical language model representation.
- interpolate(other)[source]¶
Interpolate the model with another one.
An N-Gram language model can be constructed from a linear interpolation of several models. In this case, the overall likelihood P(w|h) of a word w occurring after the history h is computed as the arithmetic average of P(w|h) for each of the models.
The default interpolation method is linear interpolation. In addition, log-linear interpolation of models is possible.
- Parameters
other – (sppasSLM)
- load_from_arpa(filename)[source]¶
Load the model from an ARPA-ASCII file.
- Parameters
filename – (str) Filename from which to read the model.