annotations.Align.models.slm package

Submodules

annotations.Align.models.slm.arpaio module

filename

sppas.src.annotations.Align.models.slm.arpaio.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

I/O for ARPA models.

class annotations.Align.models.slm.arpaio.sppasArpaIO[source]

Bases: object

ARPA statistical language models reader/writer.

This class is able to load statistical language models from, and save them to, ARPA-ASCII files.

__init__()[source]

Create a sppasArpaIO instance without a model.

load(filename)[source]

Load a model from an ARPA file.

Parameters

filename – (str) Name of the file of the model.

save(filename)[source]

Save the model into a file, in ARPA-ASCII format.

The ARPA format:

\data\
ngram 1=nb1
ngram 2=nb2
…
ngram N=nbN

\1-grams:
p(a_z) a_z bow(a_z)
…

\2-grams:
p(a_z) a_z bow(a_z)
…

\N-grams:
p(a_z) a_z
…

\end\

Parameters

filename – (str) File where to save the model.

set(slm)[source]

Set the model of the sppasSLM.

Parameters

slm – (list) List of tuples for 1-gram, 2-grams, …
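The ARPA layout described above can be produced with plain Python. The sketch below is a minimal writer, assuming the model is a list with one entry per n-gram order, each entry a list of (token sequence, log-probability, back-off) tuples as in the probabilities() example of sppasNgramsModel; `write_arpa` is a hypothetical helper, not the SPPAS API.

```python
def write_arpa(model, filename):
    """Write a toy model to an ARPA-ASCII file.

    Assumes model is a list of per-order lists of (sequence, logprob,
    backoff) tuples, backoff being None when absent (highest order).
    """
    with open(filename, "w", encoding="utf-8") as fp:
        fp.write("\\data\\\n")
        for order, grams in enumerate(model, start=1):
            fp.write("ngram {}={}\n".format(order, len(grams)))
        for order, grams in enumerate(model, start=1):
            fp.write("\n\\{}-grams:\n".format(order))
            for seq, logp, bow in grams:
                if bow is None:
                    fp.write("{:.6f}\t{}\n".format(logp, seq))
                else:
                    fp.write("{:.6f}\t{}\t{:.6f}\n".format(logp, seq, bow))
        fp.write("\n\\end\\\n")

# A one-order (unigram) toy model with the conventional sentence symbols.
unigrams = [("<s>", -99.0, None), ("a", -0.37, None), ("</s>", -1.07, None)]
write_arpa([unigrams], "tiny.arpa")
```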

annotations.Align.models.slm.ngramsmodel module

filename

sppas.src.annotations.Align.models.slm.ngramsmodel.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Data structure for n-grams models and training.

class annotations.Align.models.slm.ngramsmodel.sppasNgramCounter(n=1, wordslist=None)[source]

Bases: object

N-gram representation.

__init__(n=1, wordslist=None)[source]

Create a sppasNgramCounter instance.

Parameters
  • n – (int) n-gram order, between 1 and MAX_ORDER.

  • wordslist – (sppasVocabulary) a list of accepted tokens.

append_sentence(sentence)[source]

Append a sentence in a dictionary of data counts.

Parameters

sentence – (str) A sentence with tokens separated by whitespace.

count(*datafiles)[source]

Count ngrams of order n from data files.

Parameters

datafiles – (*args) Set of file names, with UTF-8 encoding.

If the file contains more than one tier, only the first one is used.

get_count(sequence)[source]

Get the count of a specific sequence.

Parameters

sequence – (str) tokens separated by whitespace.

Returns

(int)

get_ncount()[source]

Get the number of observed n-grams.

Start symbols are excluded when counting unigrams.

Returns

(int)

get_ngram_count(ngram)[source]

Get the count of a specific ngram.

Parameters

ngram – (tuple of str) Tuple of tokens.

Returns

(int)

get_ngrams()[source]

Get the list of alphabetically-ordered n-grams.

Returns

list of tuples

shave(value)[source]

Remove data if count is lower than the given value.

Parameters

value – (int) Threshold value

class annotations.Align.models.slm.ngramsmodel.sppasNgramsModel(norder=1)[source]

Bases: object

Statistical language model trainer.

A model is made of:

  • n-gram counts: a list of sppasNgramCounter instances.

  • n-gram probabilities.

How to estimate n-gram probabilities?

A slight bit of theory… The following is copied (cribbed!) from the SRILM web page: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html

  • a_z – An N-gram where a is the first word, z is the last word, and “_” represents 0 or more words in between.

  • c(a_z) – The count of N-gram a_z in the training data.

  • p(a_z) – The estimated conditional probability of the nth word z given the first n-1 words (a_) of an N-gram.

  • a_ – The n-1 word prefix of the N-gram a_z.

  • _z – The n-1 word suffix of the N-gram a_z.

N-gram models try to estimate the probability of a word z in the context of the previous n-1 words (a_). One way to estimate p(a_z) is to look at the number of times word z has followed the previous n-1 words (a_):

(1) p(a_z) = c(a_z) / c(a_)

This is known as the maximum likelihood (ML) estimate. Notice that it assigns zero probability to N-grams that have not been observed in the training data.
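The ML estimate above can be computed directly from counts. The snippet below does so for bigrams on a toy token sequence; it is an illustration, not a SPPAS API call.

```python
from collections import Counter

# Maximum likelihood estimate: p(a_z) = c(a_z) / c(a_), for bigrams.
tokens = "<s> a b a b a c </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def p_ml(a, z):
    """Probability of z following a: c(a z) / c(a)."""
    return bigram_counts[(a, z)] / unigram_counts[a]
```

Note that an unobserved bigram such as "b c" gets probability zero, which is exactly the weakness that smoothing addresses next.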

To avoid the zero probabilities, we take some probability mass from the observed N-grams and distribute it to unobserved N-grams. Such redistribution is known as smoothing or discounting. Most existing smoothing algorithms can be described by the following equation:

(2) p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

If the N-gram a_z has been observed in the training data, we use the distribution f(a_z). Typically f(a_z) is discounted to be less than the ML estimate so we have some leftover probability for the z words unseen in the context (a_). Different algorithms mainly differ on how they discount the ML estimate to get f(a_z).
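The back-off rule of the smoothing equation above can be sketched as a simple lookup: use the discounted probability f(a_z) when the bigram was observed, otherwise back off to bow(a_) times the lower-order probability. The dictionaries below are toy values, not a SPPAS data structure.

```python
# Discounted probabilities for seen bigrams, back-off weights per context,
# and lower-order (unigram) probabilities -- all illustrative toy values.
f = {("a", "b"): 0.6}
bow = {"a": 0.4}
p_uni = {"b": 0.5, "c": 0.5}

def p(a, z):
    """p(a_z) = f(a_z) if the bigram was seen, else bow(a_) * p(_z)."""
    if (a, z) in f:
        return f[(a, z)]
    return bow.get(a, 1.0) * p_uni[z]
```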

Example
>>> # create a 3-gram model
>>> model = sppasNgramsModel(3)
>>> # count n-grams from data
>>> model.count(*corpusfiles)
>>> # estimates probas
>>> probas = model.probabilities(method="logml")

Methods to estimate the probabilities:

  • raw: return counts instead of probabilities

  • lograw: idem with log values

  • ml: return maximum likelihood (un-smoothed probabilities)

  • logml: idem with log values
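The four methods can be illustrated on toy unigram counts: raw returns the counts themselves, lograw their logs, ml the normalized maximum-likelihood probabilities, and logml their logs. Base-10 logarithms are an assumption here, consistent with the ARPA convention.

```python
import math

# Toy unigram counts; the four estimates are derived from them.
counts = {"a": 3, "b": 2, "c": 1}
total = sum(counts.values())

raw = dict(counts)                                   # counts as-is
lograw = {w: math.log10(c) for w, c in counts.items()}
ml = {w: c / total for w, c in counts.items()}       # c(w) / N
logml = {w: math.log10(c / total) for w, c in counts.items()}
```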

__init__(norder=1)[source]

Create a sppasNgramsModel instance.

Parameters

norder – (int) n-gram order, between 1 and MAX_ORDER.

append_sentences(sentences)[source]

Append a list of sentences in data counts.

Parameters

sentences – (list) sentences with tokens separated by whitespace.

count(*datafiles)[source]

Count ngrams from data files.

Parameters

datafiles – (*args) Set of file names, with UTF-8 encoding.

If the file contains more than one tier, only the first one is used.

get_order()[source]

Return the n-gram order value.

Returns

(int) The n-gram order value.

probabilities(method='lograw')[source]

Return a list of probabilities.

Parameters

method – (str) method to estimate probabilities

Returns

list of n-gram probabilities.

Example
>>> probas = model.probabilities("logml")
>>> for t in probas[0]:
...     print(t)
('</s>', -1.066946789630613, None)
('<s>', -99.0, None)
(u'a', -0.3679767852945944, None)
(u'b', -0.5440680443502756, None)
(u'c', -0.9420080530223132, None)
(u'd', -1.066946789630613, None)
set_end_symbol(symbol)[source]

Set the end sentence symbol.

Parameters

symbol – (str) String to represent the end of a sentence.

set_min_count(value)[source]

Fix a minimum count value, applied only to the highest order.

Any observed n-gram with a count under the value is removed.

Parameters

value – (int) Threshold for minimum count

set_start_symbol(symbol)[source]

Set the start sentence symbol.

Parameters

symbol – (str) String to represent the beginning of a sentence.

set_vocab(filename)[source]

Fix a list of accepted tokens; other tokens are marked as unknown.

Parameters

filename – (str) Name of the file containing the list of tokens.

annotations.Align.models.slm.statlangmodel module

filename

sppas.src.annotations.Align.models.slm.statlangmodel.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Statistical language model representation and use.

class annotations.Align.models.slm.statlangmodel.sppasSLM[source]

Bases: object

Statistical language model representation.

__init__()[source]

Create a sppasSLM instance without a model.

evaluate(filename)[source]

Evaluate a model on a file (perplexity).
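The perplexity measure can be sketched as follows. This is not the sppasSLM.evaluate() implementation; it assumes the base-10 log-probability of each token in the evaluated file is available, as stored in ARPA models.

```python
# Perplexity: PP = 10 ** (-(1/N) * sum of base-10 log-probabilities).
# The log-probability values below are toy values.
logprobs = [-0.5, -1.0, -0.3, -0.7]
n = len(logprobs)
perplexity = 10 ** (-sum(logprobs) / n)
```

Lower perplexity means the model predicts the file's tokens better.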

interpolate(other)[source]

Interpolate the model with another one.

An N-Gram language model can be constructed from a linear interpolation of several models. In this case, the overall likelihood P(w|h) of a word w occurring after the history h is computed as the arithmetic average of P(w|h) for each of the models.

The default interpolation method is linear interpolation. In addition, log-linear interpolation of models is possible.

Parameters

other – (sppasSLM)
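The linear interpolation described above reduces to a weighted average of the two models' probabilities; equal weights give the arithmetic average. The sketch below uses two toy unigram distributions, and the weight value 0.5 is illustrative.

```python
# p_interp(w) = lam * p1(w) + (1 - lam) * p2(w), for each word w.
p1 = {"a": 0.7, "b": 0.3}   # toy unigram model 1
p2 = {"a": 0.4, "b": 0.6}   # toy unigram model 2
lam = 0.5

p_interp = {w: lam * p1[w] + (1.0 - lam) * p2[w] for w in p1}
```

A useful sanity check is that the interpolated values still sum to 1, since both inputs do.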

load_from_arpa(filename)[source]

Load the model from an ARPA-ASCII file.

Parameters

filename – (str) Filename from which to read the model.

save_as_arpa(filename)[source]

Save the model into an ARPA-ASCII file.

Parameters

filename – (str) Filename in which to write the model.

set(model)[source]

Set the language model.

Parameters

model – (list) List of lists of tuples for 1-gram, 2-grams, …

Module contents

filename

sppas.src.annotations.Align.models.slm.__init__.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

slm is a package to manage Statistical Language Models.

class annotations.Align.models.slm.sppasArpaIO[source]

Bases: object

ARPA statistical language models reader/writer.

This class is able to load statistical language models from, and save them to, ARPA-ASCII files.

__init__()[source]

Create a sppasArpaIO instance without a model.

load(filename)[source]

Load a model from an ARPA file.

Parameters

filename – (str) Name of the file of the model.

save(filename)[source]

Save the model into a file, in ARPA-ASCII format.

The ARPA format:

\data\
ngram 1=nb1
ngram 2=nb2
…
ngram N=nbN

\1-grams:
p(a_z) a_z bow(a_z)
…

\2-grams:
p(a_z) a_z bow(a_z)
…

\N-grams:
p(a_z) a_z
…

\end\

Parameters

filename – (str) File where to save the model.

set(slm)[source]

Set the model of the sppasSLM.

Parameters

slm – (list) List of tuples for 1-gram, 2-grams, …

class annotations.Align.models.slm.sppasNgramCounter(n=1, wordslist=None)[source]

Bases: object

N-gram representation.

__init__(n=1, wordslist=None)[source]

Create a sppasNgramCounter instance.

Parameters
  • n – (int) n-gram order, between 1 and MAX_ORDER.

  • wordslist – (sppasVocabulary) a list of accepted tokens.

append_sentence(sentence)[source]

Append a sentence in a dictionary of data counts.

Parameters

sentence – (str) A sentence with tokens separated by whitespace.

count(*datafiles)[source]

Count ngrams of order n from data files.

Parameters

datafiles – (*args) Set of file names, with UTF-8 encoding.

If the file contains more than one tier, only the first one is used.

get_count(sequence)[source]

Get the count of a specific sequence.

Parameters

sequence – (str) tokens separated by whitespace.

Returns

(int)

get_ncount()[source]

Get the number of observed n-grams.

Start symbols are excluded when counting unigrams.

Returns

(int)

get_ngram_count(ngram)[source]

Get the count of a specific ngram.

Parameters

ngram – (tuple of str) Tuple of tokens.

Returns

(int)

get_ngrams()[source]

Get the list of alphabetically-ordered n-grams.

Returns

list of tuples

shave(value)[source]

Remove data if count is lower than the given value.

Parameters

value – (int) Threshold value

class annotations.Align.models.slm.sppasNgramsModel(norder=1)[source]

Bases: object

Statistical language model trainer.

A model is made of:

  • n-gram counts: a list of sppasNgramCounter instances.

  • n-gram probabilities.

How to estimate n-gram probabilities?

A slight bit of theory… The following is copied (cribbed!) from the SRILM web page: http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html

  • a_z – An N-gram where a is the first word, z is the last word, and “_” represents 0 or more words in between.

  • c(a_z) – The count of N-gram a_z in the training data.

  • p(a_z) – The estimated conditional probability of the nth word z given the first n-1 words (a_) of an N-gram.

  • a_ – The n-1 word prefix of the N-gram a_z.

  • _z – The n-1 word suffix of the N-gram a_z.

N-gram models try to estimate the probability of a word z in the context of the previous n-1 words (a_). One way to estimate p(a_z) is to look at the number of times word z has followed the previous n-1 words (a_):

(1) p(a_z) = c(a_z) / c(a_)

This is known as the maximum likelihood (ML) estimate. Notice that it assigns zero probability to N-grams that have not been observed in the training data.

To avoid the zero probabilities, we take some probability mass from the observed N-grams and distribute it to unobserved N-grams. Such redistribution is known as smoothing or discounting. Most existing smoothing algorithms can be described by the following equation:

(2) p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)

If the N-gram a_z has been observed in the training data, we use the distribution f(a_z). Typically f(a_z) is discounted to be less than the ML estimate so we have some leftover probability for the z words unseen in the context (a_). Different algorithms mainly differ on how they discount the ML estimate to get f(a_z).

Example
>>> # create a 3-gram model
>>> model = sppasNgramsModel(3)
>>> # count n-grams from data
>>> model.count(*corpusfiles)
>>> # estimates probas
>>> probas = model.probabilities(method="logml")

Methods to estimate the probabilities:

  • raw: return counts instead of probabilities

  • lograw: idem with log values

  • ml: return maximum likelihood (un-smoothed probabilities)

  • logml: idem with log values

__init__(norder=1)[source]

Create a sppasNgramsModel instance.

Parameters

norder – (int) n-gram order, between 1 and MAX_ORDER.

append_sentences(sentences)[source]

Append a list of sentences in data counts.

Parameters

sentences – (list) sentences with tokens separated by whitespace.

count(*datafiles)[source]

Count ngrams from data files.

Parameters

datafiles – (*args) Set of file names, with UTF-8 encoding.

If the file contains more than one tier, only the first one is used.

get_order()[source]

Return the n-gram order value.

Returns

(int) The n-gram order value.

probabilities(method='lograw')[source]

Return a list of probabilities.

Parameters

method – (str) method to estimate probabilities

Returns

list of n-gram probabilities.

Example
>>> probas = model.probabilities("logml")
>>> for t in probas[0]:
...     print(t)
('</s>', -1.066946789630613, None)
('<s>', -99.0, None)
(u'a', -0.3679767852945944, None)
(u'b', -0.5440680443502756, None)
(u'c', -0.9420080530223132, None)
(u'd', -1.066946789630613, None)
set_end_symbol(symbol)[source]

Set the end sentence symbol.

Parameters

symbol – (str) String to represent the end of a sentence.

set_min_count(value)[source]

Fix a minimum count value, applied only to the highest order.

Any observed n-gram with a count under the value is removed.

Parameters

value – (int) Threshold for minimum count

set_start_symbol(symbol)[source]

Set the start sentence symbol.

Parameters

symbol – (str) String to represent the beginning of a sentence.

set_vocab(filename)[source]

Fix a list of accepted tokens; other tokens are marked as unknown.

Parameters

filename – (str) Name of the file containing the list of tokens.

class annotations.Align.models.slm.sppasSLM[source]

Bases: object

Statistical language model representation.

__init__()[source]

Create a sppasSLM instance without a model.

evaluate(filename)[source]

Evaluate a model on a file (perplexity).

interpolate(other)[source]

Interpolate the model with another one.

An N-Gram language model can be constructed from a linear interpolation of several models. In this case, the overall likelihood P(w|h) of a word w occurring after the history h is computed as the arithmetic average of P(w|h) for each of the models.

The default interpolation method is linear interpolation. In addition, log-linear interpolation of models is possible.

Parameters

other – (sppasSLM)

load_from_arpa(filename)[source]

Load the model from an ARPA-ASCII file.

Parameters

filename – (str) Filename from which to read the model.

save_as_arpa(filename)[source]

Save the model into an ARPA-ASCII file.

Parameters

filename – (str) Filename in which to write the model.

set(model)[source]

Set the language model.

Parameters

model – (list) List of lists of tuples for 1-gram, 2-grams, …