resources package

Submodules

resources.dictpron module

filename

sppas.src.resources.dictpron.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Pronunciation dictionary resource model.

class resources.dictpron.sppasDictPron(dict_filename=None, nodump=False)[source]

Bases: object

Pronunciation dictionary manager.

A pronunciation dictionary contains a list of tokens, each one with a list of possible pronunciations.

sppasDictPron can load the dictionary from an HTK-ASCII file. Each line of such file looks like the following:

acted [acted] { k t e d
acted(2) [acted] { k t i d

The first column contains the token, optionally followed by the variant number in parentheses. The second column (in brackets) is ignored; it should contain the token. The remaining columns are the phones, separated by whitespace. sppasDictPron accepts missing variant numbers, empty brackets, or missing brackets.
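The line format described above can be illustrated with a small parser. This is only a sketch under stated assumptions: `parse_htk_line` and its regular expression are hypothetical names written for this example, not part of SPPAS, and the sketch assumes the bracketed column contains no whitespace.

```python
import re

# Hypothetical helper, not the SPPAS implementation: split one HTK-ASCII
# line into (token, variant number or None, list of phones).
_ENTRY = re.compile(r"^(?P<token>\S+?)(?:\((?P<variant>\d+)\))?$")


def parse_htk_line(line):
    columns = line.split()
    match = _ENTRY.match(columns[0])
    token = match.group("token")
    variant = int(match.group("variant")) if match.group("variant") else None
    phones = columns[1:]
    # The bracketed column is optional and ignored when present.
    if phones and phones[0].startswith("[") and phones[0].endswith("]"):
        phones = phones[1:]
    return token, variant, phones


first = parse_htk_line("acted [acted] { k t e d")
second = parse_htk_line("acted(2) [acted] { k t i d")
```

Missing variant numbers, empty brackets and missing brackets are all handled by the same code path, which mirrors the tolerance described above.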

>>> d = sppasDictPron('eng.dict')
>>> d.add_pron('acted', '{ k t e')
>>> d.add_pron('acted', '{ k t i')

Then, the phonetization of a token can be accessed with the get_pron() method:

>>> print(d.get_pron('acted'))
{-k-t-e-d|{-k-t-i-d|{-k-t-e|{-k-t-i

The following convention is adopted to represent the pronunciation variants:

  • ‘-’ separates the phones (X-SAMPA standard)

  • ‘|’ separates the variants
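As an illustration of this convention, the sketch below converts between a list of variants and the string form. The helper names `variants_to_string` and `string_to_variants` are hypothetical and only meant to show how the two separators combine.

```python
# Hypothetical helpers illustrating the string convention; they are not
# part of sppasDictPron.
PHONE_SEP = "-"
VARIANT_SEP = "|"


def variants_to_string(variants):
    """Join a list of variants (each a list of phones) into one string."""
    return VARIANT_SEP.join(PHONE_SEP.join(phones) for phones in variants)


def string_to_variants(pron):
    """Split a pronunciation string back into a list of phone lists."""
    return [variant.split(PHONE_SEP) for variant in pron.split(VARIANT_SEP)]


text = variants_to_string([["{", "k", "t", "e", "d"], ["{", "k", "t", "i", "d"]])
```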

Notice that tokens in the dict are case-insensitive.

__init__(dict_filename=None, nodump=False)[source]

Create a sppasDictPron instance.

A dump file is a binary version of the dictionary. Its size is greater than the original ASCII dictionary but the time to load is divided by two or three.

Parameters
  • dict_filename – (str) Name of the file of the pronunciation dict

  • nodump – (bool) Disable the creation of a dump file.

add_pron(token, pron)[source]

Add a token/pron to the dict.

Parameters
  • token – (str) Unicode string of the token to add

  • pron – (str) A pronunciation in which the phonemes are separated by whitespace

static format_token(entry)[source]

Remove the CR/LF, tabs, multiple spaces and others… and convert to lowercase.

Parameters

entry – (str) a token

Returns

formatted token
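A minimal sketch of this kind of normalization is shown below. It covers only whitespace clean-up and lowercasing; whatever else the real method removes is not documented here, so `normalize_token` is a hypothetical stand-in rather than the actual implementation.

```python
def normalize_token(entry):
    """Collapse whitespace and lowercase a token (illustrative only)."""
    # str.split() without arguments splits on any run of whitespace
    # (spaces, tabs, CR, LF), so re-joining with one space collapses them.
    return " ".join(entry.split()).lower()


cleaned = normalize_token("  Hello\tWorld\r\n")
```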

get(entry, substitution='dummy')[source]

Return the pronunciations of an entry in the dictionary.

Parameters
  • entry – (str) A token to find in the dictionary

  • substitution – (str) String to return if the token is missing from the dict

Returns

Unicode string of the pronunciations, or the substitution.

get_filename()[source]

Return the name of the file from which the dict was loaded.

get_pron(entry)[source]

Return the pronunciations of an entry in the dictionary.

Parameters

entry – (str) A token to find in the dictionary

Returns

unicode of the pronunciations or the unknown stamp.

get_unkstamp()[source]

Return the unknown words stamp.

static ipa_to_sampa(conversion, ipa_entry)[source]

Convert a string in IPA to SAMPA.

Parameters
  • conversion – (dict)

  • ipa_entry – (str)

is_pron_of(entry, pron)[source]

Return True if pron is a pronunciation of entry.

Phonemes of pron are separated by "-".

Parameters
  • entry – (str) A unicode token to find in the dictionary

  • pron – (str) A unicode pronunciation

Returns

bool

is_unk(entry)[source]

Return True if an entry is unknown (not in the dictionary).

Parameters

entry – (str) A token to find in the dictionary

Returns

bool

load(filename)[source]

Load a pronunciation dictionary.

Parameters

filename – (str) Pronunciation dictionary file name

load_from_ascii(filename)[source]

Load a pronunciation dictionary from an HTK-ASCII file.

Parameters

filename – (str) Pronunciation dictionary file name

load_from_pls(filename)[source]

Load a pronunciation dictionary from a pls file (xml).

xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"

Parameters

filename – (str) Pronunciation dictionary file name

static load_sampa_ipa()[source]

Load the sampa-ipa conversion file.

Return it as a dict().

map_phones(map_table)[source]

Create a new dictionary by changing the phoneme strings.

Perform changes depending on a mapping table.

Parameters

map_table – (Mapping) A mapping table

Returns

a sppasDictPron instance with mapped phones

save_as_ascii(filename, with_variant_nb=True, with_filled_brackets=True)[source]

Save the pronunciation dictionary in HTK-ASCII format.

Parameters
  • filename – (str) Dictionary file name

  • with_variant_nb – (bool) Write the variant number or not

  • with_filled_brackets – (bool) Fill the bracket with the token

resources.dictrepl module

filename

sppas.src.resources.dictrepl.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Substitution table.

class resources.dictrepl.sppasDictRepl(dict_filename=None, nodump=False)[source]

Bases: object

A dictionary to manage automated replacements.

A dictionary with specific features for language resources. The main feature is that values are “accumulated”.

>>> d = sppasDictRepl()
>>> d.add("key", "v1")
>>> d.add("key", "v2")
>>> d.get("key")
v1|v2
>>> d.is_value("v1")
True
>>> d.is_value("v1|v2")
False

REPLACE_SEPARATOR = '|'
__init__(dict_filename=None, nodump=False)[source]

Create a sppasDictRepl instance.

Parameters
  • dict_filename – (str) The dictionary file name (2 columns)

  • nodump – (bool) Disable the creation of a dump file

A dump file is a binary version of the dictionary. Its size is greater than the original ASCII dictionary but the time to load it is divided by two or three.

add(token, repl)[source]

Add a new key/value pair to the dict.

Add as a new pair or append the value to the existing one with a “|” used as separator.

Parameters
  • token – (str) string of the token to add

  • repl – (str) the replacement token

Both token and repl are converted to unicode (if needed) and stripped.
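The accumulation behaviour can be sketched with a plain dict. `AccumulatingDict` below is a hypothetical stand-in written for illustration; it is not sppasDictRepl and it ignores file loading, dumps and token formatting.

```python
class AccumulatingDict:
    """Illustrative stand-in: values added to the same key are accumulated."""

    SEPARATOR = "|"

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        key, value = key.strip(), value.strip()
        if key in self._data:
            # Append to the existing value instead of overwriting it.
            self._data[key] += self.SEPARATOR + value
        else:
            self._data[key] = value

    def get(self, key, substitution=""):
        return self._data.get(key, substitution)

    def is_value(self, value):
        # A value matches only one accumulated item, never the joined string.
        return any(value in stored.split(self.SEPARATOR)
                   for stored in self._data.values())


d = AccumulatingDict()
d.add("key", "v1")
d.add("key", "v2")
```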

static format_token(entry)[source]

Remove the CR/LF, tabs, multiple spaces and others… and convert to lowercase.

Parameters

entry – (str) a token

Returns

formatted token

get(entry, substitution='')[source]

Return the value of a key of the dictionary or substitution.

Parameters
  • entry – (str) A token to find in the dictionary

  • substitution – (str) String to return if the token is missing from the dict

Returns

unicode of the replacement or the substitution.

get_filename()[source]

Return the name of the file from which the dictionary was loaded.

is_empty()[source]

Return True if there is no entry in the dictionary.

is_key(entry)[source]

Return True if entry is exactly a key in the dictionary.

Parameters

entry – (str) Unicode string.

is_unk(entry)[source]

Return True if entry is not a key in the dictionary.

Parameters

entry – (str) Unicode string.

is_value(entry)[source]

Return True if entry is a value in the dictionary.

Parameters

entry – (str) Unicode string.

is_value_of(key, entry)[source]

Return True if entry is a value of a given key in the dictionary.

Parameters
  • key – (str) Unicode string.

  • entry – (str) Unicode string.

load_from_ascii(filename)[source]

Load a replacement dictionary from an ascii file.

Parameters

filename – (str) Replacement dictionary file name

pop(entry)[source]

Remove an entry, as key.

Parameters

entry – (str) unicode string of the entry to remove

remove(entry)[source]

Remove an entry, as key or value.

Parameters

entry – (str) unicode string of the entry to remove

replace(key)[source]

Return the value of a key or None if key has no replacement.

replace_reversed(value)[source]

Return the key(s) of a value or an empty string.

Parameters

value – (str) value to search

Returns

a unicode string with all keys, separated by ‘_’, or an empty string if value does not exist.

save_as_ascii(filename)[source]

Save the replacement dictionary.

Parameters

filename – (str)

Returns

(bool)

resources.dumpfile module

filename

sppas.src.resources.dumpfile.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Dump file for resource models.

class resources.dumpfile.sppasDumpFile(filename, dump_extension='')[source]

Bases: object

Class to manage dump files.

A dump file is a binary version of an ASCII file. Its size is greater than the original ASCII one but the time to load it is divided by two or three.

DUMP_FILENAME_EXT = '.dump'
__init__(filename, dump_extension='')[source]

Create a sppasDumpFile instance.

Parameters
  • filename – (str) Name of the ASCII file.

  • dump_extension – (str) Extension of the dump file.

get_dump_extension()[source]

Return the extension of the dump version of filename.

get_dump_filename()[source]

Return the file name of the dump version of filename.

Returns

name of the dump file

has_dump()[source]

Test if a dump file exists for filename and if it is up-to-date.

Returns

(bool)
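The idea of an up-to-date dump can be sketched as follows. The sketch assumes the dump is a pickle file and that "up-to-date" means the dump is at least as recent as the ASCII file; both points are assumptions made for this example, and `dump_is_fresh` and `load_with_dump` are hypothetical names.

```python
import os
import pickle


def dump_is_fresh(ascii_path, dump_path):
    """Return True if dump_path exists and is at least as recent as ascii_path."""
    if not os.path.isfile(dump_path):
        return False
    return os.path.getmtime(dump_path) >= os.path.getmtime(ascii_path)


def load_with_dump(ascii_path, parse, dump_ext=".dump"):
    """Load parsed data, reusing a binary dump when it is fresh.

    `parse` is any callable that reads the ASCII file and returns the data.
    """
    dump_path = os.path.splitext(ascii_path)[0] + dump_ext
    if dump_is_fresh(ascii_path, dump_path):
        with open(dump_path, "rb") as handle:
            return pickle.load(handle)
    data = parse(ascii_path)
    with open(dump_path, "wb") as handle:
        pickle.dump(data, handle, pickle.HIGHEST_PROTOCOL)
    return data
```

On the second call for the same unchanged file, the parsing step is skipped and the data comes from the dump, which is where the loading-time gain would come from.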

load_from_dump()[source]

Load the file from a dumped file.

Returns

loaded data or None

save_as_dump(data)[source]

Save the data as a dumped file.

Parameters

data – The data to save

Returns

(bool)

set_dump_extension(extension='')[source]

Set the extension of the dump file.

Set to the default extension if the given extension is an empty string.

Parameters

extension – (str) Extension of the dump file (starting with or without the dot).

Raises

DumpExtensionError if extension of the dump file is the same as the ASCII file.

set_filename(filename)[source]

Set the name of the ASCII file.

Parameters

filename – (str) Name of the ASCII file.

resources.mapping module

filename

sppas.src.resources.mapping.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Mapping table.

class resources.mapping.sppasMapping(dict_name=None)[source]

Bases: resources.dictrepl.sppasDictRepl

Class to manage mapping tables.

A mapping is an extended replacement dictionary. sppasMapping is used for the management of a mapping table of any set of strings.

DEFAULT_SEP = (';', ',', '\n', ' ', '.', '|', '+', '-')
__init__(dict_name=None)[source]

Create a new sppasMapping instance.

Parameters

dict_name – (str) file name with the mapping data (2 columns)

get_miss_symbol()[source]

Return the symbol used to replace missing entries.

get_reverse()[source]

Return the boolean value of reverse member.

map(mstr, delimiters=(';', ',', '\n', ' ', '.', '|', '+', '-'), separator='')[source]

Run the Mapping process on an input string.

Parameters
  • mstr – input string to map

  • delimiters – (list) list of character delimiters. Default is: [';', ',', '\n', ' ', '.', '|', '+', '-']

  • separator – (char) used to separate parts of the mapped result (when the longest matching algorithm was used to map a string)

Returns

a string
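A simplified view of delimiter-based mapping is sketched below. It splits the input on the delimiters and looks up each part in a plain dict; it does not implement the longest matching algorithm or the reverse option, and `map_string` and the default `miss_symbol` are hypothetical choices for this example.

```python
import re

DEFAULT_SEP = (";", ",", "\n", " ", ".", "|", "+", "-")


def map_string(text, table, delimiters=DEFAULT_SEP, keep_miss=True, miss_symbol="*"):
    """Map each delimited part of text through table (illustrative only)."""
    # The capturing group makes re.split keep the delimiters in the result.
    pattern = "([" + "".join(re.escape(d) for d in delimiters) + "])"
    mapped = []
    for part in re.split(pattern, text):
        if part == "" or part in delimiters:
            mapped.append(part)          # delimiters are copied unchanged
        elif part in table:
            mapped.append(table[part])   # known entry: replace key by value
        else:
            mapped.append(part if keep_miss else miss_symbol)
    return "".join(mapped)


kept = map_string("a-b|c", {"a": "A", "b": "B"})
replaced = map_string("a-b|c", {"a": "A", "b": "B"}, keep_miss=False)
```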

map_entry(entry)[source]

Map an entry (a key or a value).

Parameters

entry – (str) input string to map

Returns

the mapped entry, as a string

set_keep_miss(keep_miss)[source]

Set the keep_miss option.

Parameters

keep_miss – (bool) If keep_miss is set to True, each missing entry is kept without change; otherwise, each missing entry is replaced by a specific symbol.

set_miss_symbol(symbol)[source]

Set the symbol to be used if keep_miss is False.

Parameters

symbol – (str) US-ASCII symbol to be used when a symbol is missing from the mapping table.

set_reverse(reverse)[source]

Set the reverse option.

Parameters

reverse – (bool) If reverse is set to True, the mapping will replace value by key instead of replacing key by value.

resources.patterns module

filename

sppas.src.resources.patterns.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Pattern matching.

class resources.patterns.sppasPatterns[source]

Bases: object

Pattern matching.

Pattern matching aims at checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact.

Several pattern matching algorithms are implemented in this class. They make it possible to find a hypothesis pattern in a reference.

MAX_GAP = 4
MAX_NGRAM = 8
__init__()[source]

Create a new sppasPatterns instance.

dp_matching(ref, hyp)[source]

Dynamic Programming alignment of ref and hyp.

The DP alignment algorithm performs a global minimization of a Levenshtein distance function which weights the cost of correct words, insertions, deletions and substitutions as 0, 3, 3 and 4 respectively.

See:
TIME WARPS, STRING EDITS, AND MACROMOLECULES:
THE THEORY AND PRACTICE OF SEQUENCE COMPARISON,
by Sankoff and Kruskal, ISBN 0-201-07809-0
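
The weighted edit distance described above can be sketched with a standard dynamic programming table. This is a generic illustration using the stated costs (0 for a correct word, 3 for an insertion, 3 for a deletion, 4 for a substitution); `dp_distance` is a hypothetical function that returns only the minimal cost, not the alignment produced by dp_matching.

```python
def dp_distance(ref, hyp, ins_cost=3, del_cost=3, sub_cost=4):
    """Return the minimal weighted edit cost between two token lists."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    cost = [[0] * cols for _ in range(rows)]
    # First column: every reference token is deleted.
    for i in range(1, rows):
        cost[i][0] = i * del_cost
    # First row: every hypothesis token is inserted.
    for j in range(1, cols):
        cost[0][j] = j * ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            match = 0 if ref[i - 1] == hyp[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + match,   # correct or substitution
                             cost[i - 1][j] + del_cost,    # deletion
                             cost[i][j - 1] + ins_cost)    # insertion
    return cost[-1][-1]


distance = dp_distance(["a", "b", "c"], ["a", "x", "c"])
```

With these weights a substitution (4) is cheaper than a deletion followed by an insertion (6), so mismatched words at the same position are aligned together rather than split.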
get_gap()[source]

Return the gap value (int).

get_ngram()[source]

Return the n value for n-grams (int).

get_score()[source]

Return the score value (float).

ngram_alignments(ref, hyp)[source]

n-gram alignment of ref and hyp.

The algorithm is based on finding matching n-grams within the range of a given gap. For 1-grams, only hypothesis items with a high confidence score are kept. A search gap has to be set. An interstice value ensures that the gap between an item in the ref and in the hyp won’t be too large.

Parameters
  • ref – (list of tokens) List of references

  • hyp – (list of tuples) List of hypotheses with their scores

The scores are supposed to range in [0;1].

Returns

List of alignment indexes as tuples (i_ref, i_hyp)

Example:

ref: w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12
hyp: w0 w1 w2 wX w3 w5 w6 wX w9

Returned matches:

  • if n=3: [ (0,0), (1,1), (2,2) ]

  • if n=2: [(0, 0), (1, 1), (2, 2), (5, 5), (6, 6)]

  • if n=1, it depends on the scores in hyp and the value of the gap.

ngram_matches(ref, hyp)[source]

n-gram matches between ref and hyp.

Search for common n-gram sequences of hyp in ref. The scores are supposed to range in [0;1] values.

Parameters
  • ref – (list of tokens) List of references

  • hyp – (list of tuples) List of hypotheses with their scores

Returns

List of matching indexes as tuples (i_ref, i_hyp)
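
The search for common n-grams can be sketched as below, using the ref and hyp sequences from the ngram_alignments example. This is a simplified illustration: `common_ngram_indexes` is a hypothetical function, it takes plain token lists (no scores), and the way it applies the gap is an assumption, not the SPPAS implementation.

```python
def common_ngram_indexes(ref, hyp, n=2, gap=2):
    """Return sorted (i_ref, i_hyp) pairs covered by n-grams common to both lists."""
    pairs = set()
    for j in range(len(hyp) - n + 1):
        hyp_gram = hyp[j:j + n]
        # Only compare with reference positions within `gap` of position j.
        low = max(0, j - gap)
        high = min(len(ref) - n, j + gap)
        for i in range(low, high + 1):
            if ref[i:i + n] == hyp_gram:
                pairs.update((i + k, j + k) for k in range(n))
    return sorted(pairs)


ref = ["w%d" % k for k in range(13)]
hyp = "w0 w1 w2 wX w3 w5 w6 wX w9".split()
trigram_pairs = common_ngram_indexes(ref, hyp, n=3)
bigram_pairs = common_ngram_indexes(ref, hyp, n=2)
```

On this input the sketch gives the same pairs as those listed for n=3 and n=2 in the ngram_alignments example.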

set_gap(g)[source]

Set the value of the gap.

Parameters

g – (int) Value of the gap (0<g<MAX_GAP)

Raises

GapRangeError

set_ngram(n)[source]

Set the value of n for the n-grams.

Parameters

n – (int) Value of n (1<n<MAX_NGRAM)

Raises

NgramRangeError

set_score(s)[source]

Set the value of the score.

Parameters

s – (float) Value of the score (0<s<1)

Raises

ScoreRangeError

resources.resourcesexc module

filename

sppas.src.resources.resourcesexc.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Exceptions of the resources package.

exception resources.resourcesexc.DumpExtensionError(extension)[source]

Bases: ValueError

:ERROR 5030:.

The dump file can’t have the same extension as the ASCII file ({extension}).

__init__(extension)[source]
exception resources.resourcesexc.FileFormatError(line_number, filename)[source]

Bases: ValueError

:ERROR 5015:.

Read file failed at line number {number}: {string}.

__init__(line_number, filename)[source]
exception resources.resourcesexc.FileIOError(filename)[source]

Bases: Exception

:ERROR 5010:.

Error while trying to open and read the file: {name}.

__init__(filename)[source]
exception resources.resourcesexc.FileUnicodeError(filename)[source]

Bases: UnicodeDecodeError

:ERROR 5005:.

Encoding error while trying to read the file: {name}.

__init__(filename)[source]
exception resources.resourcesexc.GapRangeError(maxi, value)[source]

Bases: ValueError

:ERROR 5022:.

The gap value of pattern matching must range [0;{maximum}]. Got {observed}.

__init__(maxi, value)[source]
exception resources.resourcesexc.NgramRangeError(maxi, value)[source]

Bases: ValueError

:ERROR 5020:.

The n value of n-grams pattern matching must range [1;{maximum}]. Got {observed}.

__init__(maxi, value)[source]
exception resources.resourcesexc.PositiveValueError(count)[source]

Bases: ValueError

:ERROR 5040:.

The count value must be positive. Got ({count}).

__init__(count)[source]
exception resources.resourcesexc.ScoreRangeError(value)[source]

Bases: ValueError

:ERROR 5024:.

The score value of unigrams pattern matching must range [0;1]. Got {observed}.

__init__(value)[source]

resources.unigram module

filename

sppas.src.resources.unigram.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Data structure for a set of token/count.

class resources.unigram.sppasUnigram(filename=None, nodump=True)[source]

Bases: object

Class to represent a simple unigram: a set of token/count.

A unigram is commonly a data structure with tokens and their probabilities, and a back-off value. It is a statistical language model. This class is a simplified version with only tokens and their occurrences.

Notice that tokens are case-sensitive.

__init__(filename=None, nodump=True)[source]

Create a sppasUnigram instance.

Parameters
  • filename – (str) Name of the file with words and counts (2 columns)

  • nodump – (bool) Disable the creation of a dump file

add(entry, value=1)[source]

Add or increment a token in the unigram.

Parameters
  • entry – (str) String of the token to add

  • value – (int) Value to increment the count

Raises

PositiveValueError
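
The add-or-increment behaviour can be sketched with a plain dict of counts. `TokenCounts` is a hypothetical stand-in: it raises a built-in ValueError where the real class raises PositiveValueError, and treating zero as invalid and returning 0 for unknown tokens are assumptions made for this example.

```python
class TokenCounts:
    """Illustrative token/count store (not sppasUnigram)."""

    def __init__(self):
        self._counts = {}

    def add(self, token, value=1):
        if value <= 0:
            raise ValueError(
                "The count value must be positive. Got ({}).".format(value))
        # Create the entry on first use, otherwise increment its count.
        self._counts[token] = self._counts.get(token, 0) + value

    def get_count(self, token):
        return self._counts.get(token, 0)

    def get_sum(self):
        return sum(self._counts.values())


counts = TokenCounts()
counts.add("the")
counts.add("the", 2)
counts.add("The")  # tokens are case-sensitive, so this is a separate entry
```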

get_count(token)[source]

Return the count of a token.

Parameters

token – (str) The string of the token

get_sum()[source]

Return the sum of all counts (of all tokens).

get_tokens()[source]

Return a list with all tokens.

load_from_ascii(filename)[source]

Load a unigram from a file with two columns: word count.

Parameters

filename – (str) Name of the unigram ASCII file to read

save_as_ascii(filename)[source]

Save a unigram into a file with two columns: word freq.

Parameters

filename – (str) Name of the unigram ASCII file to write

Returns

(bool)

resources.vocab module

filename

sppas.src.resources.vocab.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

A vocabulary resource model.

class resources.vocab.sppasVocabulary(filename=None, nodump=False, case_sensitive=False)[source]

Bases: object

Class to represent a list of words.

__init__(filename=None, nodump=False, case_sensitive=False)[source]

Create a sppasVocabulary instance.

Parameters
  • filename – (str) Name of the file with the list of words.

  • nodump – (bool) Disable the creation of a dump file.

  • case_sensitive – (bool) Whether the list of words is case-sensitive

add(entry)[source]

Add an entry to the list unless it is already present.

Parameters

entry – (str) The entry to add in the word list

Returns

(bool)

clear()[source]

Remove all entries of the vocabulary.

copy()[source]

Make a deep copy of the instance.

Returns

sppasVocabulary

get_filename()[source]

Return the name of the file from which the vocab was loaded.

get_list()[source]

Return the list of entries, sorted in alpha-numeric order.

is_in(entry)[source]

Return True if entry is in the list.

Parameters

entry – (str)

is_unk(entry)[source]

Return True if entry is unknown (not in the list).

Parameters

entry – (str)

load_from_ascii(filename)[source]

Read words from a file: one per line.

Parameters

filename – (str)

save(filename)[source]

Save the list of words in a file.

Parameters

filename – (str)

Returns

(bool)

resources.wordstrain module

filename

sppas.src.resources.wordstrain.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

A very simplified but multilingual lemmatizer.

class resources.wordstrain.sppasWordStrain(filename=None)[source]

Bases: resources.dictrepl.sppasDictRepl

Sort of basic lemmatization.

__init__(filename=None)[source]

Create a sppasWordStrain instance.

Parameters

filename – (str) 2- or 3-column file with word/freq/wordstrain

load(filename)[source]

Load word substitutions from a file.

Replace the existing substitutions.

Parameters

filename – (str) 2- or 3-column file with word/freq/replacement

Module contents

filename

sppas.src.resources.__init__.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Resource models of SPPAS.

resources: access and manage linguistic resources

This package includes classes to manage the data of linguistic types like lexicons, pronunciation dictionaries, patterns, etc.

Requires the following other packages:

  • config

class resources.sppasDictPron(dict_filename=None, nodump=False)[source]

Bases: object

Pronunciation dictionary manager.

A pronunciation dictionary contains a list of tokens, each one with a list of possible pronunciations.

sppasDictPron can load the dictionary from an HTK-ASCII file. Each line of such file looks like the following:

acted [acted] { k t e d
acted(2) [acted] { k t i d

The first column contains the token, optionally followed by the variant number in parentheses. The second column (in brackets) is ignored; it should contain the token. The remaining columns are the phones, separated by whitespace. sppasDictPron accepts missing variant numbers, empty brackets, or missing brackets.

>>> d = sppasDictPron('eng.dict')
>>> d.add_pron('acted', '{ k t e')
>>> d.add_pron('acted', '{ k t i')

Then, the phonetization of a token can be accessed with the get_pron() method:

>>> print(d.get_pron('acted'))
{-k-t-e-d|{-k-t-i-d|{-k-t-e|{-k-t-i

The following convention is adopted to represent the pronunciation variants:

  • ‘-’ separates the phones (X-SAMPA standard)

  • ‘|’ separates the variants

Notice that tokens in the dict are case-insensitive.

__init__(dict_filename=None, nodump=False)[source]

Create a sppasDictPron instance.

A dump file is a binary version of the dictionary. Its size is greater than the original ASCII dictionary but the time to load is divided by two or three.

Parameters
  • dict_filename – (str) Name of the file of the pronunciation dict

  • nodump – (bool) Disable the creation of a dump file.

add_pron(token, pron)[source]

Add a token/pron to the dict.

Parameters
  • token – (str) Unicode string of the token to add

  • pron – (str) A pronunciation in which the phonemes are separated by whitespace

static format_token(entry)[source]

Remove the CR/LF, tabs, multiple spaces and others… and convert to lowercase.

Parameters

entry – (str) a token

Returns

formatted token

get(entry, substitution='dummy')[source]

Return the pronunciations of an entry in the dictionary.

Parameters
  • entry – (str) A token to find in the dictionary

  • substitution – (str) String to return if the token is missing from the dict

Returns

Unicode string of the pronunciations, or the substitution.

get_filename()[source]

Return the name of the file from which the dict was loaded.

get_pron(entry)[source]

Return the pronunciations of an entry in the dictionary.

Parameters

entry – (str) A token to find in the dictionary

Returns

unicode of the pronunciations or the unknown stamp.

get_unkstamp()[source]

Return the unknown words stamp.

static ipa_to_sampa(conversion, ipa_entry)[source]

Convert a string in IPA to SAMPA.

Parameters
  • conversion – (dict)

  • ipa_entry – (str)

is_pron_of(entry, pron)[source]

Return True if pron is a pronunciation of entry.

Phonemes of pron are separated by "-".

Parameters
  • entry – (str) A unicode token to find in the dictionary

  • pron – (str) A unicode pronunciation

Returns

bool

is_unk(entry)[source]

Return True if an entry is unknown (not in the dictionary).

Parameters

entry – (str) A token to find in the dictionary

Returns

bool

load(filename)[source]

Load a pronunciation dictionary.

Parameters

filename – (str) Pronunciation dictionary file name

load_from_ascii(filename)[source]

Load a pronunciation dictionary from an HTK-ASCII file.

Parameters

filename – (str) Pronunciation dictionary file name

load_from_pls(filename)[source]

Load a pronunciation dictionary from a pls file (xml).

xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"

Parameters

filename – (str) Pronunciation dictionary file name

static load_sampa_ipa()[source]

Load the sampa-ipa conversion file.

Return it as a dict().

map_phones(map_table)[source]

Create a new dictionary by changing the phoneme strings.

Perform changes depending on a mapping table.

Parameters

map_table – (Mapping) A mapping table

Returns

a sppasDictPron instance with mapped phones

save_as_ascii(filename, with_variant_nb=True, with_filled_brackets=True)[source]

Save the pronunciation dictionary in HTK-ASCII format.

Parameters
  • filename – (str) Dictionary file name

  • with_variant_nb – (bool) Write the variant number or not

  • with_filled_brackets – (bool) Fill the bracket with the token

class resources.sppasDictRepl(dict_filename=None, nodump=False)[source]

Bases: object

A dictionary to manage automated replacements.

A dictionary with specific features for language resources. The main feature is that values are “accumulated”.

>>> d = sppasDictRepl()
>>> d.add("key", "v1")
>>> d.add("key", "v2")
>>> d.get("key")
v1|v2
>>> d.is_value("v1")
True
>>> d.is_value("v1|v2")
False

REPLACE_SEPARATOR = '|'
__init__(dict_filename=None, nodump=False)[source]

Create a sppasDictRepl instance.

Parameters
  • dict_filename – (str) The dictionary file name (2 columns)

  • nodump – (bool) Disable the creation of a dump file

A dump file is a binary version of the dictionary. Its size is greater than the original ASCII dictionary but the time to load it is divided by two or three.

add(token, repl)[source]

Add a new key/value pair to the dict.

Add as a new pair or append the value to the existing one with a “|” used as separator.

Parameters
  • token – (str) string of the token to add

  • repl – (str) the replacement token

Both token and repl are converted to unicode (if needed) and stripped.

static format_token(entry)[source]

Remove the CR/LF, tabs, multiple spaces and others… and convert to lowercase.

Parameters

entry – (str) a token

Returns

formatted token

get(entry, substitution='')[source]

Return the value of a key of the dictionary or substitution.

Parameters
  • entry – (str) A token to find in the dictionary

  • substitution – (str) String to return if the token is missing from the dict

Returns

unicode of the replacement or the substitution.

get_filename()[source]

Return the name of the file from which the dictionary was loaded.

is_empty()[source]

Return True if there is no entry in the dictionary.

is_key(entry)[source]

Return True if entry is exactly a key in the dictionary.

Parameters

entry – (str) Unicode string.

is_unk(entry)[source]

Return True if entry is not a key in the dictionary.

Parameters

entry – (str) Unicode string.

is_value(entry)[source]

Return True if entry is a value in the dictionary.

Parameters

entry – (str) Unicode string.

is_value_of(key, entry)[source]

Return True if entry is a value of a given key in the dictionary.

Parameters
  • key – (str) Unicode string.

  • entry – (str) Unicode string.

load_from_ascii(filename)[source]

Load a replacement dictionary from an ascii file.

Parameters

filename – (str) Replacement dictionary file name

pop(entry)[source]

Remove an entry, as key.

Parameters

entry – (str) unicode string of the entry to remove

remove(entry)[source]

Remove an entry, as key or value.

Parameters

entry – (str) unicode string of the entry to remove

replace(key)[source]

Return the value of a key or None if key has no replacement.

replace_reversed(value)[source]

Return the key(s) of a value or an empty string.

Parameters

value – (str) value to search

Returns

a unicode string with all keys, separated by ‘_’, or an empty string if value does not exist.

save_as_ascii(filename)[source]

Save the replacement dictionary.

Parameters

filename – (str)

Returns

(bool)

class resources.sppasDumpFile(filename, dump_extension='')[source]

Bases: object

Class to manage dump files.

A dump file is a binary version of an ASCII file. Its size is greater than the original ASCII one but the time to load it is divided by two or three.

DUMP_FILENAME_EXT = '.dump'
__init__(filename, dump_extension='')[source]

Create a sppasDumpFile instance.

Parameters
  • filename – (str) Name of the ASCII file.

  • dump_extension – (str) Extension of the dump file.

get_dump_extension()[source]

Return the extension of the dump version of filename.

get_dump_filename()[source]

Return the file name of the dump version of filename.

Returns

name of the dump file

has_dump()[source]

Test if a dump file exists for filename and if it is up-to-date.

Returns

(bool)

load_from_dump()[source]

Load the file from a dumped file.

Returns

loaded data or None

save_as_dump(data)[source]

Save the data as a dumped file.

Parameters

data – The data to save

Returns

(bool)

set_dump_extension(extension='')[source]

Set the extension of the dump file.

Set to the default extension if the given extension is an empty string.

Parameters

extension – (str) Extension of the dump file (starting with or without the dot).

Raises

DumpExtensionError if extension of the dump file is the same as the ASCII file.

set_filename(filename)[source]

Set the name of the ASCII file.

Parameters

filename – (str) Name of the ASCII file.

class resources.sppasMapping(dict_name=None)[source]

Bases: resources.dictrepl.sppasDictRepl

Class to manage mapping tables.

A mapping is an extended replacement dictionary. sppasMapping is used for the management of a mapping table of any set of strings.

DEFAULT_SEP = (';', ',', '\n', ' ', '.', '|', '+', '-')
__init__(dict_name=None)[source]

Create a new sppasMapping instance.

Parameters

dict_name – (str) file name with the mapping data (2 columns)

get_miss_symbol()[source]

Return the symbol used to replace missing entries.

get_reverse()[source]

Return the boolean value of reverse member.

map(mstr, delimiters=(';', ',', '\n', ' ', '.', '|', '+', '-'), separator='')[source]

Run the Mapping process on an input string.

Parameters
  • mstr – input string to map

  • delimiters – (list) list of character delimiters. Default is: [';', ',', '\n', ' ', '.', '|', '+', '-']

  • separator – (char) used to separate parts of the mapped result (when the longest matching algorithm was used to map a string)

Returns

a string

map_entry(entry)[source]

Map an entry (a key or a value).

Parameters

entry – (str) input string to map

Returns

the mapped entry, as a string

set_keep_miss(keep_miss)[source]

Set the keep_miss option.

Parameters

keep_miss – (bool) If keep_miss is set to True, each missing entry is kept without change; otherwise, each missing entry is replaced by a specific symbol.

set_miss_symbol(symbol)[source]

Set the symbol to be used if keep_miss is False.

Parameters

symbol – (str) US-ASCII symbol to be used when a symbol is missing from the mapping table.

set_reverse(reverse)[source]

Set the reverse option.

Parameters

reverse – (bool) If reverse is set to True, the mapping will replace value by key instead of replacing key by value.

class resources.sppasPatterns[source]

Bases: object

Pattern matching.

Pattern matching aims at checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact.

Several pattern matching algorithms are implemented in this class. They make it possible to find a hypothesis pattern in a reference.

MAX_GAP = 4
MAX_NGRAM = 8
__init__()[source]

Create a new sppasPatterns instance.

dp_matching(ref, hyp)[source]

Dynamic Programming alignment of ref and hyp.

The DP alignment algorithm performs a global minimization of a Levenshtein distance function which weights the cost of correct words, insertions, deletions and substitutions as 0, 3, 3 and 4 respectively.

See:
TIME WARPS, STRING EDITS, AND MACROMOLECULES:
THE THEORY AND PRACTICE OF SEQUENCE COMPARISON,
by Sankoff and Kruskal, ISBN 0-201-07809-0
get_gap()[source]

Return the gap value (int).

get_ngram()[source]

Return the n value for n-grams (int).

get_score()[source]

Return the score value (float).

ngram_alignments(ref, hyp)[source]

n-gram alignment of ref and hyp.

The algorithm is based on finding matching n-grams within the range of a given gap. For 1-grams, only hypothesis items with a high confidence score are kept. A search gap has to be set. An interstice value ensures that the gap between an item in the ref and in the hyp won’t be too large.

Parameters
  • ref – (list of tokens) List of references

  • hyp – (list of tuples) List of hypothesis with their scores

The scores are supposed to range in [0;1] values.

Returns

List of alignment indexes as tuples (i_ref, i_hyp)

Example:

ref: w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12
     |  |  |   \    |  |       /
hyp: w0 w1 w2 wX w3 w5 w6 wX w9

Returned matches:

  • if n=3: [ (0,0), (1,1), (2,2) ]

  • if n=2: [(0, 0), (1, 1), (2, 2), (5, 5), (6, 6)]

  • if n=1, it depends on the scores in hyp and the value of the gap.
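
The n=3 and n=2 results of this example can be reproduced with a small sketch. It only assumes what the description states (matching n-grams searched within a gap); the function name ngram_matches_sketch, its n and gap arguments, and the placeholder score of 1.0 are hypothetical, and the score-based filtering used for 1-grams is not modelled.

```python
def ngram_matches_sketch(ref, hyp, n=2, gap=2):
    """Return sorted (i_ref, i_hyp) pairs covered by a common n-gram."""
    hyp_tokens = [token for token, _score in hyp]
    matches = set()
    for j in range(len(hyp_tokens) - n + 1):
        hyp_ngram = tuple(hyp_tokens[j:j + n])
        # Only reference positions within `gap` of the hyp position.
        first = max(0, j - gap)
        last = min(len(ref) - n, j + gap)
        for i in range(first, last + 1):
            if tuple(ref[i:i + n]) == hyp_ngram:
                for k in range(n):
                    matches.add((i + k, j + k))
    return sorted(matches)


ref = ["w%d" % i for i in range(13)]
hyp = [(t, 1.0) for t in
       ["w0", "w1", "w2", "wX", "w3", "w5", "w6", "wX", "w9"]]
print(ngram_matches_sketch(ref, hyp, n=3))
# [(0, 0), (1, 1), (2, 2)]
print(ngram_matches_sketch(ref, hyp, n=2))
# [(0, 0), (1, 1), (2, 2), (5, 5), (6, 6)]
```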

ngram_matches(ref, hyp)[source]

n-gram matches between ref and hyp.

Search for common n-gram sequences of hyp in ref. The scores are expected to lie in the range [0;1].

Parameters
  • ref – (list of tokens) List of references

  • hyp – (list of tuples) List of hypothesis with their scores

Returns

List of matching indexes as tuples (i_ref, i_hyp)

set_gap(g)[source]

Fix the value of the gap.

Parameters

g – (int) Value of the gap (0<g<MAX_GAP)

Raises

GapRangeError

set_ngram(n)[source]

Fix the value of n of the n-grams.

Parameters

n – (int) Value of n (1<n<MAX_NGRAM)

Raises

NgramRangeError

set_score(s)[source]

Fix the value of the score.

Parameters

s – (float) Value of the score (0<s<1)

Raises

ScoreRangeError

class resources.sppasUnigram(filename=None, nodump=True)[source]

Bases: object

Class to represent a simple unigram: a set of token/count.

A unigram is commonly a data structure with tokens and their probabilities, and a back-off value. It is a statistical language model. This class is a simplified version with only tokens and their occurrences.

Notice that tokens are case-sensitive.

__init__(filename=None, nodump=True)[source]

Create a sppasUnigram instance.

Parameters
  • filename – (str) Name of the file with words and counts (2 columns)

  • nodump – (bool) Disable the creation of a dump file

add(entry, value=1)[source]

Add or increment a token in the unigram.

Parameters
  • entry – (str) String of the token to add

  • value – (int) Value to increment the count

Raises

PositiveValueError

get_count(token)[source]

Return the count of a token.

Parameters

token – (str) The string of the token

get_sum()[source]

Return the sum of all counts (of all tokens).

get_tokens()[source]

Return a list with all tokens.

load_from_ascii(filename)[source]

Load a unigram from a file with two columns: word count.

Parameters

filename – (str) Name of the unigram ASCII file to read

save_as_ascii(filename)[source]

Save a unigram into a file with two columns: word freq.

Parameters

filename – (str) Name of the unigram ASCII file to write

Returns

(bool)
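
The two-column "word count" format read and written by the methods above can be sketched as follows. This is a stand-alone illustration built on collections.Counter, not the sppasUnigram code; save_counts, load_counts and the file name are hypothetical.

```python
import os
import tempfile
from collections import Counter


def save_counts(counts, filename):
    """Write one 'token count' pair per line."""
    with open(filename, "w", encoding="utf-8") as f:
        for token, count in sorted(counts.items()):
            f.write("{} {}\n".format(token, count))


def load_counts(filename):
    """Read 'token count' lines; lines without two columns are skipped."""
    counts = Counter()
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            counts[parts[0]] += int(parts[1])
    return counts


counts = Counter()
counts["the"] += 1
counts["the"] += 2      # incrementing an existing token
counts["The"] += 1      # tokens are case-sensitive: a distinct entry
path = os.path.join(tempfile.mkdtemp(), "unigram.txt")
save_counts(counts, path)
reloaded = load_counts(path)
print(reloaded["the"], reloaded["The"], sum(reloaded.values()))
```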

class resources.sppasVocabulary(filename=None, nodump=False, case_sensitive=False)[source]

Bases: object

Class to represent a list of words.

__init__(filename=None, nodump=False, case_sensitive=False)[source]

Create a sppasVocabulary instance.

Parameters
  • filename – (str) Name of the file with the list of words.

  • nodump – (bool) Disable the creation of a dump file.

  • case_sensitive – (bool) Whether the list of words is case-sensitive or not

add(entry)[source]

Add an entry into the list except if the entry is already inside.

Parameters

entry – (str) The entry to add in the word list

Returns

(bool)

clear()[source]

Remove all entries of the vocabulary.

copy()[source]

Make a deep copy of the instance.

Returns

sppasVocabulary

get_filename()[source]

Return the name of the file from which the vocab comes.

get_list()[source]

Return the list of entries, sorted in alpha-numeric order.

is_in(entry)[source]

Return True if entry is in the list.

Parameters

entry – (str)

is_unk(entry)[source]

Return True if entry is unknown (not in the list).

Parameters

entry – (str)
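
The behaviour of add(), is_in(), is_unk() and get_list() described above can be sketched with a set-based class. This is an assumption-based sketch rather than the sppasVocabulary code: the class name VocabularySketch is hypothetical, and the way entries are normalized (whitespace collapsed, lowercased when not case-sensitive) is an assumption.

```python
class VocabularySketch:
    """Minimal word list with optional case-insensitivity (sketch)."""

    def __init__(self, case_sensitive=False):
        self._case_sensitive = case_sensitive
        self._entries = set()

    def _format(self, entry):
        # Assumed normalization: collapse whitespace, optionally lowercase.
        entry = " ".join(entry.split())
        return entry if self._case_sensitive else entry.lower()

    def add(self, entry):
        """Add an entry; return False if it was already in the list."""
        entry = self._format(entry)
        if entry in self._entries:
            return False
        self._entries.add(entry)
        return True

    def is_in(self, entry):
        return self._format(entry) in self._entries

    def is_unk(self, entry):
        return not self.is_in(entry)

    def get_list(self):
        return sorted(self._entries)


vocab = VocabularySketch()
first = vocab.add("Paris")
second = vocab.add("paris")   # same entry when case-insensitive
vocab.add("aix")
print(first, second, vocab.get_list(), vocab.is_unk("lyon"))
```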

load_from_ascii(filename)[source]

Read words from a file: one per line.

Parameters

filename – (str)

save(filename)[source]

Save the list of words in a file.

Parameters

filename – (str)

Returns

(bool)

class resources.sppasWordStrain(filename=None)[source]

Bases: resources.dictrepl.sppasDictRepl

Sort of basic lemmatization.

__init__(filename=None)[source]

Create a sppasWordStrain instance.

Parameters

filename – (str) 2- or 3-column file with word/freq/wordstrain

load(filename)[source]

Load word substitutions from a file.

Replace the existing substitutions.

Parameters

filename – (str) 2- or 3-column file with word/freq/replacement