resources package¶
Submodules¶
resources.dictpron module¶
- filename
sppas.src.resources.dictpron.py
- author
Brigitte Bigi
- contact
- summary
Pronunciation dictionary resource model.
- class resources.dictpron.sppasDictPron(dict_filename=None, nodump=False)[source]¶
Bases:
object
Pronunciation dictionary manager.
A pronunciation dictionary contains a list of tokens, each one with a list of possible pronunciations.
sppasDictPron can load the dictionary from an HTK-ASCII file. Each line of such a file looks like the following:
acted [acted] { k t e d
acted(2) [acted] { k t i d
The first column indicates the token, optionally followed by a variant number in parentheses. The second column (with brackets) is ignored; it should contain the token. The remaining columns are the phones, separated by whitespace. sppasDictPron accepts missing variant numbers, empty brackets, or missing brackets.
>>> d = sppasDictPron('eng.dict')
>>> d.add_pron('acted', '{ k t e')
>>> d.add_pron('acted', '{ k t i')
Then, the phonetization of a token can be accessed with get_pron() method:
>>> print(d.get_pron('acted'))
{-k-t-e-d|{-k-t-i-d|{-k-t-e|{-k-t-i
The following convention is adopted to represent the pronunciation variants:
‘-’ separates the phones (X-SAMPA standard)
‘|’ separates the variants
Notice that tokens in the dict are case-insensitive.
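A minimal usage sketch in the same doctest style, building the dictionary in memory rather than from a file. The import path follows the filename field above and may differ depending on how SPPAS is installed:
>>> from sppas.src.resources.dictpron import sppasDictPron
>>> d = sppasDictPron()                  # start from an empty dictionary
>>> d.add_pron('acted', '{ k t e d')     # phones separated by whitespace
>>> d.add_pron('acted', '{ k t i d')     # a second variant of the same token
>>> print(d.get_pron('acted'))           # phones joined by '-', variants by '|'
{-k-t-e-d|{-k-t-i-d
>>> d.save_as_ascii('acted.dict')        # write back in HTK-ASCII format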
- __init__(dict_filename=None, nodump=False)[source]¶
Create a sppasDictPron instance.
A dump file is a binary version of the dictionary. Its size is greater than the original ASCII dictionary but the time to load it is divided by two or three.
- Parameters
dict_filename – (str) Name of the pronunciation dictionary file
nodump – (bool) Disable the creation of a dump file
- add_pron(token, pron)[source]¶
Add a token/pron to the dict.
- Parameters
token – (str) Unicode string of the token to add
pron – (str) A pronunciation in which the phonemes are separated by whitespace
- static format_token(entry)[source]¶
Remove CR/LF, tabs, multiple spaces, etc., and convert to lower case.
- Parameters
entry – (str) a token
- Returns
formatted token
- get(entry, substitution='dummy')[source]¶
Return the pronunciations of an entry in the dictionary.
- Parameters
entry – (str) A token to find in the dictionary
substitution – (str) String to return if the token is missing from the dictionary
- Returns
unicode of the pronunciations or the substitution.
- get_pron(entry)[source]¶
Return the pronunciations of an entry in the dictionary.
- Parameters
entry – (str) A token to find in the dictionary
- Returns
unicode of the pronunciations or the unknown stamp.
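Continuing the sketch above, get() returns the given substitution string for a missing token, whereas get_pron() returns the unknown stamp, whose exact value is defined by SPPAS and is therefore not asserted here:
>>> print(d.get('unknown_token', substitution='NOPRON'))
NOPRON
>>> stamp = d.get_pron('unknown_token')   # the unknown stamp for a missing token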
- static ipa_to_sampa(conversion, ipa_entry)[source]¶
Convert a string in IPA to SAMPA.
- Parameters
conversion – (dict)
ipa_entry – (str)
- is_pron_of(entry, pron)[source]¶
Return True if pron is a pronunciation of entry.
Phonemes of pron are separated by “-“.
- Parameters
entry – (str) A unicode token to find in the dictionary
pron – (str) A unicode pronunciation
- Returns
bool
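For example, with the variants added in the sketch above and the phonemes joined by '-' as the convention requires:
>>> d.is_pron_of('acted', '{-k-t-e-d')
True
>>> d.is_pron_of('acted', '{-k-t-o-d')
False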
- is_unk(entry)[source]¶
Return True if an entry is unknown (not in the dictionary).
- Parameters
entry – (str) A token to find in the dictionary
- Returns
bool
- load(filename)[source]¶
Load a pronunciation dictionary.
- Parameters
filename – (str) Pronunciation dictionary file name
- load_from_ascii(filename)[source]¶
Load a pronunciation dictionary from an HTK-ASCII file.
- Parameters
filename – (str) Pronunciation dictionary file name
- load_from_pls(filename)[source]¶
Load a pronunciation dictionary from a pls file (xml).
xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
- Parameters
filename – (str) Pronunciation dictionary file name
- map_phones(map_table)[source]¶
Create a new dictionary by changing the phoneme strings.
Perform changes depending on a mapping table.
- Parameters
map_table – (Mapping) A mapping table
- Returns
a sppasDictPron instance with mapped phones
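A hedged sketch of phone mapping, using a sppasMapping table (documented in the resources.mapping module below) filled in memory; the phone correspondences here are purely illustrative:
>>> from sppas.src.resources.mapping import sppasMapping
>>> table = sppasMapping()
>>> table.add('{', 'a')                  # illustrative phone correspondence
>>> table.add('i', 'I')
>>> mapped = d.map_phones(table)         # returns a new sppasDictPron instance
>>> new_pron = mapped.get_pron('acted')  # variants now use the mapped phone strings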
- save_as_ascii(filename, with_variant_nb=True, with_filled_brackets=True)[source]¶
Save the pronunciation dictionary in HTK-ASCII format.
- Parameters
filename – (str) Dictionary file name
with_variant_nb – (bool) Write the variant number or not
with_filled_brackets – (bool) Whether to fill the brackets with the token
resources.dictrepl module¶
- filename
sppas.src.resources.dictrepl.py
- author
Brigitte Bigi
- contact
- summary
Substitution table.
- class resources.dictrepl.sppasDictRepl(dict_filename=None, nodump=False)[source]¶
Bases:
object
A dictionary to manage automated replacements.
A dictionary with specific features for language resources. The main feature is that values are “accumulated”.
>>> d = sppasDictRepl()
>>> d.add("key", "v1")
>>> d.add("key", "v2")
>>> d.get("key")
v1|v2
>>> d.is_value("v1")
True
>>> d.is_value("v1|v2")
False
- REPLACE_SEPARATOR = '|'¶
- __init__(dict_filename=None, nodump=False)[source]¶
Create a sppasDictRepl instance.
- Parameters
dict_filename – (str) The dictionary file name (2 columns)
nodump – (bool) Disable the creation of a dump file
A dump file is a binary version of the dictionary. Its size is greater than the original ASCII dictionary but the time to load it is divided by two or three.
- add(token, repl)[source]¶
Add a new key,value into the dict.
Add as a new pair or append the value to the existing one with a “|” used as separator.
- Parameters
token – (str) string of the token to add
repl – (str) the replacement token
Both token and repl are converted to unicode (if needed) and stripped.
- static format_token(entry)[source]¶
Remove CR/LF, tabs, multiple spaces, etc., and convert to lower case.
- Parameters
entry – (str) a token
- Returns
formatted token
- get(entry, substitution='')[source]¶
Return the value of a key of the dictionary or substitution.
- Parameters
entry – (str) A token to find in the dictionary
substitution – (str) String to return if the token is missing from the dictionary
- Returns
unicode of the replacement or the substitution.
- is_key(entry)[source]¶
Return True if entry is exactly a key in the dictionary.
- Parameters
entry – (str) Unicode string.
- is_unk(entry)[source]¶
Return True if entry is not a key in the dictionary.
- Parameters
entry – (str) Unicode string.
- is_value(entry)[source]¶
Return True if entry is a value in the dictionary.
- Parameters
entry – (str) Unicode string.
- is_value_of(key, entry)[source]¶
Return True if entry is a value of a given key in the dictionary.
- Parameters
key – (str) Unicode string.
entry – (str) Unicode string.
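A short sketch of the lookup predicates, built on the same kind of table as the doctest above (import path as assumed earlier):
>>> from sppas.src.resources.dictrepl import sppasDictRepl
>>> r = sppasDictRepl()
>>> r.add('key', 'v1')
>>> r.add('key', 'v2')
>>> r.is_key('key')
True
>>> r.is_value('v2')
True
>>> r.is_value_of('key', 'v1')
True
>>> r.is_unk('v1')            # 'v1' is a value, not a key
True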
- load_from_ascii(filename)[source]¶
Load a replacement dictionary from an ascii file.
- Parameters
filename – (str) Replacement dictionary file name
- pop(entry)[source]¶
Remove an entry, as key.
- Parameters
entry – (str) unicode string of the entry to remove
- remove(entry)[source]¶
Remove an entry, as key or value.
- Parameters
entry – (str) unicode string of the entry to remove
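The difference between pop() and remove(), as the docstrings above describe it; a hedged sketch continuing the table built just before:
>>> r.pop('key')                    # remove the entry by its key
>>> r2 = sppasDictRepl()
>>> r2.add('color', 'colour')
>>> r2.remove('colour')             # 'colour' is a value: the matching entry is removed too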
resources.dumpfile module¶
- filename
sppas.src.resources.dumpfile.py
- author
Brigitte Bigi
- contact
- summary
Dump file for resource models.
- class resources.dumpfile.sppasDumpFile(filename, dump_extension='')[source]¶
Bases:
object
Class to manage dump files.
A dump file is a binary version of an ASCII file. Its size is greater than the original ASCII one but the time to load it is divided by two or three.
- DUMP_FILENAME_EXT = '.dump'¶
- __init__(filename, dump_extension='')[source]¶
Create a sppasDumpFile instance.
- Parameters
filename – (str) Name of the ASCII file.
dump_extension – (str) Extension of the dump file.
- get_dump_filename()[source]¶
Return the file name of the dump version of filename.
- Returns
name of the dump file
- save_as_dump(data)[source]¶
Save the data as a dumped file.
- Parameters
data – The data to save
- Returns
(bool)
- set_dump_extension(extension='')[source]¶
Fix the extension of the dump file.
Set to the default extension if the given extension is an empty string.
- Parameters
extension – (str) Extension of the dump file (starting with or without the dot).
- Raises
DumpExtensionError if extension of the dump file is the same as the ASCII file.
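A minimal sketch of the dump-file helper; the file names and data are illustrative, and the exact dump file name depends on the extension in use:
>>> from sppas.src.resources.dumpfile import sppasDumpFile
>>> dp = sppasDumpFile('eng.dict')           # the ASCII file the dump will shadow
>>> dump_name = dp.get_dump_filename()       # name of the corresponding dump file
>>> dp.set_dump_extension('.bin')            # override the default '.dump' extension
>>> ok = dp.save_as_dump({'acted': '{ k t e d'})   # returns a bool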
resources.mapping module¶
- filename
sppas.src.resources.mapping.py
- author
Brigitte Bigi
- contact
- summary
Mapping table.
- class resources.mapping.sppasMapping(dict_name=None)[source]¶
Bases:
resources.dictrepl.sppasDictRepl
Class to manage mapping tables.
A mapping is an extended replacement dictionary. sppasMapping is used for the management of a mapping table of any set of strings.
- DEFAULT_SEP = (';', ',', '\n', ' ', '.', '|', '+', '-')¶
- __init__(dict_name=None)[source]¶
Create a new sppasMapping instance.
- Parameters
dict_name – (str) file name with the mapping data (2 columns)
- map(mstr, delimiters=(';', ',', '\n', ' ', '.', '|', '+', '-'), separator='')[source]¶
Run the Mapping process on an input string.
- Parameters
mstr – input string to map
delimiters – (list) List of character delimiters. Default is: (';', ',', '\n', ' ', '.', '|', '+', '-')
separator – (char) Character used to separate parts of the mapped result (when the longest matching algorithm was used to map a string)
- Returns
a string
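A sketch of mapping a delimited string; the table content is illustrative, and the handling of unmapped parts depends on the keep_miss option (see set_keep_miss below):
>>> from sppas.src.resources.mapping import sppasMapping
>>> m = sppasMapping()
>>> m.add('a', 'A')                  # add() is inherited from sppasDictRepl
>>> m.add('b', 'B')
>>> print(m.map_entry('a'))          # map a single key or value
A
>>> mapped = m.map('a-b;a')          # each part found between delimiters is mapped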
- map_entry(entry)[source]¶
Map an entry (a key or a value).
- Parameters
entry – (str) input string to map
- Returns
mapped entry is a string
- set_keep_miss(keep_miss)[source]¶
Fix the keep_miss option.
- Parameters
keep_miss – (bool) If keep_miss is set to True, each missing entry is kept unchanged; otherwise, each missing entry is replaced by a specific symbol.
resources.patterns module¶
- filename
sppas.src.resources.patterns.py
- author
Brigitte Bigi
- contact
- summary
Pattern matching.
- class resources.patterns.sppasPatterns[source]¶
Bases:
object
Pattern matching.
Pattern matching aims at checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact.
Several pattern matching algorithms are implemented in this class. They allow finding a hypothesis pattern in a reference.
- MAX_GAP = 4¶
- MAX_NGRAM = 8¶
- dp_matching(ref, hyp)[source]¶
Dynamic Programming alignment of ref and hyp.
The DP alignment algorithm performs a global minimization of a Levenshtein distance function which weights the cost of correct words, insertions, deletions and substitutions as 0, 3, 3 and 4 respectively.
- See:
- TIME WARPS, STRING EDITS, AND MACROMOLECULES: THE THEORY AND PRACTICE OF SEQUENCE COMPARISON, by Sankoff and Kruskal, ISBN 0-201-07809-0
- ngram_alignments(ref, hyp)[source]¶
n-gram alignment of ref and hyp.
The algorithm is based on finding matching n-grams within the range of a given gap. For 1-grams, only hypothesis items with a high confidence score are kept. A search gap has to be fixed. An interstice value ensures that the gap between an item in the reference and the corresponding item in the hypothesis does not grow too large.
- Parameters
ref – (list of tokens) List of references
hyp – (list of tuples) List of hypothesis with their scores
The scores are expected to be in the range [0;1].
- Returns
List of alignment indexes as tuples (i_ref, i_hyp)
Example:
ref: w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12
hyp: w0 w1 w2 wX w3 w5 w6 wX w9
Returned matches:
if n=3: [ (0,0), (1,1), (2,2) ]
if n=2: [(0, 0), (1, 1), (2, 2), (5, 5), (6, 6)]
if n=1, it depends on the scores in hyp and the value of the gap.
- ngram_matches(ref, hyp)[source]¶
n-gram matches between ref and hyp.
Search for common n-gram sequences of hyp in ref. The scores are expected to be in the range [0;1].
- Parameters
ref – (list of tokens) List of references
hyp – (list of tuples) List of hypothesis with their scores
- Returns
List of matching indexes as tuples (i_ref, i_hyp)
- set_gap(g)[source]¶
Fix the value of the gap.
- Parameters
g – (int) Value of the gap (0<g<MAX_GAP)
- Raises
GapRangeError
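A sketch of the n-gram matching methods on toy data; the hypothesis scores are illustrative, only the methods documented above are called, and the n-gram size is left at its default since no method for setting it is documented here:
>>> from sppas.src.resources.patterns import sppasPatterns
>>> ref = ['w0', 'w1', 'w2', 'w3', 'w4']
>>> hyp = [('w0', 0.9), ('w1', 0.8), ('w2', 0.95), ('wX', 0.2), ('w4', 0.7)]
>>> p = sppasPatterns()
>>> p.set_gap(2)                          # must satisfy 0 < g < MAX_GAP
>>> matches = p.ngram_matches(ref, hyp)   # list of (i_ref, i_hyp) tuples
>>> align = p.ngram_alignments(ref, hyp)  # alignment indexes, cf. the example above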
resources.resourcesexc module¶
- filename
sppas.src.resources.resourcesexc.py
- author
Brigitte Bigi
- contact
- summary
Exceptions of the resources package.
- exception resources.resourcesexc.DumpExtensionError(extension)[source]¶
Bases:
ValueError
:ERROR 5030:.
The dump file can’t have the same extension as the ASCII file ({extension}).
- exception resources.resourcesexc.FileFormatError(line_number, filename)[source]¶
Bases:
ValueError
:ERROR 5015:.
Read file failed at line number {number}: {string}.
- exception resources.resourcesexc.FileIOError(filename)[source]¶
Bases:
Exception
:ERROR 5010:.
Error while trying to open and read the file: {name}.
- exception resources.resourcesexc.FileUnicodeError(filename)[source]¶
Bases:
UnicodeDecodeError
:ERROR 5005:.
Encoding error while trying to read the file: {name}.
- exception resources.resourcesexc.GapRangeError(maxi, value)[source]¶
Bases:
ValueError
:ERROR 5022:.
The gap value of pattern matching must range [0;{maximum}]. Got {observed}.
- exception resources.resourcesexc.NgramRangeError(maxi, value)[source]¶
Bases:
ValueError
:ERROR 5020:.
The n value of n-grams pattern matching must range [1;{maximum}]. Got {observed}.
resources.unigram module¶
- filename
sppas.src.resources.unigram.py
- author
Brigitte Bigi
- contact
- summary
Data structure for a set of token/count.
- class resources.unigram.sppasUnigram(filename=None, nodump=True)[source]¶
Bases:
object
Class to represent a simple unigram: a set of token/count.
A unigram is commonly a data structure with tokens, their probabilities, and a back-off value; it is a statistical language model. This class is a simplified version with only tokens and their occurrence counts.
Notice that tokens are case-sensitive.
- __init__(filename=None, nodump=True)[source]¶
Create a sppasUnigram instance.
- Parameters
filename – (str) Name of the file with words and counts (2 columns)
nodump – (bool) Disable the creation of a dump file
- add(entry, value=1)[source]¶
Add or increment a token in the unigram.
- Parameters
entry – (str) String of the token to add
value – (int) Value to increment the count
- Raises
PositiveValueError
- get_count(token)[source]¶
Return the count of a token.
- Parameters
token – (str) The string of the token
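A short sketch of counting tokens; remember that tokens are case-sensitive:
>>> from sppas.src.resources.unigram import sppasUnigram
>>> u = sppasUnigram()
>>> u.add('hello')             # default increment of 1
>>> u.add('hello', 3)          # increment an existing token by 3
>>> u.add('world')
>>> u.get_count('hello')
4
>>> u.get_count('world')
1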
resources.vocab module¶
- filename
sppas.src.resources.vocab.py
- author
Brigitte Bigi
- contact
- summary
A vocabulary resource model.
- class resources.vocab.sppasVocabulary(filename=None, nodump=False, case_sensitive=False)[source]¶
Bases:
object
Class to represent a list of words.
- __init__(filename=None, nodump=False, case_sensitive=False)[source]¶
Create a sppasVocabulary instance.
- Parameters
filename – (str) Name of the file with the list of words.
nodump – (bool) Disable the creation of a dump file.
case_sensitive – (bool) Whether the list of words is case-sensitive
- add(entry)[source]¶
Add an entry to the list, unless the entry is already in it.
- Parameters
entry – (str) The entry to add in the word list
- Returns
(bool)
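A short sketch; whether add() reports an already-present entry through its boolean return value is an assumption here, so expectations are kept in comments rather than asserted:
>>> from sppas.src.resources.vocab import sppasVocabulary
>>> voc = sppasVocabulary()          # case-insensitive by default
>>> added = voc.add('hello')         # (bool) whether the entry was actually added
>>> added_again = voc.add('hello')   # already inside: no duplicate is created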
resources.wordstrain module¶
- filename
sppas.src.resources.wordstrain.py
- author
Brigitte Bigi
- contact
- summary
A very simplified but multilingual lemmatizer.
- class resources.wordstrain.sppasWordStrain(filename=None)[source]¶
Bases:
resources.dictrepl.sppasDictRepl
A basic form of lemmatization.
Module contents¶
- filename
sppas.src.resources.__init__.py
- author
Brigitte Bigi
- contact
- summary
Resource models of SPPAS.
resources: access and manage linguistic resources¶
This package includes classes to manage the data of linguistic types like lexicons, pronunciation dictionaries, patterns, etc.
Requires the following other packages:
config
The classes documented in the submodules above are re-exported at the package level:
- resources.sppasDictPron: documented in the resources.dictpron module
- resources.sppasDictRepl: documented in the resources.dictrepl module
- resources.sppasDumpFile: documented in the resources.dumpfile module
- resources.sppasMapping: documented in the resources.mapping module
- resources.sppasPatterns: documented in the resources.patterns module
- resources.sppasUnigram: documented in the resources.unigram module
- resources.sppasVocabulary: documented in the resources.vocab module
- resources.sppasWordStrain: documented in the resources.wordstrain module