annotations.TextNorm package¶
Subpackages¶
- annotations.TextNorm.num2text package
- Submodules
- annotations.TextNorm.num2text.construct module
- annotations.TextNorm.num2text.num_asian_lang module
- annotations.TextNorm.num2text.num_base module
- annotations.TextNorm.num2text.num_cmn module
- annotations.TextNorm.num2text.num_europ_lang module
- annotations.TextNorm.num2text.num_fra module
- annotations.TextNorm.num2text.num_ita module
- annotations.TextNorm.num2text.num_jpn module
- annotations.TextNorm.num2text.num_khm module
- annotations.TextNorm.num2text.num_pol module
- annotations.TextNorm.num2text.num_spa module
- annotations.TextNorm.num2text.num_und module
- annotations.TextNorm.num2text.num_vie module
- annotations.TextNorm.num2text.por_num module
- Module contents
Submodules¶
annotations.TextNorm.language module¶
- filename
sppas.src.annotations.TextNorm.language.py
- author
Brigitte Bigi
- contact
- summary
Language name definition.
- class annotations.TextNorm.language.sppasLangISO[source]¶
Bases:
object
Language name definition.
todo: parse an iso639-3 JSON file to load all language names.
- lang_list = ['cmn', 'jpn', 'yue', 'zho', 'cdo', 'cjy', 'cmo', 'cpx', 'czh', 'czo', 'czt', 'gan', 'hak', 'hsn', 'ltc', 'lzh', 'mnp', 'och', 'wuu', 'ben']¶
- static without_whitespace(lang)[source]¶
Return True if ‘lang’ does not use whitespace.
Mandarin Chinese and Japanese return True; English and French return False.
- Parameters
lang – (str) iso639-3 language code or a string starting with such code, like “yue” or “yue-chars” for example.
- Returns
(bool)
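A minimal usage sketch, assuming the import path matches the filename field above:
>>> from sppas.src.annotations.TextNorm.language import sppasLangISO
>>> sppasLangISO.without_whitespace("cmn")
True
>>> sppasLangISO.without_whitespace("yue-chars")
True
>>> sppasLangISO.without_whitespace("fra")
False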
annotations.TextNorm.normalize module¶
- filename
sppas.src.annotations.TextNorm.normalize.py
- author
Brigitte Bigi
- contact
- summary
Multilingual Text Normalization of an utterance.
- class annotations.TextNorm.normalize.DictReplUTF8[source]¶
Bases:
sppas.src.resources.dictrepl.sppasDictRepl
Replacement dictionary of UTF-8 characters that caused problems.
This is a hack to match our dictionaries…
TODO: This class should read an external replacement file…
- __init__()[source]¶
Create a sppasDictRepl instance.
- Parameters
dict_filename – (str) The dictionary file name (2 columns)
nodump – (bool) Disable the creation of a dump file
A dump file is a binary version of the dictionary. It is larger than the original ASCII dictionary, but it loads two to three times faster.
- class annotations.TextNorm.normalize.TextNormalizer(vocab=None, lang='und')[source]¶
Bases:
object
Multilingual text normalization.
- __init__(vocab=None, lang='und')[source]¶
Create a TextNormalizer instance.
- Parameters
vocab – (sppasVocabulary)
lang – the language code in iso639-3.
- normalize(entry, actions=[])[source]¶
Normalize an utterance.
- Parameters
entry – (str) the string to normalize
actions – (list) the modules/options to enable:
“std”: generate the standard orthography instead of the faked one
“replace”: use a replacement dictionary
“tokenize”: tokenize the entry
“numbers”: convert numbers to their written form
“lower”: change case of characters to lower
“punct”: remove punctuation
- Returns
(str) the list of normalized tokens
Important: An empty actions list or a list containing only “std” means to enable all actions.
- remove(utt, wlist)[source]¶
Remove tokens of an utterance if they occur in a given word list.
Only used to remove punctuation.
- Parameters
utt – (list)
wlist – (WordList)
- replace(utt)[source]¶
Examine tokens and perform replacements.
A dictionary of symbols defines the replacements to apply.
- Parameters
utt – (list) the utterance
- Returns
A list of strings
- set_delim(delim)[source]¶
Set the delimiter used to separate tokens.
- Parameters
delim – (str) a unicode character.
- set_lang(lang)[source]¶
Set the language.
- Parameters
lang – (str) the language code in iso639-3 (fra, eng, vie…).
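A minimal usage sketch, assuming the import path matches the filename field above; with no vocabulary, lexicon-based tokenization is limited:
>>> from sppas.src.annotations.TextNorm.normalize import TextNormalizer
>>> normalizer = TextNormalizer(lang="fra")
>>> normalizer.set_delim(" ")
>>> tokens = normalizer.normalize("Il fait beau.", actions=["tokenize", "lower", "punct"])
An empty actions list would enable all modules, as noted in normalize() above.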
annotations.TextNorm.num2letter module¶
- filename
sppas.src.annotations.TextNorm.num2letter.py
- author
Brigitte Bigi
- contact
- summary
Numerical to string.
Module to convert numbers to their written form in the multilingual text normalization system. Num2Letter conversion is language-specific.
- class annotations.TextNorm.num2letter.sppasNum(lang='und')[source]¶
Bases:
object
Numerical conversion using a multilingual algorithm.
The language names used in this class are based on iso639-3.
>>> num = sppasNum('fra')
>>> num.convert("3")
trois
>>> num.convert("03")
zéro-trois
>>> sppasNum('3.0')
ValueError
Notice that this class should be fully re-implemented. It should use an external resource file to map numbers to letters for each language:
0 zéro
1 un
…
10 dix
100 cent
1000 mille
1000000 million
1000000000 milliard
- LANGUAGES = ['und', 'yue', 'cmn', 'fra', 'ita', 'eng', 'spa', 'khm', 'vie', 'jpn', 'pol', 'por', 'pcm']¶
- ZERO = {'cmn': '零', 'eng': 'zero', 'fra': 'zéro', 'ita': 'zero', 'jpn': 'ゼロ', 'khm': 'ស្សូន ', 'pol': 'zerowej', 'por': 'zero', 'spa': 'cero', 'und': '0', 'vie': 'không', 'yue': '零'}¶
- __init__(lang='und')[source]¶
Create a new sppasNum instance.
- Parameters
lang – (str) the language code in ISO639-3 (fra, eng, spa, khm, ita, …). If lang is set to “und” (undetermined), no conversion is performed.
- convert(number)[source]¶
Convert a number to a string. Example: 23 => twenty-three
- Parameters
number – (int) A numerical representation
- Returns
string corresponding to the given number
- Raises
ValueError
annotations.TextNorm.orthotranscription module¶
- filename
sppas.src.annotations.TextNorm.orthotranscription.py
- author
Brigitte Bigi
- contact
- summary
Manage an enriched orthographic transcription.
- class annotations.TextNorm.orthotranscription.sppasOrthoTranscription[source]¶
Bases:
object
Manager of an orthographic transcription.
This is a totally language-independent class. It supports the orthographic transcription convention defined in the SPPAS software tool.
From the manual Enriched Orthographic Transcription, two derived orthographic transcriptions are generated automatically by the tokenizer: the “standard” transcription (the list of orthographic tokens) and the “faked” spelling, a specific transcription whose tokens are used by the phonetization system.
The following illustrates an utterance text normalization in French:
Transcription:
j’ai on a j’ai p- (en)fin j’ai trouvé l(e) meilleur moyen c’était d(e) [loger,locher] chez des amis (English translation is: I’ve we’ve I’ve - well I found the best way was to live in friends’ apartment)
Result of the standard tokens:
j’ ai on a j’ ai p- enfin j’ ai trouvé le meilleur moyen c’ était de loger chez des amis
Result of the faked tokens:
j’ ai on a j’ ai p- fin j’ ai trouvé l meilleur moyen c’ était d loche chez des amis
- static clean_toe(entry)[source]¶
Clean Enriched Orthographic Transcription.
The convention includes information that must be removed.
- Parameters
entry – (str)
- Returns
(str)
- static toe_spelling(entry, std=False)[source]¶
Create a specific spelling from an Enriched Orthographic Transcription.
- Parameters
entry – (str) the EOT string
std – (bool) Standard spelling expected instead of the Faked one.
- Returns
(str)
DevNote: Python’s regular expression engine supports Unicode. It can apply the same pattern to either 8-bit (encoded) or Unicode strings. To create a regular expression pattern that uses Unicode character classes for \w (and \s, and \b), use the “(?u)” flag prefix, or the re.UNICODE flag.
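A sketch of the two static methods on a fragment of the French example above; the comments give the expected behavior following the standard/faked distinction, not the output of a verified run:
>>> from sppas.src.annotations.TextNorm.orthotranscription import sppasOrthoTranscription
>>> entry = "j'ai trouvé l(e) meilleur moyen d(e) [loger,locher]"
>>> sppasOrthoTranscription.toe_spelling(entry, std=True)   # standard: le, de, loger
>>> sppasOrthoTranscription.toe_spelling(entry, std=False)  # faked: l, d, loche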
annotations.TextNorm.splitter module¶
- filename
sppas.src.annotations.TextNorm.splitter.py
- author
Brigitte Bigi
- contact
- summary
Split step of the normalization automatic annotation.
- class annotations.TextNorm.splitter.sppasSimpleSplitter(lang, dict_replace=None, speech=True)[source]¶
Bases:
object
Utterance splitter.
Module to split a string for the multilingual text normalization system. Split an utterance into tokens using whitespace or characters.
Should be extended to properly split telephone numbers or dates, etc. (for written texts).
- __init__(lang, dict_replace=None, speech=True)[source]¶
Create a sppasSimpleSplitter instance.
- Parameters
lang – the language code in iso639-3.
dict_replace – Replacement dictionary
speech – (bool) True to split transcribed speech, False for written text
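A constructor sketch with the documented parameters, assuming the import path matches the filename field above:
>>> from sppas.src.annotations.TextNorm.splitter import sppasSimpleSplitter
>>> # split transcribed French speech, without a replacement dictionary
>>> splitter = sppasSimpleSplitter("fra", dict_replace=None, speech=True)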
annotations.TextNorm.sppastextnorm module¶
- filename
sppas.src.annotations.TextNorm.sppastextnorm.py
- author
Brigitte Bigi
- contact
- summary
SPPAS integration of Text Normalization automatic annotation.
- class annotations.TextNorm.sppastextnorm.sppasTextNorm(log=None)[source]¶
Bases:
annotations.baseannot.sppasBaseAnnotation
Text normalization automatic annotation.
- __init__(log=None)[source]¶
Create a sppasTextNorm instance without any linguistic resources.
- Parameters
log – (sppasLog) Human-readable logs.
- convert(tier)[source]¶
Text normalization of all labels of a tier.
- Parameters
tier – (sppasTier) the orthographic transcription (standard or EOT)
- Returns
A tuple with 3 tiers named “Tokens-Faked”, “Tokens-Std”, and “Tokens-Custom”.
- fix_options(options)[source]¶
Fix all options. Available options are:
faked
std
custom
- Parameters
options – (sppasOption)
- get_inputs(input_files)[source]¶
Return the tier with aligned tokens.
- Parameters
input_files – (list)
- Raise
NoTierInputError
- Returns
(sppasTier)
- load_resources(vocab_filename, lang='und', **kwargs)[source]¶
Fix the list of words of a given language.
It allows better tokenization and enables language-dependent modules such as num2letter.
- Parameters
vocab_filename – (str) File with the list of words of the language
lang – (str) the language code
- run(input_files, output=None)[source]¶
Run the automatic annotation process on an input.
- Parameters
input_files – (list of str) orthographic transcription
output – (str) the output file name
- Returns
(sppasTranscription)
- set_custom(value)[source]¶
Fix the custom option.
- Parameters
value – (bool) Create a customized tokenization
- set_faked(value)[source]¶
Fix the faked option.
- Parameters
value – (bool) Create a faked tokenization
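A hedged end-to-end sketch of the documented workflow; the file names are hypothetical placeholders:
>>> from sppas.src.annotations.TextNorm.sppastextnorm import sppasTextNorm
>>> ann = sppasTextNorm()
>>> ann.load_resources("fra.vocab", lang="fra")   # hypothetical vocabulary file
>>> ann.set_faked(True)
>>> ann.set_custom(False)
>>> trs = ann.run(["file-transcription.xra"], output="file-token.xra")  # hypothetical files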
annotations.TextNorm.tokenize module¶
- filename
sppas.src.annotations.TextNorm.tokenize.py
- author
Brigitte Bigi
- contact
- summary
Tokenization module for the multilingual text norm system.
- class annotations.TextNorm.tokenize.sppasTokenSegmenter(vocab=None)[source]¶
Bases:
object
Create words from tokens on the basis of a lexicon.
This is a totally language-independent method, based on a longest-matching algorithm to aggregate tokens into words. Words of a lexicon are found and:
1/ unbound (or not) if they contain a separator character:
rock’n’roll -> rock’n’roll
I’m -> I ‘m
it’s -> it ‘s
2/ bound using a separator character, for example ‘_’:
parce que -> parce_que
rock’n roll -> rock’n_roll
- SEPARATOR = '_'¶
- STICK_MAX = 7¶
- __init__(vocab=None)[source]¶
Create a new sppasTokenSegmenter instance.
- Parameters
vocab – (Vocabulary)
- bind(utt)[source]¶
Bind tokens of an utterance using a specific character.
- Parameters
utt – (list) List of tokens of an utterance (a transcription, a sentence, …)
- Returns
A list of strings
- set_aggregate_max(value=7)[source]¶
Fix the maximum number of words to stick.
This is a language-dependent value. For French, it is 5, because of the multi-word entry “au fur et à mesure”. It can be set higher to stick phrases instead of words.
- Parameters
value – (int) Maximum number of tokens to aggregate/stick.
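A minimal sketch of bind(), assuming vocab is a loaded vocabulary that contains the multi-word entry “parce que”; the output shown follows the binding behavior documented above, not a verified run:
>>> from sppas.src.annotations.TextNorm.tokenize import sppasTokenSegmenter
>>> segmenter = sppasTokenSegmenter(vocab)
>>> segmenter.bind(["il", "part", "parce", "que", "il", "pleut"])
['il', 'part', 'parce_que', 'il', 'pleut']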
Module contents¶
- filename
sppas.src.annotations.TextNorm.__init__.py
- author
Brigitte Bigi
- contact
- summary
Text Normalization automatic annotation.
The creation of text corpora requires a sequence of processing steps to constitute, normalize, and then directly exploit them in a given application. This package implements a generic approach to text normalization that can be applied to a multipurpose, multilingual text or transcribed corpus. It consists in splitting the text normalization problem into a set of minor sub-problems that are as language-independent as possible. Porting to a new language then consists of inheriting all language-independent methods and rapidly adapting the remaining language-dependent methods or classes.
For details, read the following reference:
Brigitte Bigi (2011). A Multilingual Text Normalization Approach. 2nd Less-Resourced Languages Workshop, 5th Language & Technology Conference, Poznan (Poland).
- class annotations.TextNorm.TextNormalizer(vocab=None, lang='und')[source]¶
Bases:
object
Multilingual text normalization.
- __init__(vocab=None, lang='und')[source]¶
Create a TextNormalizer instance.
- Parameters
vocab – (sppasVocabulary)
lang – the language code in iso639-3.
- normalize(entry, actions=[])[source]¶
Normalize an utterance.
- Parameters
entry – (str) the string to normalize
actions – (list) the modules/options to enable:
“std”: generate the standard orthography instead of the faked one
“replace”: use a replacement dictionary
“tokenize”: tokenize the entry
“numbers”: convert numbers to their written form
“lower”: change case of characters to lower
“punct”: remove punctuation
- Returns
(str) the list of normalized tokens
Important: An empty actions list or a list containing only “std” means to enable all actions.
- remove(utt, wlist)[source]¶
Remove tokens of an utterance if they occur in a given word list.
Only used to remove punctuation.
- Parameters
utt – (list)
wlist – (WordList)
- replace(utt)[source]¶
Examine tokens and perform replacements.
A dictionary of symbols defines the replacements to apply.
- Parameters
utt – (list) the utterance
- Returns
A list of strings
- set_delim(delim)[source]¶
Set the delimiter used to separate tokens.
- Parameters
delim – (str) a unicode character.
- set_lang(lang)[source]¶
Set the language.
- Parameters
lang – (str) the language code in iso639-3 (fra, eng, vie…).
- class annotations.TextNorm.sppasOrthoTranscription[source]¶
Bases:
object
Manager of an orthographic transcription.
This is a totally language-independent class. It supports the orthographic transcription convention defined in the SPPAS software tool.
From the manual Enriched Orthographic Transcription, two derived orthographic transcriptions are generated automatically by the tokenizer: the “standard” transcription (the list of orthographic tokens) and the “faked” spelling, a specific transcription whose tokens are used by the phonetization system.
The following illustrates an utterance text normalization in French:
Transcription:
j’ai on a j’ai p- (en)fin j’ai trouvé l(e) meilleur moyen c’était d(e) [loger,locher] chez des amis (English translation is: I’ve we’ve I’ve - well I found the best way was to live in friends’ apartment)
Result of the standard tokens:
j’ ai on a j’ ai p- enfin j’ ai trouvé le meilleur moyen c’ était de loger chez des amis
Result of the faked tokens:
j’ ai on a j’ ai p- fin j’ ai trouvé l meilleur moyen c’ était d loche chez des amis
- static clean_toe(entry)[source]¶
Clean Enriched Orthographic Transcription.
The convention includes information that must be removed.
- Parameters
entry – (str)
- Returns
(str)
- static toe_spelling(entry, std=False)[source]¶
Create a specific spelling from an Enriched Orthographic Transcription.
- Parameters
entry – (str) the EOT string
std – (bool) Standard spelling expected instead of the Faked one.
- Returns
(str)
DevNote: Python’s regular expression engine supports Unicode. It can apply the same pattern to either 8-bit (encoded) or Unicode strings. To create a regular expression pattern that uses Unicode character classes for \w (and \s, and \b), use the “(?u)” flag prefix, or the re.UNICODE flag.
- class annotations.TextNorm.sppasSimpleSplitter(lang, dict_replace=None, speech=True)[source]¶
Bases:
object
Utterance splitter.
Module to split a string for the multilingual text normalization system. Split an utterance into tokens using whitespace or characters.
Should be extended to properly split telephone numbers or dates, etc. (for written texts).
- __init__(lang, dict_replace=None, speech=True)[source]¶
Create a sppasSimpleSplitter instance.
- Parameters
lang – the language code in iso639-3.
dict_replace – Replacement dictionary
speech – (bool) True to split transcribed speech, False for written text
- class annotations.TextNorm.sppasTextNorm(log=None)[source]¶
Bases:
annotations.baseannot.sppasBaseAnnotation
Text normalization automatic annotation.
- __init__(log=None)[source]¶
Create a sppasTextNorm instance without any linguistic resources.
- Parameters
log – (sppasLog) Human-readable logs.
- convert(tier)[source]¶
Text normalization of all labels of a tier.
- Parameters
tier – (sppasTier) the orthographic transcription (standard or EOT)
- Returns
A tuple with 3 tiers named “Tokens-Faked”, “Tokens-Std”, and “Tokens-Custom”.
- fix_options(options)[source]¶
Fix all options. Available options are:
faked
std
custom
- Parameters
options – (sppasOption)
- get_inputs(input_files)[source]¶
Return the tier with aligned tokens.
- Parameters
input_files – (list)
- Raise
NoTierInputError
- Returns
(sppasTier)
- load_resources(vocab_filename, lang='und', **kwargs)[source]¶
Fix the list of words of a given language.
It allows better tokenization and enables language-dependent modules such as num2letter.
- Parameters
vocab_filename – (str) File with the list of words of the language
lang – (str) the language code
- run(input_files, output=None)[source]¶
Run the automatic annotation process on an input.
- Parameters
input_files – (list of str) orthographic transcription
output – (str) the output file name
- Returns
(sppasTranscription)
- set_custom(value)[source]¶
Fix the custom option.
- Parameters
value – (bool) Create a customized tokenization
- set_faked(value)[source]¶
Fix the faked option.
- Parameters
value – (bool) Create a faked tokenization
- class annotations.TextNorm.sppasTokenSegmenter(vocab=None)[source]¶
Bases:
object
Create words from tokens on the basis of a lexicon.
This is a totally language-independent method, based on a longest-matching algorithm to aggregate tokens into words. Words of a lexicon are found and:
1/ unbound (or not) if they contain a separator character:
rock’n’roll -> rock’n’roll
I’m -> I ‘m
it’s -> it ‘s
2/ bound using a separator character, for example ‘_’:
parce que -> parce_que
rock’n roll -> rock’n_roll
- SEPARATOR = '_'¶
- STICK_MAX = 7¶
- __init__(vocab=None)[source]¶
Create a new sppasTokenSegmenter instance.
- Parameters
vocab – (Vocabulary)
- bind(utt)[source]¶
Bind tokens of an utterance using a specific character.
- Parameters
utt – (list) List of tokens of an utterance (a transcription, a sentence, …)
- Returns
A list of strings
- set_aggregate_max(value=7)[source]¶
Fix the maximum number of words to stick.
This is a language-dependent value. For French, it is 5, because of the multi-word entry “au fur et à mesure”. It can be set higher to stick phrases instead of words.
- Parameters
value – (int) Maximum number of tokens to aggregate/stick.