annotations.TextNorm package

Submodules

annotations.TextNorm.language module

filename: sppas.src.annotations.TextNorm.language.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: Language name definition.

class annotations.TextNorm.language.sppasLangISO[source]

Bases: object

Language name definition.

TODO: parse an ISO 639-3 JSON file to load all language names.

lang_list = ['cmn', 'jpn', 'yue', 'zho', 'cdo', 'cjy', 'cmo', 'cpx', 'czh', 'czo', 'czt', 'gan', 'hak', 'hsn', 'ltc', 'lzh', 'mnp', 'och', 'wuu', 'ben']
static without_whitespace(lang)[source]

Return True if 'lang' does not use whitespace.

Languages such as Mandarin Chinese or Japanese return True; English or French return False.

Parameters

lang – (str) an iso639-3 language code, or a string starting with such a code, e.g. “yue” or “yue-chars”.

Returns

(bool)
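
A usage sketch (a hypothetical doctest; the results follow lang_list above):

>>> from sppas.src.annotations.TextNorm.language import sppasLangISO
>>> sppasLangISO.without_whitespace("jpn")
True
>>> sppasLangISO.without_whitespace("yue-chars")
True
>>> sppasLangISO.without_whitespace("eng")
False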

annotations.TextNorm.normalize module

filename: sppas.src.annotations.TextNorm.normalize.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: Multilingual Text Normalization of an utterance.

class annotations.TextNorm.normalize.DictReplUTF8[source]

Bases: sppas.src.resources.dictrepl.sppasDictRepl

Replacement dictionary of UTF8 characters that caused problems.

This is a hack to match with our dictionaries…

TODO: This class should read an external replacement file…

__init__()[source]

Create a DictReplUTF8 instance. Note that __init__() takes no arguments; the parameters below describe the parent sppasDictRepl constructor.

Parameters
  • dict_filename – (str) The dictionary file name (2 columns)

  • nodump – (bool) Disable the creation of a dump file

A dump file is a binary version of the dictionary. It is larger than the original ASCII dictionary, but loads two to three times faster.

class annotations.TextNorm.normalize.TextNormalizer(vocab=None, lang='und')[source]

Bases: object

Multilingual text normalization

__init__(vocab=None, lang='und')[source]

Create a TextNormalizer instance.

Parameters
  • vocab – (sppasVocabulary)

  • lang – the language code in iso639-3.

get_vocab_filename()[source]

Return the name of the current vocabulary file.

lower(utt)[source]

Lower a list of strings.

Parameters

utt – (list)

normalize(entry, actions=[])[source]

Normalize an utterance.

Parameters
  • entry – (str) the string to normalize

  • actions

    (list) the modules/options to enable.

    • ”std”: generate the standard orthography instead of the faked one

    • ”replace”: use a replacement dictionary

    • ”tokenize”: tokenize the entry

    • ”numbers”: convert numbers to their written form

    • ”lower”: change case of characters to lower

    • ”punct”: remove punctuation

Returns

(str) the normalized tokens, joined with the delimiter

Important: An empty actions list or a list containing only “std” means to enable all actions.
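
A hypothetical usage sketch (the vocabulary is optional per the constructor; the output shown is indicative, assuming the default whitespace delimiter):

>>> from sppas.src.annotations.TextNorm.normalize import TextNormalizer
>>> normalizer = TextNormalizer(lang="eng")
>>> normalizer.normalize("Hello, World!", actions=["tokenize", "lower", "punct"])
hello world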

numbers(utt)[source]

Convert numbers to their written form.

Parameters

utt – (list)

Returns

(list)

remove(utt, wlist)[source]

Remove tokens from an utterance if they occur in the given word list.

Only used to remove punctuation.

Parameters
  • utt – (list)

  • wlist – (WordList)

replace(utt)[source]

Examine tokens and perform replacements.

A dictionary of symbols defines the replacements to apply.

Parameters

utt – (list) the utterance

Returns

A list of strings

set_delim(delim)[source]

Set the delimiter, used to separate tokens.

Parameters

delim – (str) a unicode character.

set_lang(lang)[source]

Set the language.

Parameters

lang – (str) the language code in iso639-3 (fra, eng, vie…).

set_num(num_dict)[source]

Set the dictionary of numbers.

Parameters

num_dict – (sppasDictRepl)

set_punct(punct)[source]

Set the list of punctuation.

Parameters

punct – (sppasVocabulary)

set_repl(repl)[source]

Set the dictionary of replacements.

Parameters

repl – (sppasDictRepl)

set_vocab(vocab)[source]

Set the lexicon.

Parameters

vocab – (sppasVocabulary).

tokenize(utt)[source]

Tokenization is text segmentation, i.e., segmenting the utterance into tokens.

Parameters

utt – (list)

Returns

(list)

static variants(utt)[source]

Convert strings that are variants in the utterance.

Parameters

utt – (list)

annotations.TextNorm.num2letter module

filename: sppas.src.annotations.TextNorm.num2letter.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: Conversion of numbers to their written form.

Module to convert numbers to their written form for the multilingual text normalization system. Num2Letter conversion is language-specific.

class annotations.TextNorm.num2letter.sppasNum(lang='und')[source]

Bases: object

Numerical conversion using a multilingual algorithm.

The language names used in this class are based on iso639-3.

>>> num = sppasNum('fra')
>>> num.convert("3")
trois
>>> num.convert("03")
zéro-trois
>>> num.convert("3.0")
ValueError

Note that this class should be fully re-implemented. It should use an external resource file mapping numbers to letters for each language, e.g. for French:

0 zéro
1 un
…
10 dix
100 cent
1000 mille
1000000 million
1000000000 milliard

LANGUAGES = ['und', 'yue', 'cmn', 'fra', 'ita', 'eng', 'spa', 'khm', 'vie', 'jpn', 'pol', 'por', 'pcm']
ZERO = {'cmn': '零', 'eng': 'zero', 'fra': 'zéro', 'ita': 'zero', 'jpn': 'ゼロ', 'khm': 'ស្សូន  ', 'pol': 'zerowej', 'por': 'zero', 'spa': 'cero', 'und': '0', 'vie': 'không', 'yue': '零'}
__init__(lang='und')[source]

Create a new sppasNum instance.

Parameters

lang – (str) the language code in ISO639-3 (fra, eng, spa, khm, ita, …). If lang is set to “und” (undetermined), no conversion is performed.

centaine(number)[source]

Convert a number from 100 to 999.

Parameters

number – (int)

convert(number)[source]

Convert a number to a string. Example: 23 => twenty-three

Parameters

number – (int) A numerical representation

Returns

string corresponding to the given number

Raises

ValueError

dizaine(number)[source]

Convert a number from 10 to 99.

Parameters

number – (int)

get_lang()[source]

Return the current language code.

milliers(number)[source]

Convert a number from 1000 to 9999.

Parameters

number – (int)

millions(number)[source]

Convert a number from 1000 to 1000000.

set_lang(lang)[source]

Set the language.

Parameters

lang – (str) the language code in ISO639-3.

unite(number)[source]

Convert a number from 0 to 9.

Parameters

number – (int) the number to convert to letters.

zero()[source]

Convert the number zero to its written form.

annotations.TextNorm.orthotranscription module

filename: sppas.src.annotations.TextNorm.orthotranscription.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: Manage an enriched orthographic transcription.

class annotations.TextNorm.orthotranscription.sppasOrthoTranscription[source]

Bases: object

Manager of an orthographic transcription.

This is a totally language-independent class. It supports the orthographic transcription convention defined in the SPPAS software tool.

From the manual Enriched Orthographic Transcription (EOT), two derived orthographic transcriptions are generated automatically by the tokenizer: the “standard” transcription (the list of orthographic tokens), and the “faked” spelling, a specific transcription whose tokens are used by the phonetization system.

The following illustrates an utterance text normalization in French:

  • Transcription:

j’ai on a j’ai p- (en)fin j’ai trouvé l(e) meilleur moyen c’était d(e) [loger,locher] chez des amis (English translation: I’ve we’ve I’ve - well I found the best way was to live in friends’ apartment)

  • Result of the standard tokens:

j’ ai on a j’ ai p- enfin j’ ai trouvé le meilleur moyen c’ était de loger chez des amis

  • Result of the faked tokens:

j’ ai on a j’ ai p- fin j’ ai trouvé l meilleur moyen c’ était d loche chez des amis

__init__()[source]
static clean_toe(entry)[source]

Clean Enriched Orthographic Transcription.

The convention includes information that must be removed.

Parameters

entry – (str)

Returns

(str)

static toe_spelling(entry, std=False)[source]

Create a specific spelling from an Enriched Orthographic Transcription.

Parameters
  • entry – (str) the EOT string

  • std – (bool) Standard spelling expected instead of the Faked one.

Returns

(str)

DevNote: Python’s regular expression engine supports Unicode. It can apply the same pattern to either 8-bit (encoded) or Unicode strings. To create a regular expression pattern that uses Unicode character classes for \w (and \s, and \b), use the “(?u)” flag prefix, or the re.UNICODE flag.
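
A hypothetical sketch of the two static methods on a fragment of the French example above (outputs are not asserted; they depend on the EOT convention rules):

>>> from sppas.src.annotations.TextNorm.orthotranscription import sppasOrthoTranscription
>>> entry = "j'ai p- (en)fin trouvé l(e) meilleur moyen"
>>> cleaned = sppasOrthoTranscription.clean_toe(entry)
>>> std = sppasOrthoTranscription.toe_spelling(cleaned, std=True)    # standard spelling
>>> faked = sppasOrthoTranscription.toe_spelling(cleaned, std=False) # faked spelling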

annotations.TextNorm.splitter module

filename: sppas.src.annotations.TextNorm.splitter.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: Split step of the normalization automatic annotation.

class annotations.TextNorm.splitter.sppasSimpleSplitter(lang, dict_replace=None, speech=True)[source]

Bases: object

Utterance splitter.

Module to split a string for the multilingual text normalization system. It splits an utterance into tokens using whitespace, or character by character for character-based languages.

Should be extended to properly split telephone numbers or dates, etc. (for written texts).

__init__(lang, dict_replace=None, speech=True)[source]

Create a sppasSimpleSplitter instance.

Parameters
  • lang – the language code in iso639-3.

  • dict_replace – Replacement dictionary

  • speech – (bool) split transcribed speech vs written text

split(utt)[source]

Split an utterance using whitespace.

If the language is character-based, split each character.

Parameters

utt – (str) an utterance of a transcription, a sentence, …

Returns

A list (array of string)

split_characters(utt)[source]

Split an utterance by characters.

Parameters

utt – (str) the utterance (a transcription, a sentence, …) in utf-8

Returns

A string (split character by character, using whitespace)
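
A usage sketch (hypothetical; 'cmn' is character-based per lang_list above, so the outputs are indicative):

>>> from sppas.src.annotations.TextNorm.splitter import sppasSimpleSplitter
>>> sppasSimpleSplitter("eng").split("hello world")
['hello', 'world']
>>> sppasSimpleSplitter("cmn").split_characters("你好")
'你 好'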

annotations.TextNorm.sppastextnorm module

filename: sppas.src.annotations.TextNorm.sppastextnorm.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: SPPAS integration of Text Normalization automatic annotation.

class annotations.TextNorm.sppastextnorm.sppasTextNorm(log=None)[source]

Bases: annotations.baseannot.sppasBaseAnnotation

Text normalization automatic annotation.

__init__(log=None)[source]

Create a sppasTextNorm instance without any linguistic resources.

Parameters

log – (sppasLog) Human-readable logs.

convert(tier)[source]

Text normalization of all labels of a tier.

Parameters

tier – (sppasTier) the orthographic transcription (standard or EOT)

Returns

A tuple with 3 tiers named:

  • “Tokens-Faked”

  • “Tokens-Std”

  • “Tokens-Custom”

fix_options(options)[source]

Fix all options. Available options are:

  • faked

  • std

  • custom

Parameters

options – (sppasOption)

get_inputs(input_files)[source]

Return the tier with aligned tokens.

Parameters

input_files – (list)

Raise

NoTierInputError

Returns

(sppasTier)

get_output_pattern()[source]

Pattern this annotation uses in an output filename.

load_resources(vocab_filename, lang='und', **kwargs)[source]

Fix the list of words of a given language.

It allows better tokenization and enables language-dependent modules such as num2letter.

Parameters
  • vocab_filename – (str) File with the list of words of the language

  • lang – (str) the language code

occ_dur(tier)[source]

Create a tier with the number of tokens and the duration of each annotation.

Parameters

tier – (sppasTier)

run(input_files, output=None)[source]

Run the automatic annotation process on an input.

Parameters
  • input_files – (list of str) orthographic transcription

  • output – (str) the output file name

Returns

(sppasTranscription)
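
A hypothetical end-to-end sketch (all file names are illustrative, not actual SPPAS resources):

>>> from sppas.src.annotations.TextNorm.sppastextnorm import sppasTextNorm
>>> ann = sppasTextNorm()
>>> ann.load_resources("vocab.txt", lang="fra")  # vocab.txt: hypothetical word list
>>> trs = ann.run(["transcript.xra"], output="transcript-token.xra")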

set_custom(value)[source]

Fix the custom option.

Parameters

value – (bool) Create a customized tokenization

set_faked(value)[source]

Fix the faked option.

Parameters

value – (bool) Create a faked tokenization

set_occ_dur(value)[source]

Fix the occurrences and duration tiers generation option.

Parameters

value – (bool) Create a tier with nb of tokens and duration

set_std(value)[source]

Fix the std option.

Parameters

value – (bool) Create a standard tokenization

annotations.TextNorm.tokenize module

filename: sppas.src.annotations.TextNorm.tokenize.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: Tokenization module for the multilingual text normalization system.

class annotations.TextNorm.tokenize.sppasTokenSegmenter(vocab=None)[source]

Bases: object

Create words from tokens on the basis of a lexicon.

This is a totally language-independent method, based on a longest-matching algorithm to aggregate tokens into words. Words of the lexicon are found and are (see the usage sketch after these examples):

1/ unbound, or not, if they contain a separator character:

  • rock'n'roll -> rock'n'roll

  • I'm -> I 'm

  • it's -> it 's

2/ bound using a separator character, for example '_':

  • parce que -> parce_que

  • rock'n roll -> rock'n_roll
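
A minimal usage sketch (assuming a hypothetical lexicon that contains "parce_que"; outputs follow the examples above and are indicative):

>>> from sppas.src.annotations.TextNorm.tokenize import sppasTokenSegmenter
>>> segmenter = sppasTokenSegmenter(vocab)  # vocab: a hypothetical sppasVocabulary
>>> segmenter.bind(["parce", "que", "il"])
['parce_que', 'il']
>>> segmenter.unbind(["I'm"])
['I', "'m"]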

SEPARATOR = '_'
STICK_MAX = 7
__init__(vocab=None)[source]

Create a new sppasTokenSegmenter instance.

Parameters

vocab – (Vocabulary)

bind(utt)[source]

Bind tokens of an utterance using a specific character.

Parameters

utt – (list) List of tokens of an utterance (a transcription, a sentence, …)

Returns

A list of strings

set_aggregate_max(value=7)[source]

Fix the maximum number of words to stick.

This is a language-dependent value. For French, it is 5, e.g. for the phrase “au fur et à mesure”. It can be set higher to stick phrases instead of words.

Parameters

value – (int) Maximum number of tokens to aggregate/stick.

set_separator(char='_')[source]

Fix the character to separate tokens.

Parameters

char – (char) Separator character. Can be an empty string.

unbind(utt)[source]

Unbind tokens containing "-", "'" or "." depending on rules.

Parameters

utt – (list) List of tokens of an utterance (a transcription, a sentence, …)

Returns

A list of strings

Module contents

filename: sppas.src.annotations.TextNorm.__init__.py
author: Brigitte Bigi
contact: develop@sppas.org
summary: Text Normalization automatic annotation.

The creation of text corpora requires a sequence of processing steps to constitute them, normalize them, and then exploit them directly in a given application. This package implements a generic approach to text normalization that can be applied to a multipurpose, multilingual text or transcribed corpus. It splits the text normalization problem into a set of minor sub-problems that are as language-independent as possible. Porting to a new language consists of inheriting all language-independent methods and rapidly adapting the remaining language-dependent methods or classes.

For details, read the following reference:

Brigitte Bigi (2011). A Multilingual Text Normalization Approach. 2nd Less-Resourced Languages Workshop, 5th Language & Technology Conference, Poznań (Poland).

The following classes are re-exported at the package level; they are documented in their submodule sections above:

  • annotations.TextNorm.TextNormalizer

  • annotations.TextNorm.sppasOrthoTranscription

  • annotations.TextNorm.sppasSimpleSplitter

  • annotations.TextNorm.sppasTextNorm

  • annotations.TextNorm.sppasTokenSegmenter