annotations.StopWords package

Submodules

annotations.StopWords.sppasstpwds module

filename

sppas.src.annotations.StopWords.sppasstpwds.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

SPPAS integration of the StopWords automatic annotation.

class annotations.StopWords.sppasstpwds.sppasStopWords(log=None)

Bases: annotations.baseannot.sppasBaseAnnotation

SPPAS integration of the identification of stop words in a tier.

__init__(log=None)

Create a new instance.

Parameters

log – (sppasLog) Human-readable logs.

fix_options(options)

Fix all options.

Parameters

options – list of sppasOption instances
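As an illustration only, the dispatch from a list of options to the two setters documented for this class could look like the sketch below. The `Option` tuple and the `OptionsHolder` class are hypothetical stand-ins, not SPPAS objects, and the option keys "alpha" and "tiername" are assumed from the setter descriptions.

```python
from collections import namedtuple

# Hypothetical stand-in for a sppasOption: only a key and a value.
Option = namedtuple("Option", ["key", "value"])


class OptionsHolder:
    """Toy object exposing the two setters documented for sppasStopWords."""

    def __init__(self):
        self.alpha = None
        self.tiername = None

    def set_alpha(self, alpha):
        self.alpha = float(alpha)

    def set_tiername(self, tier_name):
        self.tiername = str(tier_name)

    def fix_options(self, options):
        # Map each assumed option key to the matching setter.
        setters = {"alpha": self.set_alpha, "tiername": self.set_tiername}
        for opt in options:
            if opt.key not in setters:
                raise KeyError("unknown option: {}".format(opt.key))
            setters[opt.key](opt.value)


holder = OptionsHolder()
holder.fix_options([Option("alpha", "1.5"), Option("tiername", "Tokens")])
print(holder.alpha, holder.tiername)
```

Rejecting unknown keys is a design choice of this sketch; the source does not say how the real method handles them.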

get_inputs(input_files)

Return the tier with aligned tokens.

Parameters

input_files – (list)

Raise

NoTierInputError

Returns

(sppasTier)

get_output_pattern()

Pattern this annotation uses in an output filename.

load_resources(lang_resources, lang=None)

Load a list of stop-words and replacements.

Override the existing loaded lists…

Parameters
  • lang_resources – (str) File with extension ‘.stp’ or nothing

  • lang – (str)

make_stp_tier(tier)

Return a tier indicating if entries are stop-words.

Parameters

tier – (sppasTier)
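The idea of a tier that flags stop-words can be sketched without SPPAS. In the example below a tier is approximated by a list of (start, end, token) tuples and the result carries a boolean instead of the token. The function name and the data layout are assumptions made for illustration; the real method works on sppasTier objects.

```python
def make_stp_labels(intervals, stop_words):
    """Map (start, end, token) tuples to (start, end, is_stop_word) tuples."""
    labeled = []
    for start, end, token in intervals:
        # The time span is kept unchanged; only the label becomes a boolean.
        labeled.append((start, end, token in stop_words))
    return labeled


tokens_tier = [(0.0, 0.2, "the"), (0.2, 0.6, "cat"), (0.6, 0.8, "is")]
print(make_stp_labels(tokens_tier, {"the", "is"}))
```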

run(input_files, output=None)

Run the automatic annotation process on an input.

Parameters
  • input_files – (list of str) Time-aligned tokens

  • output – (str) the output file name

Returns

(sppasTranscription)

set_alpha(alpha)

Fix the alpha option.

Alpha is a coefficient to add specific stop-words in the list.

Parameters

alpha – (float)

set_tiername(tier_name)

Fix the tiername option.

Parameters

tier_name – (str)

annotations.StopWords.stpwds module

filename

sppas.src.annotations.StopWords.stpwds.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Stopwords detection.

class annotations.StopWords.stpwds.StopWords(case_sensitive=False)

Bases: sppas.src.resources.vocab.sppasVocabulary

A vocabulary that can automatically evaluate a list of Stop-Words.

An entry ‘w’ is relevant for the speaker if its probability is less than or equal to a threshold:

P(w) <= 1 / (alpha * V)

where ‘alpha’ is an empirical coefficient and ‘V’ is the vocabulary size of the speaker.
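A minimal sketch of this rule, independent of SPPAS, is given below. It assumes that P(w) is the relative frequency of ‘w’ among the speaker's tokens and that alpha is strictly positive; the function name `split_by_relevance` is hypothetical.

```python
from collections import Counter


def split_by_relevance(tokens, alpha=0.5):
    """Return (relevant, stop_candidates, threshold) for a list of tokens."""
    if not tokens:
        raise ValueError("no tokens to analyze")
    if alpha <= 0:
        raise ValueError("alpha must be positive in this sketch")

    counts = Counter(tokens)
    total = len(tokens)              # number of observed tokens
    v = len(counts)                  # V: vocabulary size of the speaker
    threshold = 1.0 / (alpha * v)

    relevant, stop_candidates = set(), set()
    for word, n in counts.items():
        p = n / total                # P(w), assumed to be a relative frequency
        if p <= threshold:
            relevant.add(word)       # relevant entries stay out of the stop-list
        else:
            stop_candidates.add(word)
    return relevant, stop_candidates, threshold


tokens = ["the"] * 6 + ["a"] * 2 + ["cat", "dog"]
relevant, stops, threshold = split_by_relevance(tokens, alpha=0.5)
print(threshold, sorted(stops), sorted(relevant))
```

With these ten tokens, V is 4 and the threshold is 1 / (0.5 * 4) = 0.5, so only "the" (probability 0.6) exceeds it.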

MAX_ALPHA = 4.0

MIN_ANN_NUMBER = 5

__init__(case_sensitive=False)

Create a new StopWords instance.

Parameters

case_sensitive – (bool) Considers the case of entries or not.

property alpha

Return the value of alpha coefficient (float).

copy()

Make a deep copy of the instance.

Returns

(StopWords)

evaluate(tier=None, merge=True)

Add entries to the list of stop-words from the content of a tier.

Estimate whether each token is relevant; a token that is not relevant is added to the stop-list.

Parameters
  • tier – (sppasTier) A tier with entries to be analyzed.

  • merge – (bool) Merge with the existing list (if True) or delete the existing list and create a new one (if False)

Returns

(int) Number of entries added into the list

Raises

EmptyInputError, TooSmallInputError
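The merge flag and the two error cases can be illustrated with a plain set standing in for the stop-list. This is a sketch under assumptions: `ValueError` replaces the SPPAS exceptions, MIN_ANN_NUMBER is taken to be the minimum number of tokens required, and alpha must be positive.

```python
from collections import Counter

MIN_ANN_NUMBER = 5  # same value as the documented class attribute


def evaluate_sketch(stop_list, tokens, alpha=0.5, merge=True):
    """Update stop_list (a set) in place; return the number of entries added."""
    if not tokens:
        raise ValueError("empty input")       # stand-in for EmptyInputError
    if len(tokens) < MIN_ANN_NUMBER:
        raise ValueError("input too small")   # stand-in for TooSmallInputError
    if alpha <= 0:
        raise ValueError("alpha must be positive in this sketch")

    counts = Counter(tokens)
    threshold = 1.0 / (alpha * len(counts))
    # Tokens whose relative frequency exceeds the threshold are not relevant.
    found = {w for w, n in counts.items() if n / len(tokens) > threshold}

    if not merge:
        stop_list.clear()                     # start again from an empty list
    added = found - stop_list
    stop_list |= added
    return len(added)


stop_list = {"euh"}
tokens = ["the"] * 6 + ["a"] * 2 + ["cat", "dog"]
print(evaluate_sketch(stop_list, tokens, merge=True), sorted(stop_list))
print(evaluate_sketch(stop_list, tokens, merge=False), sorted(stop_list))
```

With merge=True the pre-existing entry "euh" is kept next to the newly found one; with merge=False the set is emptied before the new entries are stored.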

get_alpha()

Return the value of alpha coefficient (float).

get_threshold()

Return the last estimated threshold (float).

get_v()

Return the last estimated vocabulary size (int).

load(filename, merge=True)

Load a list of stop-words from a file.

Parameters
  • filename – (str)

  • merge – (bool) Merge with the existing list (if True) or delete the existing list (if False)

set_alpha(alpha)

Fix the alpha option.

Alpha is a coefficient to add specific stop-words in the list. Default value is 0.5.

Parameters

alpha – (float) Value in range [0..4]
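To see how alpha moves the threshold 1 / (alpha * V), the sketch below evaluates it for a fixed vocabulary size. The helper `threshold_for` is hypothetical, and both the range check and the treatment of alpha = 0 as an infinite threshold are assumptions of this example, not documented SPPAS behavior.

```python
MAX_ALPHA = 4.0  # upper bound documented for the class


def threshold_for(alpha, vocab_size):
    """Return 1 / (alpha * V) for an alpha in the documented range [0, 4]."""
    if not 0.0 <= alpha <= MAX_ALPHA:
        raise ValueError("alpha must be in [0, 4]")
    if alpha == 0.0:
        # Assumption: with alpha = 0 no probability can exceed the threshold.
        return float("inf")
    return 1.0 / (alpha * vocab_size)


for alpha in (0.5, 1.0, 2.0, 4.0):
    print(alpha, threshold_for(alpha, vocab_size=100))
```

For V = 100 the threshold falls from 0.02 at alpha = 0.5 to 0.0025 at alpha = 4, so a larger alpha lets more entries exceed it and enter the stop-list.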

Module contents

filename

sppas.src.annotations.StopWords.__init__.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Stop-words boolean annotation.

class annotations.StopWords.StopWords(case_sensitive=False)

Bases: sppas.src.resources.vocab.sppasVocabulary

A vocabulary that can automatically evaluate a list of Stop-Words.

An entry ‘w’ is relevant for the speaker if its probability is less than or equal to a threshold:

P(w) <= 1 / (alpha * V)

where ‘alpha’ is an empirical coefficient and ‘V’ is the vocabulary size of the speaker.

MAX_ALPHA = 4.0

MIN_ANN_NUMBER = 5

__init__(case_sensitive=False)

Create a new StopWords instance.

Parameters

case_sensitive – (bool) Considers the case of entries or not.

property alpha

Return the value of alpha coefficient (float).

copy()

Make a deep copy of the instance.

Returns

(StopWords)

evaluate(tier=None, merge=True)

Add entries to the list of stop-words from the content of a tier.

Estimate whether each token is relevant; a token that is not relevant is added to the stop-list.

Parameters
  • tier – (sppasTier) A tier with entries to be analyzed.

  • merge – (bool) Merge with the existing list (if True) or delete the existing list and create a new one (if False)

Returns

(int) Number of entries added into the list

Raises

EmptyInputError, TooSmallInputError

get_alpha()

Return the value of alpha coefficient (float).

get_threshold()

Return the last estimated threshold (float).

get_v()

Return the last estimated vocabulary size (int).

load(filename, merge=True)

Load a list of stop-words from a file.

Parameters
  • filename – (str)

  • merge – (bool) Merge with the existing list (if True) or delete the existing list (if False)

set_alpha(alpha)

Fix the alpha option.

Alpha is a coefficient to add specific stop-words in the list. Default value is 0.5.

Parameters

alpha – (float) Value in range [0..4]

class annotations.StopWords.sppasStopWords(log=None)

Bases: annotations.baseannot.sppasBaseAnnotation

SPPAS integration of the identification of stop words in a tier.

__init__(log=None)

Create a new instance.

Parameters

log – (sppasLog) Human-readable logs.

fix_options(options)

Fix all options.

Parameters

options – list of sppasOption instances

get_inputs(input_files)

Return the tier with aligned tokens.

Parameters

input_files – (list)

Raise

NoTierInputError

Returns

(sppasTier)

get_output_pattern()

Pattern this annotation uses in an output filename.

load_resources(lang_resources, lang=None)

Load a list of stop-words and replacements.

Override the existing loaded lists…

Parameters
  • lang_resources – (str) File with extension ‘.stp’ or nothing

  • lang – (str)

make_stp_tier(tier)

Return a tier indicating if entries are stop-words.

Parameters

tier – (sppasTier)

run(input_files, output=None)

Run the automatic annotation process on an input.

Parameters
  • input_files – (list of str) Time-aligned tokens

  • output – (str) the output file name

Returns

(sppasTranscription)

set_alpha(alpha)

Fix the alpha option.

Alpha is a coefficient to add specific stop-words in the list.

Parameters

alpha – (float)

set_tiername(tier_name)

Fix the tiername option.

Parameters

tier_name – (str)