SPPAS 4.22

https://sppas.org/

Module sppas.src.annotations

Class StopWords

Description

A vocabulary that can automatically evaluate a list of Stop-Words.

An entry 'w' is relevant for the speaker if its probability is less than or equal to a threshold:

| P(w) <= 1 / (alpha * V)

where 'alpha' is an empirical coefficient and 'V' is the vocabulary size of the speaker.
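
As a worked illustration of the rule above, the following minimal sketch computes the threshold for a toy token list with the Python standard library only. The token list and every variable name are assumptions made for the example; the sketch does not use the SPPAS classes.

```python
from collections import Counter

# Hypothetical toy data: tokens produced by one speaker.
tokens = ["the"] * 5 + ["cat", "dog", "bird", "a", "sat"]

alpha = 0.5                      # default coefficient of StopWords
counts = Counter(tokens)
v = len(counts)                  # vocabulary size V: number of distinct tokens
threshold = 1.0 / (alpha * v)    # 1 / (alpha * V)

# P(w) is estimated as the relative frequency of w among all tokens.
probabilities = {w: c / len(tokens) for w, c in counts.items()}

# An entry is relevant when P(w) <= threshold.
relevant = sorted(w for w, p in probabilities.items() if p <= threshold)
```

Here V is 6, so the threshold is 1/3. The token "the" has a probability of 0.5 and is therefore not relevant, while the five other tokens, each with a probability of 0.1, are.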

Constructor

Create a new StopWords instance.

Parameters
  • case_sensitive: (bool) Whether the case of entries is taken into account.
View Source
def __init__(self, case_sensitive=False):
    """Create a new StopWords instance.

    :param case_sensitive: (bool) Whether the case of entries is taken into account.

    """
    super(StopWords, self).__init__(filename=None, nodump=True, case_sensitive=case_sensitive)
    self.__alpha = 0.5
    self.__threshold = 0.0
    self.__v = 0

Public functions

get_alpha

Return the value of the alpha coefficient (float).

View Source
def get_alpha(self):
    """Return the value of the alpha coefficient (float)."""
    return self.__alpha

get_threshold

Return the last estimated threshold (float).

View Source
def get_threshold(self):
    """Return the last estimated threshold (float)."""
    return self.__threshold

get_v

Return the last estimated vocabulary size (int).

View Source
def get_v(self):
    """Return the last estimated vocabulary size (int)."""
    return self.__v

set_alpha

Fix the alpha option.

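Alpha is an empirical coefficient that scales the threshold 1 / (alpha * V): the larger alpha is, the lower the threshold, so the more entries are added to the stop-list. Default value is 0.5.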

Parameters
  • alpha: (float) Value in range (0..4]; 0 is excluded
View Source
def set_alpha(self, alpha):
    """Fix the alpha option.

    Alpha is a coefficient to add specific stop-words in the list.
    Default value is 0.5.

    :param alpha: (float) Value in range (0..4]; 0 is excluded
    :raises: IndexRangeException

    """
    alpha = float(alpha)
    if 0.0 < alpha <= self.MAX_ALPHA:
        self.__alpha = alpha
    else:
        raise IndexRangeException(alpha, 0, StopWords.MAX_ALPHA)
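
To make the role of alpha concrete, here is a small sketch of how the threshold moves when alpha varies for a fixed vocabulary size. The helper name stop_word_threshold and the vocabulary size of 100 are hypothetical and chosen for illustration only; they are not part of SPPAS.

```python
def stop_word_threshold(alpha, v):
    """Return 1 / (alpha * V), the probability above which an entry is a stop-word.

    Hypothetical helper written for this example; not an SPPAS function.
    """
    return 1.0 / (alpha * float(v))


v = 100  # assumed vocabulary size
for alpha in (0.5, 1.0, 2.0, 4.0):
    # The threshold decreases as alpha grows.
    print(alpha, stop_word_threshold(alpha, v))
```

A larger alpha gives a lower threshold, so more entries exceed it and are treated as stop-words.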

copy

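Make a deep copy of the instance. Only the entries and the alpha coefficient are copied: the copy is created with the default case_sensitive=False, and its last estimated threshold and vocabulary size start from their initial values.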

Returns
  • (StopWords)
View Source
def copy(self):
    """Make a deep copy of the instance.

    :returns: (StopWords)

    """
    s = StopWords()
    for i in self:
        s.add(i)
    s.set_alpha(self.__alpha)
    return s

load

Load a list of stop-words from a file.

Parameters
  • filename: (str)
  • merge: (bool) Merge with the existing list (if True) or delete the existing list (if False)
View Source
def load(self, filename, merge=True):
    """Load a list of stop-words from a file.

    :param filename: (str)
    :param merge: (bool) Merge with the existing list (if True) or
    delete the existing list (if False)

    """
    if merge is False:
        self.clear()
    self.load_from_ascii(filename)

evaluate

Add entries to the list of stop-words from the content of a tier.

Estimate whether each token is relevant; a token that is not relevant is added to the stop-list.

Parameters
  • tier: (sppasTier) A tier with entries to be analyzed.
  • merge: (bool) Merge with the existing list (if True) or delete the existing list and create a new one (if False)
Returns
  • (int) Number of entries added into the list
Raises
  • EmptyInputError: the tier is None or empty
  • TooSmallInputError: the tier has fewer than MIN_ANN_NUMBER annotations

View Source
def evaluate(self, tier=None, merge=True):
    """Add entries to the list of stop-words from the content of a tier.

    Estimate whether each token is relevant; a token that is not
    relevant is added to the stop-list.

    :param tier: (sppasTier) A tier with entries to be analyzed.
    :param merge: (bool) Merge with the existing list (if True) or
    delete the existing list and create a new one (if False)
    :returns: (int) Number of entries added into the list
    :raises: EmptyInputError, TooSmallInputError

    """
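    # A missing tier has no name: tier.get_name() must not be called on None.
    # The string "None" is a placeholder name for the error message.
    if tier is None:
        raise EmptyInputError("None")
    if tier.is_empty():
        raise EmptyInputError(tier.get_name())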
    if len(tier) < StopWords.MIN_ANN_NUMBER:
        raise TooSmallInputError(tier.get_name())
    # Count the occurrences of each token, skipping entries of symbols.all.
    unigram = sppasUnigram()
    for ann in tier:
        for label in ann.get_labels():
            tag = label.get_best()
            content = tag.get_content()
            if content not in symbols.all:
                unigram.add(content)
    # Threshold: 1 / (alpha * V), V being the vocabulary size.
    self.__v = len(unigram)
    self.__threshold = 1.0 / (self.__alpha * float(self.__v))
    if merge is False:
        self.clear()
    # A token whose probability exceeds the threshold is a stop-word.
    usum = float(unigram.get_sum())
    nb = 0
    for token in unigram.get_tokens():
        p_w = float(unigram.get_count(token)) / usum
        if p_w > self.__threshold:
            self.add(token)
            nb += 1
    return nb
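
The procedure above can be sketched outside SPPAS on a plain list of tokens. This is a simplified illustration under stated assumptions, not the SPPAS implementation: the function estimate_stop_words and its parameters are hypothetical, a set stands in for the vocabulary, and the filter on symbols.all and the minimum number of annotations are left out.

```python
from collections import Counter


def estimate_stop_words(tokens, alpha=0.5, existing=None, merge=True):
    """Sketch of the evaluation step on a plain list of tokens.

    Hypothetical helper, not an SPPAS function. Returns the resulting
    stop-list (a set) and the number of tokens above the threshold.
    """
    if not tokens:
        raise ValueError("no tokens to analyze")
    counts = Counter(tokens)
    # Threshold: 1 / (alpha * V), V being the number of distinct tokens.
    threshold = 1.0 / (alpha * float(len(counts)))
    # merge=True keeps the existing entries, merge=False starts from scratch.
    stop_list = set(existing or ()) if merge else set()
    total = float(sum(counts.values()))
    nb = 0
    for token, count in counts.items():
        # A token whose probability exceeds the threshold is a stop-word.
        if count / total > threshold:
            stop_list.add(token)
            nb += 1
    return stop_list, nb


# Usage with assumed toy data: "the" covers half of the tokens.
tokens = ["the"] * 5 + ["cat", "dog", "bird", "a", "sat"]
merged, nb_merged = estimate_stop_words(tokens, existing={"of"})
fresh, nb_fresh = estimate_stop_words(tokens, existing={"of"}, merge=False)
```

With these tokens, only "the" exceeds the threshold of 1/3. The merged result keeps the pre-existing entry "of", whereas merge=False discards it.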