annotations.SelfRepet package

Submodules

annotations.SelfRepet.datastructs module

filename

sppas.src.annotations.SelfRepet.datastructs.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Data structure to store a source and its echos.

class annotations.SelfRepet.datastructs.DataRepetition(s1=None, s2=None, r1=None, r2=None)[source]

Bases: object

Class to store one repetition (the source and the echos).

The source of a repetition is represented as a tuple (start, end). The echos of this latter are stored as a list of tuples (start, end).

__init__(s1=None, s2=None, r1=None, r2=None)[source]

Create a DataRepetition data structure.

Parameters
  • s1 – start position of the source.

  • s2 – end position of the source.

  • r1 – start position of an echo

  • r2 – end position of an echo

add_echo(start, end)[source]

Add an entry in the list of echos.

Parameters
  • start – Start position of the echo.

  • end – End position of the echo.

Raises

ValueError

get_echos()[source]

Return the list of echos.

get_source()[source]

Return the tuple (start, end) of the source.

reset()[source]

Set the source to None and the echos to an empty list.

set_source(start, end)[source]

Set the position of the source.

Setting the position of the source automatically resets the echos because it’s not correct to change the source of existing echos.

Parameters
  • start – Start position of the source

  • end – End position of the source

Raises

ValueError, IndexError
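
The behaviour described for DataRepetition can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the SPPAS code: its name (RepetitionSketch) and the conditions under which it raises are assumptions, since this page lists the exception types without saying when they occur.

```python
class RepetitionSketch:
    """Hypothetical stand-in for DataRepetition (not the SPPAS code)."""

    def __init__(self, s1=None, s2=None, r1=None, r2=None):
        self.reset()
        if s1 is not None and s2 is not None:
            self.set_source(s1, s2)
            if r1 is not None and r2 is not None:
                self.add_echo(r1, r2)

    def reset(self):
        # Source back to None, echos back to an empty list.
        self._source = None
        self._echos = []

    def set_source(self, start, end):
        start, end = int(start), int(end)  # ValueError on a non-numeric string
        # Assumed validity check: when IndexError is raised is not documented.
        if start < 0 or end < start:
            raise IndexError("invalid source range: (%d, %d)" % (start, end))
        self.reset()  # existing echos belonged to the previous source
        self._source = (start, end)

    def add_echo(self, start, end):
        # Assumption: an echo cannot exist without a source.
        if self._source is None:
            raise ValueError("no source defined")
        self._echos.append((int(start), int(end)))

    def get_source(self):
        return self._source

    def get_echos(self):
        return self._echos


rep = RepetitionSketch(2, 3, 7, 8)
rep.add_echo(12, 13)      # two echos are now stored for the source (2, 3)
rep.set_source(20, 21)    # changing the source discards both echos
```

The reset inside set_source is the point of the sketch: echos are only meaningful relative to the source they repeat.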

class annotations.SelfRepet.datastructs.DataSpeaker(tokens)[source]

Bases: object

Class to store data of a speaker.

Stored data are a list of formatted unicode strings.

__init__(tokens)[source]

Create a DataSpeaker instance.

Parameters

tokens – (list) List of tokens.

get_next_word(current)[source]

Return the index of the next word in entries.

Parameters

current – (int) Current position to search for the next word

Returns

(int) Index of the next word or -1 if no next word can be found.

is_word(idx)[source]

Return True if the entry at the given index is a word.

An empty entry is not a word. Symbols (silences, laughs…) are not words. Hesitations are considered words.

Return False if the given index is invalid.

Parameters

idx – (int) Index of the entry to get

Returns

(bool)
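
A minimal sketch of the word test and the next-word search, written over a plain list of strings. The symbol set and the hesitation "euh" are assumptions made for illustration (this page only mentions "silences, laughs…"), and the functions are hypothetical stand-ins for the DataSpeaker methods. The sketch also assumes that the search starts at the entry following current.

```python
# Assumed symbol set; the real list of symbols is not given on this page.
SYMBOLS = {"#", "+", "*", "@", "dummy"}


def is_word(tokens, idx):
    """Return True if tokens[idx] is a non-empty entry that is not a symbol."""
    if idx < 0 or idx >= len(tokens):
        return False  # invalid index
    entry = tokens[idx].strip()
    return len(entry) > 0 and entry not in SYMBOLS


def get_next_word(tokens, current):
    """Return the index of the first word after `current`, or -1."""
    for idx in range(current + 1, len(tokens)):
        if is_word(tokens, idx):
            return idx
    return -1


tokens = ["the", "#", "the", "cat", "", "euh", "@"]
after_first = get_next_word(tokens, 0)   # skips the silence at index 1
after_cat = get_next_word(tokens, 3)     # skips the empty entry at index 4
```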

is_word_repeated(current, other_current, other_speaker)[source]

Check whether a token is a repeated word.

Parameters
  • current – (int) From index, in current speaker

  • other_current – (int) From index, in the other speaker

  • other_speaker – (DataSpeaker) Data of the other speaker

Returns

(int) Index of the echo, or -1
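
The search for a repeated word can be sketched as follows, again over plain lists. The function is a hypothetical stand-in, the symbol set is an assumption, and exact string equality is assumed as the matching criterion. For self-repetitions, the same list plays both roles.

```python
def is_word_repeated(tokens, current, other_current, other_tokens,
                     symbols=frozenset({"#", "+", "*", "@", "dummy"})):
    """Return the index in other_tokens (from other_current on) of the
    word tokens[current], or -1 if it is not found."""
    if not 0 <= current < len(tokens):
        return -1
    word = tokens[current]
    if not word or word in symbols:
        return -1  # only words can be repeated
    for idx in range(max(other_current, 0), len(other_tokens)):
        if other_tokens[idx] == word:
            return idx
    return -1


tokens = ["i", "think", "#", "i", "think", "so"]
echo_of_i = is_word_repeated(tokens, 0, 1, tokens)    # "i" occurs again at index 3
echo_of_so = is_word_repeated(tokens, 5, 6, tokens)   # nothing follows "so"
```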

class annotations.SelfRepet.datastructs.Entry(entry)[source]

Bases: object

Class to store a formatted unicode entry.

__init__(entry)[source]

Create an Entry instance.

Parameters

entry – (str, unicode)

get()[source]

Return the formatted unicode entry.

set(entry)[source]

Set the entry.

Parameters

entry – (str, unicode) entry to store.

annotations.SelfRepet.detectrepet module

filename

sppas.src.annotations.SelfRepet.detectrepet.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Detect Self-Repetition of a speaker.

class annotations.SelfRepet.detectrepet.SelfRepetition(stop_list=None)[source]

Bases: annotations.SelfRepet.datastructs.DataRepetition

Self-Repetition automatic detection.

Search for the sources, then find where the echos are.

__init__(stop_list=None)[source]

Create a new SelfRepetition instance.

Parameters

stop_list – (StopWords) List of irrelevant tokens.

detect(speaker, limit=10)[source]

Search for the first self-repetition in tokens.

Parameters
  • speaker – (DataSpeaker) All the data of speaker

  • limit – (int) Go no further than ‘limit’ entries in speaker data

find_echos(start, end, speaker)[source]

Find all echos of a source.

Parameters
  • start – (int) start index of the entry of the source (speaker)

  • end – (int) end index of the entry of the source (speaker)

  • speaker – (DataSpeaker) All data of speaker

Returns

DataRepetition()
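
A simplified sketch of an echo search. It assumes that an echo is a later, strict (token-for-token) copy of the whole source and returns plain (start, end) tuples; the actual method returns a DataRepetition and may also accept partial or non-strict echos. The function below is hypothetical.

```python
def find_strict_echos(tokens, start, end):
    """Return the (start, end) index pairs of later strict copies of
    tokens[start:end + 1]."""
    source = tokens[start:end + 1]
    size = len(source)
    if size == 0:
        return []
    echos = []
    idx = end + 1  # echos are searched after the source only
    while idx + size <= len(tokens):
        if tokens[idx:idx + size] == source:
            echos.append((idx, idx + size - 1))
            idx += size  # continue after this echo
        else:
            idx += 1
    return echos


tokens = ["on", "the", "table", "#", "on", "the", "table", "on", "the"]
three_tokens = find_strict_echos(tokens, 0, 2)
two_tokens = find_strict_echos(tokens, 0, 1)
```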

static get_longest(current, speaker)[source]

Return the index of the last token of the longest repeated string.

Parameters
  • current – (int) Current index in entries of speaker data

  • speaker – (DataSpeaker) All the data of speaker

Returns

(int) Index or -1
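
One way to read "longest repeated string" is to extend the candidate token by token for as long as each token is a word that occurs again later. The sketch below follows that reading; it is an assumption, not the SPPAS algorithm, and it does not check that the repeated tokens are contiguous in the echo.

```python
def get_longest(tokens, current,
                symbols=frozenset({"#", "+", "*", "@", "dummy"})):
    """Return the index of the last token of the longest repeated run
    starting at `current`, or -1 if tokens[current] is not repeated."""
    last = -1
    for idx in range(max(current, 0), len(tokens)):
        word = tokens[idx]
        is_word = bool(word) and word not in symbols
        # A token extends the run only if it is a word that occurs again later.
        if is_word and word in tokens[idx + 1:]:
            last = idx
        else:
            break
    return last


tokens = ["i", "think", "so", "#", "i", "think"]
last_index = get_longest(tokens, 0)   # "i think" is repeated, "so" is not
```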

select(start, end, speaker)[source]

Append (or not) a self-repetition.

Parameters
  • start – (int) start index of the entry of the source (speaker)

  • end – (int) end index of the entry of the source (speaker)

  • speaker – (DataSpeaker) Entries of speaker

annotations.SelfRepet.rules module

filename

sppas.src.annotations.SelfRepet.rules.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Self-Repetition rules to accept/reject a candidate.

class annotations.SelfRepet.rules.SelfRules(stop_list=None)[source]

Bases: object

Rules to select self-repetitions.

The proposed rules deal with the number of words and the word frequencies, and distinguish whether the repetition is strict or not. The following rules are proposed for other-repetitions:

  • Rule 1: A source is accepted if it contains one or more relevant tokens. Relevance depends on the speaker producing the echo;

  • Rule 2: A source which contains at least K tokens is accepted if the repetition is strict.

Rule 1 requires a clear definition of the relevance of a token. Irrelevant tokens are stored in a stop-list. The stop-list should also contain very frequent tokens in the given language, such as adjectives, pronouns, etc.
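
The two rules can be read as alternative acceptance conditions, which the following sketch makes explicit. The function name, its parameters and the default value of K are hypothetical; this page does not state the value of K.

```python
def accept_source(n_tokens, n_relevant, strict, k=3):
    """Decide whether a candidate source is accepted.

    n_tokens   -- number of tokens in the source
    n_relevant -- number of relevant tokens (not symbols, not in the stop-list)
    strict     -- True if the echo repeats the source strictly
    k          -- minimum length for Rule 2 (placeholder value)
    """
    # Rule 1: one relevant token is enough.
    if n_relevant >= 1:
        return True
    # Rule 2: a long enough source is accepted if the repetition is strict.
    return n_tokens >= k and strict


by_rule_1 = accept_source(n_tokens=1, n_relevant=1, strict=False)
by_rule_2 = accept_source(n_tokens=4, n_relevant=0, strict=True)
rejected = accept_source(n_tokens=4, n_relevant=0, strict=False)
```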

__init__(stop_list=None)[source]

Create a SelfRules instance.

Parameters

stop_list – (sppasVocabulary or list) Irrelevant tokens.

count_relevant_tokens(start, end, speaker)[source]

Count the number of relevant words from start to end (included).

Parameters
  • start – (int) Index to start to count

  • end – (int) Index to stop to count

  • speaker – (DataSpeaker) All the data

Returns

(int)

is_relevant(idx, speaker)[source]

Check whether an entry of a speaker is relevant.

An entry is considered relevant if:

  1. It is not a silence, a pause, a laugh, a dummy or a noise;

  2. It is not in the stop-list.

Parameters
  • idx – (int) Index of the data to be checked

  • speaker – (DataSpeaker) All the data

Returns

(bool)
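
The relevance test and the inclusive count can be sketched over a plain token list. The symbol set and the stop-list content are assumptions for illustration, and the functions are hypothetical stand-ins for the SelfRules methods.

```python
# Assumed symbols for silence, pause, noise, laugh and dummy entries.
SYMBOLS = {"#", "+", "*", "@", "dummy"}


def is_relevant(tokens, idx, stop_list):
    """Return True if tokens[idx] is neither a symbol nor a stop-word."""
    if idx < 0 or idx >= len(tokens):
        return False
    entry = tokens[idx]
    return bool(entry) and entry not in SYMBOLS and entry not in stop_list


def count_relevant_tokens(tokens, start, end, stop_list):
    """Count the relevant entries from start to end, end included."""
    return sum(is_relevant(tokens, i, stop_list)
               for i in range(start, end + 1))


tokens = ["the", "cat", "#", "is", "on", "the", "mat"]
stop_list = {"the", "is", "on"}
n_all = count_relevant_tokens(tokens, 0, 6, stop_list)    # "cat" and "mat"
n_head = count_relevant_tokens(tokens, 0, 1, stop_list)   # "cat" only
```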

rule_one_token(current, speaker)[source]

Check whether one token is a self-repetition or not.

Rules are:

  • the token must be a word, and not in the stop-list;

  • the token must be repeated.

Parameters
  • current – (int) Index of the token to check

  • speaker – (DataSpeaker) All the data

Returns

(bool)

rule_syntagme(start, end, speaker)[source]

Apply Rule 1 to decide whether the selection is a repetition.

Rule 1: The selection of tokens of speaker 1 must contain at least one relevant token for speaker 2.

Parameters
  • start – (int) Index to start the selection

  • end – (int) Index to stop the selection

  • speaker – (DataSpeaker) All the data

Returns

(bool)

annotations.SelfRepet.sppasbaserepet module

filename

sppas.src.annotations.SelfRepet.sppasbaserepet.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Base class for SPPAS integration of repetitions detection.

class annotations.SelfRepet.sppasbaserepet.sppasBaseRepet(config, log=None)[source]

Bases: annotations.baseannot.sppasBaseAnnotation

SPPAS Automatic Any-Repetition Detection.

__init__(config, log=None)[source]

Create a new sppasBaseRepet instance.

Log is used for a better communication of the annotation process and its results. If None, logs are redirected to the default logging system.

Parameters
  • config – (str) Name of the JSON configuration file, without path.

  • log – (sppasLog) Human-readable logs.

fix_options(options)[source]

Set all options.

Parameters

options – list of sppasOption instances

load_resources(lang_resources, lang=None)[source]

Load a list of stop-words and replacements.

Override the existing loaded lists…

Parameters
  • lang_resources – (str) File with extension ‘.stp’ or ‘.lem’ or nothing

  • lang – (str)

make_stop_words(tier)[source]

Return a tier indicating if entries are stop-words.

Parameters

tier – (sppasTier) Time-aligned tokens.

make_word_strain(tier)[source]

Return a tier with modified tokens.

Parameters

tier – (sppasTier) Time-aligned tokens.

set_alpha(alpha)[source]

Set the alpha option.

Alpha is a coefficient to add specific stop-words in the list.

Parameters

alpha – (float)
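
How a coefficient can turn token frequencies into extra stop-words is sketched below. The formula is an assumption made for illustration (a token is added when its relative frequency exceeds 1 / (alpha × vocabulary size)); this page does not state the formula used by SPPAS, and the function name and default alpha are hypothetical.

```python
from collections import Counter


def frequent_tokens(tokens, alpha=0.5):
    """Return the tokens whose relative frequency exceeds the assumed
    threshold 1 / (alpha * V), where V is the vocabulary size."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return set()
    threshold = 1.0 / (alpha * len(counts))
    return {tok for tok, n in counts.items() if n / total > threshold}


tokens = ["the"] * 6 + ["cat", "dog", "sun", "sky"]
extra_stop_words = frequent_tokens(tokens, alpha=0.5)
```

Under this assumption, a larger alpha lowers the threshold, so more tokens are treated as stop-words.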

set_span(span)[source]

Set the span option.

Span is the maximum number of IPUs to search for repetitions. A value of 1 means to search only in the current IPU.

Parameters

span – (int)
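
The effect of span can be pictured as a window over the IPUs. The sketch below assumes that the window starts at the current IPU and extends forward; the function name and the representation of IPUs as lists of tokens are hypothetical.

```python
def tokens_in_span(ipus, current, span):
    """Flatten the tokens of `span` IPUs, starting at index `current`."""
    if span < 1:
        raise ValueError("span must be at least 1")
    window = ipus[current:current + span]
    return [token for ipu in window for token in ipu]


ipus = [["a", "b"], ["c"], ["d", "e"]]
current_only = tokens_in_span(ipus, 0, 1)   # span=1: the current IPU only
two_ipus = tokens_in_span(ipus, 0, 2)       # span=2: current IPU and the next
```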

set_use_stopwords(use_stopwords)[source]

Set the use_stopwords option.

If use_stopwords is set to True, sppasRepetition() will add specific stop-words to the stop-word list (deduced from the input text).

Parameters

use_stopwords – (bool)

annotations.SelfRepet.sppasrepet module

filename

sppas.src.annotations.SelfRepet.sppasrepet.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

SPPAS integration of Self-Repetition automatic annotation.

class annotations.SelfRepet.sppasrepet.sppasSelfRepet(log=None)[source]

Bases: annotations.SelfRepet.sppasbaserepet.sppasBaseRepet

SPPAS Automatic Self-Repetition Detection.

Detect self-repetitions. The result has never been validated by an expert. This annotation is performed on the basis of time-aligned tokens or lemmas. The output is made of 2 tiers with sources and echos.

__init__(log=None)[source]

Create a new sppasSelfRepet instance.

Parameters

log – (sppasLog) Human-readable logs.

get_input_pattern()[source]

Pattern this annotation expects for its input filename.

get_output_pattern()[source]

Pattern this annotation uses in an output filename.

run(input_files, output=None)[source]

Run the automatic annotation process on an input.

Parameters
  • input_files – (list of str) Time-aligned tokens

  • output – (str) the output file name

Returns

(sppasTranscription)

self_detection(tier)[source]

Self-Repetition detection.

Parameters

tier – (sppasTier)

Module contents

filename

sppas.src.annotations.SelfRepet.__init__.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Self-Repetition detection.

class annotations.SelfRepet.DataSpeaker(tokens)[source]

Bases: object

Class to store data of a speaker.

Stored data are a list of formatted unicode strings.

__init__(tokens)[source]

Create a DataSpeaker instance.

Parameters

tokens – (list) List of tokens.

get_next_word(current)[source]

Return the index of the next word in entries.

Parameters

current – (int) Current position to search for the next word

Returns

(int) Index of the next word or -1 if no next word can be found.

is_word(idx)[source]

Return True if the entry at the given index is a word.

An empty entry is not a word. Symbols (silences, laughs…) are not words. Hesitations are considered words.

Return False if the given index is invalid.

Parameters

idx – (int) Index of the entry to get

Returns

(bool)

is_word_repeated(current, other_current, other_speaker)[source]

Check whether a token is a repeated word.

Parameters
  • current – (int) From index, in current speaker

  • other_current – (int) From index, in the other speaker

  • other_speaker – (DataSpeaker) Data of the other speaker

Returns

(int) Index of the echo, or -1

class annotations.SelfRepet.SelfRepetition(stop_list=None)[source]

Bases: annotations.SelfRepet.datastructs.DataRepetition

Self-Repetition automatic detection.

Search for the sources, then find where the echos are.

__init__(stop_list=None)[source]

Create a new SelfRepetition instance.

Parameters

stop_list – (StopWords) List of irrelevant tokens.

detect(speaker, limit=10)[source]

Search for the first self-repetition in tokens.

Parameters
  • speaker – (DataSpeaker) All the data of speaker

  • limit – (int) Go no further than ‘limit’ entries in speaker data

find_echos(start, end, speaker)[source]

Find all echos of a source.

Parameters
  • start – (int) start index of the entry of the source (speaker)

  • end – (int) end index of the entry of the source (speaker)

  • speaker – (DataSpeaker) All data of speaker

Returns

DataRepetition()

static get_longest(current, speaker)[source]

Return the index of the last token of the longest repeated string.

Parameters
  • current – (int) Current index in entries of speaker data

  • speaker – (DataSpeaker) All the data of speaker

Returns

(int) Index or -1

select(start, end, speaker)[source]

Append (or not) a self-repetition.

Parameters
  • start – (int) start index of the entry of the source (speaker)

  • end – (int) end index of the entry of the source (speaker)

  • speaker – (DataSpeaker) Entries of speaker

class annotations.SelfRepet.SelfRules(stop_list=None)[source]

Bases: object

Rules to select self-repetitions.

The proposed rules deal with the number of words and the word frequencies, and distinguish whether the repetition is strict or not. The following rules are proposed for other-repetitions:

  • Rule 1: A source is accepted if it contains one or more relevant tokens. Relevance depends on the speaker producing the echo;

  • Rule 2: A source which contains at least K tokens is accepted if the repetition is strict.

Rule 1 requires a clear definition of the relevance of a token. Irrelevant tokens are stored in a stop-list. The stop-list should also contain very frequent tokens in the given language, such as adjectives, pronouns, etc.

__init__(stop_list=None)[source]

Create a SelfRules instance.

Parameters

stop_list – (sppasVocabulary or list) Irrelevant tokens.

count_relevant_tokens(start, end, speaker)[source]

Count the number of relevant words from start to end (included).

Parameters
  • start – (int) Index to start to count

  • end – (int) Index to stop to count

  • speaker – (DataSpeaker) All the data

Returns

(int)

is_relevant(idx, speaker)[source]

Check whether an entry of a speaker is relevant.

An entry is considered relevant if:

  1. It is not a silence, a pause, a laugh, a dummy or a noise;

  2. It is not in the stop-list.

Parameters
  • idx – (int) Index of the data to be checked

  • speaker – (DataSpeaker) All the data

Returns

(bool)

rule_one_token(current, speaker)[source]

Check whether one token is a self-repetition or not.

Rules are:

  • the token must be a word, and not in the stop-list;

  • the token must be repeated.

Parameters
  • current – (int) Index of the token to check

  • speaker – (DataSpeaker) All the data

Returns

(bool)

rule_syntagme(start, end, speaker)[source]

Apply Rule 1 to decide whether the selection is a repetition.

Rule 1: The selection of tokens of speaker 1 must contain at least one relevant token for speaker 2.

Parameters
  • start – (int) Index to start the selection

  • end – (int) Index to stop the selection

  • speaker – (DataSpeaker) All the data

Returns

(bool)

class annotations.SelfRepet.sppasSelfRepet(log=None)[source]

Bases: annotations.SelfRepet.sppasbaserepet.sppasBaseRepet

SPPAS Automatic Self-Repetition Detection.

Detect self-repetitions. The result has never been validated by an expert. This annotation is performed on the basis of time-aligned tokens or lemmas. The output is made of 2 tiers with sources and echos.

__init__(log=None)[source]

Create a new sppasSelfRepet instance.

Parameters

log – (sppasLog) Human-readable logs.

get_input_pattern()[source]

Pattern this annotation expects for its input filename.

get_output_pattern()[source]

Pattern this annotation uses in an output filename.

run(input_files, output=None)[source]

Run the automatic annotation process on an input.

Parameters
  • input_files – (list of str) Time-aligned tokens

  • output – (str) the output file name

Returns

(sppasTranscription)

self_detection(tier)[source]

Self-Repetition detection.

Parameters

tier – (sppasTier)