annotations.SelfRepet package¶
Submodules¶
annotations.SelfRepet.datastructs module¶
- filename
sppas.src.annotations.SelfRepet.datastructs.py
- author
Brigitte Bigi
- contact
- summary
Data structure to store a source and its echos.
- class annotations.SelfRepet.datastructs.DataRepetition(s1=None, s2=None, r1=None, r2=None)[source]¶
Bases:
object
Class to store one repetition (the source and the echos).
The source of a repetition is represented as a tuple (start, end). The echos of this latter are stored as a list of tuples (start, end).
- __init__(s1=None, s2=None, r1=None, r2=None)[source]¶
Create a DataRepetition data structure.
- Parameters
s1 – start position of the source.
s2 – end position of the source.
r1 – start position of an echo
r2 – end position of an echo
- add_echo(start, end)[source]¶
Add an entry in the list of echos.
- Parameters
start – Start position of the echo.
end – End position of the echo.
- Raises
ValueError
- set_source(start, end)[source]¶
Set the position of the source.
Setting the position of the source automatically resets the echos because it’s not correct to change the source of existing echos.
- Parameters
start – Start position of the source
end – End position of the source
- Raises
ValueError, IndexError
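The behaviour described above (one source span, a list of echo spans, and a reset of the echos whenever the source changes) can be illustrated with a minimal standalone sketch. It is not the SPPAS implementation: the class name `RepetitionSketch`, the helper `_check`, the attribute names and the exact validation conditions are assumptions made for illustration only.

```python
class RepetitionSketch:
    """Hypothetical stand-in for DataRepetition (not the SPPAS code)."""

    def __init__(self, s1=None, s2=None, r1=None, r2=None):
        self.source = None   # (start, end) tuple, or None if not set
        self.echos = []      # list of (start, end) tuples
        if s1 is not None and s2 is not None:
            self.set_source(s1, s2)
            if r1 is not None and r2 is not None:
                self.add_echo(r1, r2)

    @staticmethod
    def _check(start, end):
        # Assumed validation: positions are non-negative and ordered.
        start, end = int(start), int(end)
        if start < 0 or end < start:
            raise ValueError("invalid span: (%d, %d)" % (start, end))
        return start, end

    def set_source(self, start, end):
        self.source = self._check(start, end)
        # Existing echos belong to the previous source, so drop them.
        self.echos = []

    def add_echo(self, start, end):
        if self.source is None:
            raise ValueError("no source is set")
        span = self._check(start, end)
        if span not in self.echos:
            self.echos.append(span)


rep = RepetitionSketch(2, 4, 7, 9)
rep.add_echo(12, 14)
rep.set_source(0, 1)   # the two echos above are discarded here
```

The sketch only raises ValueError; the documented set_source may also raise IndexError, under conditions this page does not specify.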
- class annotations.SelfRepet.datastructs.DataSpeaker(tokens)[source]¶
Bases:
object
Class to store data of a speaker.
Stored data are a list of formatted unicode strings.
- get_next_word(current)[source]¶
Ask for the index of the next word in entries.
- Parameters
current – (int) Current position to search for the next word
- Returns
(int) Index of the next word or -1 if no next word can be found.
- is_word(idx)[source]¶
Return true if the entry at the given index is a word.
An empty entry is not a word. Symbols (silences, laughs…) are not words. Hesitations are considered words.
Return False if the given index is wrong.
- Parameters
idx – (int) Index of the entry to get
- Returns
(bool)
- is_word_repeated(current, other_current, other_speaker)[source]¶
Check whether the word at the given index is repeated in the entries of the other speaker.
- Parameters
current – (int) From index, in current speaker
other_current – (int) From index, in the other speaker
other_speaker – (DataSpeaker) Data of the other speaker
- Returns
index of the echo or -1
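As a rough illustration of the three methods above, the following standalone sketch mimics their documented return values. It does not reproduce the SPPAS code: the class name `SpeakerSketch`, the `SYMBOLS` set and the string-equality comparison are assumptions.

```python
# Hypothetical set of non-word symbols (silences, laughs, noises...).
SYMBOLS = {"#", "+", "*", "@", "dummy"}


class SpeakerSketch:
    """Hypothetical stand-in for DataSpeaker (not the SPPAS code)."""

    def __init__(self, tokens):
        self.entries = [t.strip() for t in tokens]

    def is_word(self, idx):
        # An out-of-range index is reported as "not a word".
        if not 0 <= idx < len(self.entries):
            return False
        entry = self.entries[idx]
        # Empty entries and symbols are rejected; hesitations pass.
        return len(entry) > 0 and entry not in SYMBOLS

    def get_next_word(self, current):
        for idx in range(current + 1, len(self.entries)):
            if self.is_word(idx):
                return idx
        return -1

    def is_word_repeated(self, current, other_current, other_speaker):
        if not self.is_word(current):
            return -1
        word = self.entries[current]
        idx = other_current
        if not other_speaker.is_word(idx):
            idx = other_speaker.get_next_word(idx)
        # Walk over the words of the other speaker, skipping non-words.
        while idx != -1:
            if other_speaker.entries[idx] == word:
                return idx
            idx = other_speaker.get_next_word(idx)
        return -1


spk = SpeakerSketch(["the", "#", "cat", "", "euh", "the", "cat"])
echo_index = spk.is_word_repeated(0, 1, spk)
```

For self-repetitions, the same object can be passed as `other_speaker`, as in the last line.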
annotations.SelfRepet.detectrepet module¶
- filename
sppas.src.annotations.SelfRepet.detectrepet.py
- author
Brigitte Bigi
- contact
- summary
Detect Self-Repetition of a speaker.
- class annotations.SelfRepet.detectrepet.SelfRepetition(stop_list=None)[source]¶
Bases:
annotations.SelfRepet.datastructs.DataRepetition
Self-Repetition automatic detection.
Search for the sources, then find where the echos are.
- __init__(stop_list=None)[source]¶
Create a new SelfRepetition instance.
- Parameters
stop_list – (StopWords) List of un-relevant tokens.
- detect(speaker, limit=10)[source]¶
Search for the first self-repetition in tokens.
- Parameters
speaker – (DataSpeaker) All the data of speaker
limit – (int) Go no further than ‘limit’ entries in the speaker data
- find_echos(start, end, speaker)[source]¶
Find all echos of a source.
- Parameters
start – (int) start index of the entry of the source (speaker)
end – (int) end index of the entry of the source (speaker)
speaker – (DataSpeaker) All data of speaker
- Returns
DataRepetition()
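The idea of locating the echos of a known source can be sketched as follows. This is a deliberately simplified illustration, not the SPPAS algorithm: the function name `find_echo_spans` is hypothetical, tokens are a plain list of strings, and an echo is assumed to be any later run of consecutive tokens that all occur in the source. The real detection also applies the selection rules and the `limit` parameter.

```python
def find_echo_spans(tokens, start, end):
    """Return (start, end) spans located after `end` whose tokens
    all occur in the source tokens[start:end + 1]."""
    source = set(tokens[start:end + 1])
    spans = []
    span_start = None
    for idx in range(end + 1, len(tokens)):
        if tokens[idx] in source:
            # Open a new span, or keep extending the current one.
            if span_start is None:
                span_start = idx
        elif span_start is not None:
            # A token outside the source closes the current span.
            spans.append((span_start, idx - 1))
            span_start = None
    if span_start is not None:
        spans.append((span_start, len(tokens) - 1))
    return spans


tokens = ["i", "went", "to", "the", "shop", "yes",
          "the", "shop", "and", "to", "paris"]
echos = find_echo_spans(tokens, 2, 4)   # source is "to the shop"
```

Each returned tuple has the same (start, end) shape as the echos stored by DataRepetition.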
annotations.SelfRepet.rules module¶
- filename
sppas.src.annotations.SelfRepet.rules.py
- author
Brigitte Bigi
- contact
- summary
Self-Repetition rules to accept/reject a candidate.
- class annotations.SelfRepet.rules.SelfRules(stop_list=None)[source]¶
Bases:
object
Rules to select self-repetitions.
Proposed rules deal with the number of words and the word frequencies, and distinguish whether the repetition is strict or not. The following rules are proposed for other-repetitions:
- Rule 1: A source is accepted if it contains one or more relevant tokens. Relevance depends on the speaker producing the echo;
- Rule 2: A source which contains at least K tokens is accepted if the repetition is strict.
Rule number 1 needs a clear definition of the relevance of a token. Un-relevant tokens are then stored in a stop-list. The stop-list should also contain very frequent tokens in the given language, like adjectives, pronouns, etc.
- __init__(stop_list=None)[source]¶
Create a SelfRules instance.
- Parameters
stop_list – (sppasVocabulary or list) Un-relevant tokens.
- count_relevant_tokens(start, end, speaker)[source]¶
Count the number of relevant words from start to end (included).
- Parameters
start – (int) Index to start to count
end – (int) Index to stop to count
speaker – (DataSpeaker) All the data
- Returns
(int)
- is_relevant(idx, speaker)[source]¶
Check whether an entry of a speaker is relevant or not.
An entry is considered relevant if:
It is not a silence, a pause, a laugh, dummy or a noise;
It is not in the stop-list.
- Parameters
idx – (int) Index of the data to be checked
speaker – (DataSpeaker) All the data
- Returns
(bool)
- rule_one_token(current, speaker)[source]¶
Check whether one token is a self-repetition or not.
Rules are:
the token must be a word, and not in the stop-list;
the token must be repeated.
- Parameters
current – (int) Index of the token to check
speaker – (DataSpeaker) All the data
- Returns
(bool)
- rule_syntagme(start, end, speaker)[source]¶
Apply rule 1 to decide if selection is a repetition or not.
Rule 1: The selection of tokens of speaker 1 must contain at least one relevant token for speaker 2.
- Parameters
start – (int) Index to start the selection
end – (int) Index to stop the selection
speaker – (DataSpeaker) All the data
- Returns
(bool)
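The relevance test and the rules built on it can be sketched with plain functions. This is an illustration under stated assumptions, not the SPPAS code: the functions take a token list and a stop-list directly instead of a DataSpeaker, and the `SYMBOLS` set is hypothetical.

```python
# Hypothetical set of non-word symbols (silence, pause, laugh, noise, dummy).
SYMBOLS = {"#", "+", "@", "*", "dummy"}


def is_relevant(idx, tokens, stop_list):
    """An entry is relevant if it is a word and not in the stop-list."""
    if not 0 <= idx < len(tokens):
        return False
    token = tokens[idx]
    return bool(token) and token not in SYMBOLS and token not in stop_list


def count_relevant_tokens(start, end, tokens, stop_list):
    """Count relevant entries from start to end (included)."""
    return sum(1 for i in range(start, end + 1)
               if is_relevant(i, tokens, stop_list))


def rule_syntagme(start, end, tokens, stop_list):
    """Rule 1: the selection must hold at least one relevant token."""
    return count_relevant_tokens(start, end, tokens, stop_list) >= 1


def rule_one_token(current, tokens, stop_list):
    """A single token is kept if it is relevant and occurs again later."""
    if not is_relevant(current, tokens, stop_list):
        return False
    return tokens[current] in tokens[current + 1:]


tokens = ["the", "cat", "#", "the", "cat", "sat"]
stop_list = {"the"}
accepted = rule_syntagme(0, 1, tokens, stop_list)   # "the cat" holds "cat"
```

Keeping the relevance test in one function means the stop-list and the symbol set are consulted in a single place by every rule.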
annotations.SelfRepet.sppasbaserepet module¶
- filename
sppas.src.annotations.SelfRepet.sppasbaserepet.py
- author
Brigitte Bigi
- contact
- summary
Base class for SPPAS integration of repetitions detection.
- class annotations.SelfRepet.sppasbaserepet.sppasBaseRepet(config, log=None)[source]¶
Bases:
annotations.baseannot.sppasBaseAnnotation
SPPAS Automatic Any-Repetition Detection.
- __init__(config, log=None)[source]¶
Create a new sppasBaseRepet instance.
Log is used for a better communication of the annotation process and its results. If None, logs are redirected to the default logging system.
- Parameters
config – (str) Name of the JSON configuration file, without path.
log – (sppasLog) Human-readable logs.
- load_resources(lang_resources, lang=None)[source]¶
Load a list of stop-words and replacements.
Override the existing loaded lists…
- Parameters
lang_resources – (str) File with extension ‘.stp’ or ‘.lem’ or nothing
lang – (str)
- make_stop_words(tier)[source]¶
Return a tier indicating if entries are stop-words.
- Parameters
tier – (sppasTier) Time-aligned tokens.
- make_word_strain(tier)[source]¶
Return a tier with modified tokens.
- Parameters
tier – (sppasTier) Time-aligned tokens.
- set_alpha(alpha)[source]¶
Fix the alpha option.
Alpha is a coefficient to add specific stop-words in the list.
- Parameters
alpha – (float)
annotations.SelfRepet.sppasrepet module¶
- filename
sppas.src.annotations.SelfRepet.sppasrepet.py
- author
Brigitte Bigi
- contact
- summary
SPPAS integration of Self-Repetition automatic annotation.
- class annotations.SelfRepet.sppasrepet.sppasSelfRepet(log=None)[source]¶
Bases:
annotations.SelfRepet.sppasbaserepet.sppasBaseRepet
SPPAS Automatic Self-Repetition Detection.
Detect self-repetitions. The result has never been validated by an expert. This annotation is performed on the basis of time-aligned tokens or lemmas. The output is made of 2 tiers with sources and echos.
- __init__(log=None)[source]¶
Create a new sppasSelfRepet instance.
- Parameters
log – (sppasLog) Human-readable logs.
Module contents¶
- filename
sppas.src.annotations.SelfRepet.__init__.py
- author
Brigitte Bigi
- contact
- summary
Self-Repetition detection.
- class annotations.SelfRepet.DataSpeaker(tokens)[source]¶
Bases:
object
Class to store data of a speaker.
Stored data are a list of formatted unicode strings.
- get_next_word(current)[source]¶
Ask for the index of the next word in entries.
- Parameters
current – (int) Current position to search for the next word
- Returns
(int) Index of the next word or -1 if no next word can be found.
- is_word(idx)[source]¶
Return true if the entry at the given index is a word.
An empty entry is not a word. Symbols (silences, laughs…) are not words. Hesitations are considered words.
Return False if the given index is wrong.
- Parameters
idx – (int) Index of the entry to get
- Returns
(bool)
- is_word_repeated(current, other_current, other_speaker)[source]¶
Check whether the word at the given index is repeated in the entries of the other speaker.
- Parameters
current – (int) From index, in current speaker
other_current – (int) From index, in the other speaker
other_speaker – (DataSpeaker) Data of the other speaker
- Returns
index of the echo or -1
- class annotations.SelfRepet.SelfRepetition(stop_list=None)[source]¶
Bases:
annotations.SelfRepet.datastructs.DataRepetition
Self-Repetition automatic detection.
Search for the sources, then find where the echos are.
- __init__(stop_list=None)[source]¶
Create a new SelfRepetition instance.
- Parameters
stop_list – (StopWords) List of un-relevant tokens.
- detect(speaker, limit=10)[source]¶
Search for the first self-repetition in tokens.
- Parameters
speaker – (DataSpeaker) All the data of speaker
limit – (int) Go no further than ‘limit’ entries in the speaker data
- find_echos(start, end, speaker)[source]¶
Find all echos of a source.
- Parameters
start – (int) start index of the entry of the source (speaker)
end – (int) end index of the entry of the source (speaker)
speaker – (DataSpeaker) All data of speaker
- Returns
DataRepetition()
- class annotations.SelfRepet.SelfRules(stop_list=None)[source]¶
Bases:
object
Rules to select self-repetitions.
Proposed rules deal with the number of words and the word frequencies, and distinguish whether the repetition is strict or not. The following rules are proposed for other-repetitions:
- Rule 1: A source is accepted if it contains one or more relevant tokens. Relevance depends on the speaker producing the echo;
- Rule 2: A source which contains at least K tokens is accepted if the repetition is strict.
Rule number 1 needs a clear definition of the relevance of a token. Un-relevant tokens are then stored in a stop-list. The stop-list should also contain very frequent tokens in the given language, like adjectives, pronouns, etc.
- __init__(stop_list=None)[source]¶
Create a SelfRules instance.
- Parameters
stop_list – (sppasVocabulary or list) Un-relevant tokens.
- count_relevant_tokens(start, end, speaker)[source]¶
Count the number of relevant words from start to end (included).
- Parameters
start – (int) Index to start to count
end – (int) Index to stop to count
speaker – (DataSpeaker) All the data
- Returns
(int)
- is_relevant(idx, speaker)[source]¶
Check whether an entry of a speaker is relevant or not.
An entry is considered relevant if:
It is not a silence, a pause, a laugh, dummy or a noise;
It is not in the stop-list.
- Parameters
idx – (int) Index of the data to be checked
speaker – (DataSpeaker) All the data
- Returns
(bool)
- rule_one_token(current, speaker)[source]¶
Check whether one token is a self-repetition or not.
Rules are:
the token must be a word, and not in the stop-list;
the token must be repeated.
- Parameters
current – (int) Index of the token to check
speaker – (DataSpeaker) All the data
- Returns
(bool)
- rule_syntagme(start, end, speaker)[source]¶
Apply rule 1 to decide if selection is a repetition or not.
Rule 1: The selection of tokens of speaker 1 must contain at least one relevant token for speaker 2.
- Parameters
start – (int) Index to start the selection
end – (int) Index to stop the selection
speaker – (DataSpeaker) All the data
- Returns
(bool)
- class annotations.SelfRepet.sppasSelfRepet(log=None)[source]¶
Bases:
annotations.SelfRepet.sppasbaserepet.sppasBaseRepet
SPPAS Automatic Self-Repetition Detection.
Detect self-repetitions. The result has never been validated by an expert. This annotation is performed on the basis of time-aligned tokens or lemmas. The output is made of 2 tiers with sources and echos.
- __init__(log=None)[source]¶
Create a new sppasSelfRepet instance.
- Parameters
log – (sppasLog) Human-readable logs.