Automatic Annotations

Introduction

The kind of process to implement in order to obtain rich and broad-coverage multimodal/multi-level annotations of a corpus is illustrated in the next Figure, which describes the steps of the annotation workflow. This Figure must be read from top to bottom and from left to right, starting with the recordings and ending with the analysis of annotated files. Yellow boxes represent manual annotations, blue boxes represent automatic ones. This Figure is simplified: there are other ways to construct a corpus, but this one is the best solution to get effective automatic annotations with SPPAS.

Annotation methodology

After recording an audio file (see recording recommendations), the first annotation to perform is the search for the IPUs - Inter-Pausal Units. IPUs are sounding segments surrounded by silent pauses of more than X ms, time-aligned on the speech signal.

An orthographic transcription (OT) has to be performed manually inside these IPUs. Using an Enriched OT is a better idea - see the SPPAS transcription convention. The Text Normalization automatic annotation then normalizes the orthographic transcription of each IPU. Phonetization converts the normalized text into a set of possible pronunciations using the X-SAMPA standard. Alignment performs segmentation at the phoneme and token levels, etc.

At the end of each automatic annotation process, SPPAS produces a Procedure Outcome Report. It contains important information about the annotations: all the parameters, and any warnings and errors that occurred during the annotation process. This window opens so that users actually read it (!) and its content should be saved with the annotated corpus.

SPPAS annotation workflow

Annotations of SPPAS are categorized as:

  • STANDALONE: they require the files of a single speaker;
  • SPEAKER: they require the files of a single speaker at two different moments in time, for example before sleeping versus after sleeping, the morning versus the afternoon, 10 years ago versus nowadays;
  • INTERACTION: they require the files of two different speakers who are supposed to interact in the same conversation (i.e., the files have the same duration).

All 23 automatic annotations of the STANDALONE type are illustrated in the Figure at the top of the page. This Figure can also be downloaded at https://sppas.org/etc/figures/workflow_standalone.pdf and is included in the documentation folder of the SPPAS package. It shows the detailed process to perform annotations in a suitable way.

Workflow of the STANDALONE annotations

The best way to read this Figure is to search for the annotation result you want and to follow the arrows that lead to it. They represent all the annotations you’ll have to ask SPPAS to perform, and that you can manually check or customize to make it YOUR SPPAS Solution. Two examples of SPPAS solutions are available on the website: https://sppas.org/workdemo.html.

This chapter describes each annotation box.

Recordings

SPPAS performs automatic annotations: it does not make sense to hope for miracles, but you can expect good enough results that will allow you to save your precious time! And it begins by taking care of the recordings.

Audio files

Only the wav and au audio file formats are supported by Python, and therefore by SPPAS.

Only mono audio files are supported by automatic annotations of SPPAS.

SPPAS verifies whether the audio file has a 16-bit sample width and a 16,000 Hz frame rate; otherwise it automatically creates a new converted audio file. For very long files, this process may take time. If Python can’t read the audio file, an error message is displayed: you’ll have to convert it with Audacity, Praat… A relatively good recording quality is expected (see next Figure).

For example, both Search for IPUs and Fill in IPUs require a better quality than what is expected by Alignment. For the latter, it depends on the language. The quality of the automatic annotation results highly depends on the quality of the audio file.

Example of expected recorded speech

Providing a guideline or recommendation of good practices is impossible, because it depends on too many factors. However, the following points are obvious:

  • Never record a lossy audio file (like, for example, with a smartphone). Moreover, extracting the audio file from a video is only possible if the embedded audio is either lossless or not compressed: see this page https://en.wikipedia.org/wiki/Comparison_of_video_container_formats.
  • The better the microphone, the better the audio file! Using a headworn microphone is much better than a clip-on one. At LPL, we get very good results with the AKG C520.
  • The recorded volume must be high enough. Ideally, it should be in the range [-0.5 ; 0.5]. If all the amplitude values are in the range [-0.1 ; 0.1], the difference between speech and silence is very slight, and it makes the search for silences challenging (a minimal check is sketched after this list).
  • The audio file should not be 32-bit float. For speech, 32 bits are totally useless and - worse - sometimes Python can’t read them.
  • As you probably don’t plan to burn your audio file on a CD-ROM, a 44,100 Hz frame rate does not make sense. 48,000 Hz is a more reasonable choice, particularly because it doesn’t need elaborate interpolation methods when it’s converted to 16,000 Hz for automatic annotations.
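
To make these recommendations concrete, here is a minimal sketch - not part of SPPAS - that checks an audio file against them, using only the Python standard library. The file name is just an example taken from the SPPAS samples.

import wave
import array

with wave.open("samples/samples-eng/oriana1.wav", "rb") as w:
    n_channels = w.getnchannels()   # automatic annotations expect 1 (mono)
    samp_width = w.getsampwidth()   # expected: 2 bytes, i.e. 16 bits
    frame_rate = w.getframerate()   # converted to 16,000 Hz internally
    frames = w.readframes(w.getnframes())

print("channels:", n_channels, "| sample width:", 8 * samp_width,
      "bits | frame rate:", frame_rate, "Hz")

if samp_width == 2:
    samples = array.array("h", frames)             # signed 16-bit samples
    peak = max(abs(s) for s in samples) / 32768.   # normalized into [0 ; 1]
    print("peak amplitude: %.2f" % peak)
    if peak < 0.1:
        print("Warning: very low volume; the search for silences may fail.")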

Video files

SPPAS proposes a few automatic annotations of a video if the Python library opencv is installed. All of them annotate the face of the recorded people.

File formats and tier names

When annotating with the GUI, the filename of each annotation is fixed and can’t be customized. A filename is made of a root, followed by a pattern, then an extension. For example, oriana1-palign.TextGrid is made of the root oriana1, the pattern -palign and the extension .TextGrid. Each annotation allows fixing the pattern manually and choosing the extension among the list of supported ones. Notice that the pattern must start with the - (minus) character. It means that the character - must only be used to separate the root from the pattern:

The character - can’t be used in the root of a filename.
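
As an illustration, the following minimal sketch - not the SPPAS implementation - splits such a filename into its three parts, relying on the convention that the first - starts the pattern:

import os

def split_annotation_filename(filename):
    """Return the (root, pattern, extension) parts of an annotated file name."""
    base, ext = os.path.splitext(filename)     # 'oriana1-palign', '.TextGrid'
    root, sep, pattern = base.partition("-")   # the first '-' starts the pattern
    return root, (sep + pattern if sep else ""), ext

print(split_annotation_filename("oriana1-palign.TextGrid"))
# ('oriana1', '-palign', '.TextGrid')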

The names of the tiers the annotations expect for their input are fixed, and they can't be changed; the same applies to the created tier names.

File extensions are case-sensitive: use TextGrid (Praat) instead of textgrid.

Resources required to annotate

The resources that are required to perform annotations are of two different types:

  1. models, like a face model that is required to perform face detection,
  2. language resources, like a pronunciation dictionary. English and French resources are distributed in the SPPAS package. Resources for other languages can be installed at any time with the setup, see installation instructions.

All the automatic annotations proposed by SPPAS are designed with language-independent algorithms, but some annotations require language knowledge. This linguistic knowledge is represented in external files, so it can be added, edited or removed easily.

Adding a new language for a given annotation only consists in adding the linguistic resources the annotation needs, like lexicons, dictionaries, models, sets of rules, etc. For example, see:

Mélanie Lancien, Marie-Hélène Côté, Brigitte Bigi (2020). Developing Resources for Automated Speech Processing of Québec French. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 5323–5328, Marseille, France.

Brigitte Bigi, Bernard Caron, Abiola S. Oyelere (2017). Developing Resources for Automated Speech Processing of the African Language Naija (Nigerian Pidgin). In 8th Language and Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 441-445, Poznań, Poland.

Download and install linguistic resources

Since June 2020, the linguistic resources and models for some annotations are no longer distributed in the SPPAS package. Instead, they are hosted on the Ortolang repository with public access.

They can be installed automatically into SPPAS by the preinstall.py program (CLI) or in the GUI by clicking Add languages or Add annotations in the toolbar of the Annotate page.

They can also be installed manually by downloading them at: https://hdl.handle.net/11403/sppasresources and unpacking the zip file into the resources folder of SPPAS package.

A full description of such resources and how to install them is available in the repository: download and read the file Documentation.pdf. It contains details about the list of phonemes, authors, licenses, etc.

New language support

Some of the annotations require external linguistic resources in order to work efficiently on a given language: Text Normalization requires a lexicon, Phonetization requires a pronunciation dictionary, etc. It is possible either to install and use the existing resources, or to create and use custom ones.

When executing SPPAS, the list of available languages for each annotation is dynamically created by exploring the resources directory content. This means that:

  • the resources you added or modified are automatically taken into account (i.e., there’s no need to modify the program itself);
  • SPPAS needs to be re-started if new resources were added while it was already running (a discovery sketch follows the list).
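
As an illustration of this discovery mechanism, here is a minimal sketch - assuming, as Text Normalization does, one .vocab lexicon per language in the vocab folder, named with the iso639-3 code of the language:

import os

def available_languages(resources_dir="resources"):
    """Return the language codes found in the vocab folder."""
    vocab_dir = os.path.join(resources_dir, "vocab")
    return sorted(os.path.splitext(name)[0]
                  for name in os.listdir(vocab_dir)
                  if name.endswith(".vocab"))

print(available_languages())   # e.g. ['cat', 'cmn', ..., 'eng', 'fra', ...]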

Annotate with the GUI

Performing automatic annotations with SPPAS Graphical User Interface is a step-by-step process.

It starts by checking the list of paths and/or roots and/or files in the currently active workspace of the Files page. Then, in the Annotate page:

  1. Select the output file format, i.e., the file format of the files SPPAS will create;
  2. Select a language in the list;
  3. Enable each annotation to perform by clicking on the button in red, among the STANDALONE, SPEAKER and INTERACTION annotation types. Each button turns green if some annotations are selected.
    • 3.1 Configure each annotation by clicking on the Configure… link text in blue;
    • 3.2 The language of any annotation can be changed.
  4. Click on the Perform annotations button, and wait. A progress bar should indicate the annotation steps and files. Some annotations are very fast, but some others are not. For example, Face Detection runs at about 2.5 × real time, i.e., annotating a video of 1 minute will take 2 minutes 30 secs.
  5. It is important to read the Procedure Outcome report. It allows checking that everything happened normally during the automatic annotations. This report is saved in the logs folder of the SPPAS package.

Annotate with the CLI

To perform automatic annotations with the Command-line User Interface, there is a main program, annotation.py. This program allows annotating in an easy and fast way, but none of the annotations can be configured: their default parameters are used. This program performs automatic annotations on a given file or on all files of a directory. It strictly corresponds to the Perform annotations button of the GUI, except that annotations are pre-configured: no specific option can be specified.

usage: python .\sppas\bin\annotation.py -I file|folder [options]
optional arguments:
        -h, --help       show this help message and exit
        --log file       File name for a Procedure Outcome Report (default: None)
        --momel          Activate Momel
        --intsint        Activate INTSINT
        --fillipus       Activate Fill in IPUs
        --searchipus     Activate Search for IPUs
        --textnorm       Activate Text Normalization
        --phonetize      Activate Phonetization
        --alignment      Activate Alignment
        --syllabify      Activate Syllabification
        --tga            Activate Time Group Analysis
        --activity       Activate Activity
        --rms            Activate RMS
        --selfrepet      Activate Self-Repetitions
        --stopwords      Activate Stop Tags
        --lexmetric      Activate LexMetric
        --otherrepet     Activate Other-Repetitions
        --reoccurrences  Activate Re-Occurrences
        --merge          Create a merged file with all the annotations
      Files:
        -I file|folder   Input transcription file name (append).
        -l lang          Language code (iso639-3). One of: por eng ita kor deu nan
                         vie und hun spa cat pol yue fra pcm yue_chars cmn jpn.
        -e .ext          Output file extension. One of: .xra .TextGrid .eaf .csv
                         .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx .arff .xrff

Examples of use:

./sppas/bin/annotation.py -I .\samples\samples-eng
                                -l eng
                                -e .TextGrid
                                --fillipus --textnorm --phonetize --alignment

A progress bar is displayed for each annotation if the terminal supports it (bash, for example). Otherwise, the progress is indicated line by line (Windows PowerShell, for example).

CLI: annotation.py output example

Each annotation also has its own program, in which all options can be fixed. They are all located in the sppas/bin folder.

The procedure outcome report

It is crucial to read this report conscientiously: it describes exactly what happened during the automatic annotation process. It is recommended to store a copy of the report within the corpus, because it contains information that is interesting to know for anyone using the annotations.

By default, all reports are saved in the logs folder of the SPPAS package.

The text first indicates the version of SPPAS that was used. This information is very important. Annotations in SPPAS and their related resources are regularly improved, so the result of the automatic process can change from one version to another.

Example:

SPPAS version 3.5
      Copyright (C) 2011-2021 Brigitte Bigi
      Web site: https://sppas.org/
      Contact: Brigitte Bigi (contact@sppas.org)

Secondly, the text shows information related to the given input:

  1. the selected language of each annotation, only if the annotation is language-dependent. For some language-dependent annotations, SPPAS can still perform the annotation even if the resources for a given language are not available: in that case, select und, which is the iso639-3 code for undetermined.
  2. the selected files and folder to be annotated.
  3. the list of annotations, and whether each one was enabled. Here, enabled means that the checkbox of the annotation was checked by the user and that the resources are available for the given language. On the contrary, disabled means that either the checkbox was not checked or the required resources are not available.
  4. the file format of the resulting files.

Example:

Date: 2020-04-21T11:14:01+02:00
      Input languages:
        - Momel: ---
        - INTSINT: ---
        - Fill in IPUs: ---
        - Search for IPUs: ---
        - Text Normalization: eng
        - Phonetization: eng
        - Alignment: eng
        - Syllabification:
        - Time Group Analysis: ---
        - Activity: ---
        - RMS: ---
        - Self-Repetitions:
        - Stop Tags:
        - LexMetric: ---
        - Other-Repetitions:
        - Re-Occurrences: ---
      Selected files and folders:
        - oriana1.wav
      Selected annotations:
        - Momel: enabled
        - INTSINT: enabled
        - Fill in IPUs: enabled
        - Search for IPUs: disabled
        - Text Normalization: enabled
        - Phonetization: enabled
        - Alignment: enabled
        - Syllabification: disabled
        - Time Group Analysis: disabled
        - Activity: disabled
        - RMS: disabled
        - Self-Repetitions: disabled
        - Stop Tags: disabled
        - LexMetric: disabled
        - Other-Repetitions: disabled
        - Re-Occurrences: disabled
      File extension: .xra
      

Thirdly, each automatic annotation is described in detail, for each annotated file. At the first stage, the list of options and their values is summarized. Example:

                        Text Normalization
      The vocabulary contains 121250 tokens.
      The replacement dictionary contains 8 items.
      Options:
       ... inputpattern:
       ... outputpattern: -token
       ... faked: True
       ... std: False
       ... custom: False
       ... occ_dur: True
      

Then, a diagnosis of the given file is printed. It can be:

  1. Valid: the file is relevant;
  2. Admit: the file is not as expected, but SPPAS will convert it and work on the converted file;
  3. Invalid: SPPAS can’t work with that file. The annotation is then disabled.

In cases 2 and 3, a message indicates the origin of the problem.

The annotation procedure then prints messages, if any. Four levels of information must draw your attention:

  1. [ OK ] means that everything happened normally. The annotation was performed successfully.
  2. [ IGNORE ] means that SPPAS ignored the file and didn’t do anything.
  3. [ WARNING ] means that something happened abnormally, but SPPAS found a solution, and the annotation was performed anyway.
  4. [ ERROR ] means that something happened abnormally and SPPAS failed to find a solution. The annotation was either not performed, or performed with a wrong result.

Example of an Ignore message:

 ...  ... Export AP_track_0711.TextGrid
       ...  ... into AP_track_0711.xra
       ...  ... [ IGNORE  ] because a previous segmentation is existing.

Example of a Warning message:

 ...  ... [ WARNING  ] chort- is missing of the dictionary and was
                             automatically phonetized as S-O/-R-t

At the end of the report, the Result statistics section mentions the number of files that were annotated for each annotation, or -1 if the annotation was disabled.

Orthographic Transcription

An orthographic transcription is often the minimum requirement for a speech corpus, so it is at the top of the annotation procedure, and it is the entry point for most of the automatic annotations. A transcription convention is designed to provide rules for writing speech corpora. This convention establishes which phenomena have to be transcribed and also how to mention them in the orthography.

From the beginning of its development it was considered to be essential for SPPAS to deal with an Enriched Orthographic Transcription (EOT). The transcription convention is summarized below and all details are given in the file TOE-SPPAS.pdf, available in the documentation folder. It indicates the rules and includes examples of what is expected or recommended.

Convention overview:

  • truncated words, noted as a - at the end of the token string (an ex- example);
  • noises, noted by a * (not available for some languages);
  • laughter, noted by a @ (not available for some languages);
  • short pauses, noted by a +;
  • elisions, mentioned in parentheses;
  • specific pronunciations, noted with brackets [example,eczap];
  • comments are preferably noted inside braces {this is a comment!};
  • comments can be noted inside brackets without using comma [this and this];
  • liaisons, noted between = (this =n= example);
  • morphological variants, noted with <ice scream,I scream>;
  • proper name annotation, like $ John S. Doe $.

The symbols * + @ must be surrounded by whitespace.

SPPAS allows including the regular punctuation. For some languages, it also allows including numbers: they will be automatically converted to their written form during the Text Normalization process.

From this EOT, several derived transcriptions can be generated automatically, including the following two:

  1. the standard transcription: the list of orthographic tokens (optional);
  2. the faked transcription (the default): a specific transcription from which the phonetic tokens are obtained, to be used by the grapheme-phoneme converter.

For example, with the transcribed sentence: This [is,iz] + hum… an enrich(ed) transcription {loud} number 1!, the derived transcriptions are (a derivation sketch follows the example):

  1. standard: this is + hum an enriched transcription number one
  2. tokens: this iz + hum an enrich transcription number one
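
Here is a minimal regex sketch - not the SPPAS implementation - of how these two transcriptions can be derived for the conventions used in this example ({comments}, [ortho,phon] brackets and (elisions)); the later normalization steps, such as lowercasing and number conversion, are left out:

import re

def derive(eot):
    text = re.sub(r"\{[^}]*\}", "", eot)                 # drop {comments}
    std = re.sub(r"\[([^],]*),[^]]*\]", r"\1", text)     # [is,iz] -> is
    faked = re.sub(r"\[[^],]*,([^]]*)\]", r"\1", text)   # [is,iz] -> iz
    std = std.replace("(", "").replace(")", "")          # enrich(ed) -> enriched
    faked = re.sub(r"\([^)]*\)", "", faked)              # enrich(ed) -> enrich
    return " ".join(std.split()), " ".join(faked.split())

std, faked = derive("This [is,iz] + hum an enrich(ed) transcription {loud}")
print(std)    # This is + hum an enriched transcription
print(faked)  # This iz + hum an enrich transcription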

Notice that the convention allows including a wide range of phenomena, most of which are optional. As a minimum, the transcription must include:

  • filled pauses;
  • short pauses;
  • repeats;
  • noises and laugh items (not available for Japanese and Cantonese).

Finally, note that this convention is not software-dependent. The orthographic transcription can be performed manually within the SPPAS GUI in the Edit page, with Praat, with Annotation Pro, Audacity, …

Search for Inter-Pausal Units (IPUs)

Overview

The Search for IPUs is a semi-automatic annotation process. This segmentation provides an annotated file with one tier named IPUs. The silence intervals are labeled with the # symbol, and IPU intervals are labeled with ipu_ followed by the IPU number. Being semi-automatic, this annotation should be verified manually.

Notice that the better the recording quality, the better the IPUs segmentation.

The parameters

The following parameters must be properly fixed:

  • Minimum volume value (an rms value): If this value is set to zero, the minimum volume is automatically adjusted for each sound file. Try it first; then, if the automatic value is not correct, set it manually. The Procedure Outcome Report indicates the value the system chose. The AudioRoamer component can also be of great help: it indicates the min, max and mean volume values of the sound.
  • Minimum silence duration (in seconds): By default, this is fixed to 0.2 sec. This duration mostly depends on the language. It is commonly fixed to at least 0.2 sec for French and at least 0.25 sec for English.
  • Minimum speech duration (in seconds): By default, this value is fixed to 0.3 sec. A relevant value depends on the speech style: for isolated sentences, 0.5 sec is probably better, but it should be about 0.1 sec for spontaneous speech.
  • IPUs boundary shift (in seconds) for start or end: a duration which is systematically added at the IPU boundaries to enlarge the IPU intervals and, as a consequence, reduce the neighboring silences.

The procedure outcome report indicates the values (volume, minimum durations) that were used by the system for each sound file.
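
The following minimal sketch illustrates the underlying idea - it is not the SPPAS algorithm itself: the rms is estimated on fixed windows of the signal, and every run of windows under the volume threshold lasting at least the minimum silence duration is marked as a silence.

import math

def detect_silences(samples, rate, threshold, win=0.020, min_sil=0.200):
    """Return the (start, end) times of silences in a list of int samples."""
    n = int(win * rate)
    rms = [math.sqrt(sum(s * s for s in samples[i:i + n]) / n)
           for i in range(0, len(samples) - n, n)]
    silences, start = [], None
    for i, value in enumerate(rms):
        if value < threshold and start is None:
            start = i * win                    # a silent run begins
        elif value >= threshold and start is not None:
            if i * win - start >= min_sil:     # long enough to be a silence
                silences.append((start, i * win))
            start = None
    if start is not None:
        silences.append((start, len(rms) * win))
    return silences

The IPUs are then the segments between these silences, enlarged by the boundary shift values.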

Perform Search for IPUs with the GUI

It is an annotation of STANDALONE type.

Click on the Search IPUs activation button and on the Configure… blue text to fix options.

Example of result

Notice that the speech segments can be transcribed using SPPAS, in the Analyze page.

Orthographic transcription based on IPUs

Perform Search for IPUs with the CLI

searchipus.py is the program to perform this semi-automatic annotation, i.e. silence/IPUs segmentation, either on a single file (-i and optionally -o) or on a set of files (by using -I and optionally -e).

Usage

searchipus.py [files] [options]
      Search for IPUs: Search for Inter-Pausal Units in an audio file.
      optional arguments:
        -h, --help            show this help message and exit
        --quiet               Disable the verbosity
        --log file            File name for a Procedure Outcome Report (default: None)
      Files (manual mode):
        -i file               Input wav file name.
        -o file               Annotated file with silences/units segmentation
                              (default: None)
      Files (auto mode):
        -I file               Input wav file name (append).
        -e .ext               Output file extension. One of: .xra .TextGrid .eaf
                              .csv .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx
                              .arff .xrff
      Options:
        --outputpattern OUTPUTPATTERN
                              Output file pattern (default: )
        --win_length WIN_LENGTH
                              Window size to estimate rms (in seconds) (default:
                              0.020)
        --threshold THRESHOLD
                              Threshold of the volume value (rms) for the detection
                              of silences, 0=automatic (default: 0)
        --min_ipu MIN_IPU     Minimum duration of an IPU (in seconds) (default:
                              0.300)
        --min_sil MIN_SIL     Minimum duration of a silence (in seconds) (default:
                              0.200)
        --shift_start SHIFT_START
                              Systematically move at left the boundary of the
                              beginning of an IPU (in seconds) (default: 0.01)
        --shift_end SHIFT_END
                              Systematically move at right the boundary of the end
                              of an IPU (in seconds) (default: 0.02)
      This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte
      Bigi. Contact the author at: contact@sppas.org

Examples of use

A single input file and output on stdout:

python .\sppas\bin\searchipus.py -i .\samples\samples-eng\oriana1.wav
          2018-12-19 10:49:32,782 [INFO] Logging set up level=15
          2018-12-19 10:49:32,790 [INFO]  ... Information:
          2018-12-19 10:49:32,792 [INFO]  ... ... Number of IPUs found:       3
          2018-12-19 10:49:32,792 [INFO]  ... ... Threshold volume value:     0
          2018-12-19 10:49:32,792 [INFO]  ... ... Threshold silence duration: 0.200
          2018-12-19 10:49:32,792 [INFO]  ... ... Threshold speech duration:  0.300
          0.000000 1.675000 #
          1.675000 4.580000 ipu_1
          4.580000 6.390000 #
          6.390000 9.880000 ipu_2
          9.880000 11.430000 #
          11.430000 14.740000 ipu_3
          14.740000 17.792000 #

Idem without logs:

python .\sppas\bin\searchipus.py -i .\samples\samples-eng\oriana1.wav --quiet
          0.000000 1.675000 #
          1.675000 4.580000 ipu_1
          4.580000 6.390000 #
          6.390000 9.880000 ipu_2
          9.880000 11.430000 #
          11.430000 14.740000 ipu_3
          14.740000 17.792000 #

Several input files, output in Praat-TextGrid file format:

python .\sppas\bin\searchipus.py -I .\samples\samples-eng\oriana1.wav \
       -I .\samples\samples-eng\oriana3.wave -e .TextGrid
          2018-12-19 10:48:16,520 [INFO] Logging set up level=15
          2018-12-19 10:48:16,522 [INFO] File oriana1.wav: Valid.
          2018-12-19 10:48:16,532 [INFO]  ... Information:
          2018-12-19 10:48:16,532 [INFO]  ... ... Number of IPUs found:       3
          2018-12-19 10:48:16,532 [INFO]  ... ... Threshold volume value:     0
          2018-12-19 10:48:16,532 [INFO]  ... ... Threshold silence duration: 0.200
          2018-12-19 10:48:16,533 [INFO]  ... ... Threshold speech duration:  0.300
          2018-12-19 10:48:16,538 [INFO]  ... E:\bigi\Projets\sppas\samples\samples-eng\oriana1.TextGrid
          2018-12-19 10:48:16,538 [INFO] File oriana3.wave: Invalid.
          2018-12-19 10:48:16,539 [ERROR]  ... ... An audio file with only one channel is expected. Got 2
     channels.
          2018-12-19 10:48:16,540 [INFO]  ... No file was created.

Fill in Inter-Pausal Units (IPUs)

Overview

This automatic annotation consists in aligning macro-units of a document with the corresponding sound. This segmentation provides an annotated file with one tier named Transcription.

IPUs are blocks of speech bounded by silent pauses of more than X ms. This annotation searches for a silences/IPUs segmentation of a recorded file (see the previous section) and fills in the IPUs with the transcription given in a txt file.

How does it work

SPPAS identifies silent pauses in the signal and attempts to align them with the units proposed in the transcription file, under the assumption that each such unit is separated by a silent pause. It is based on the search for silences described in the previous section, but in this case, the number of units to find is known. The system automatically adjusts the volume threshold and the minimum durations of silences/IPUs to get the right number of units. The content of the units does not matter, because SPPAS does not interpret it: it can be the orthographic transcription, a translation, numbers, … This algorithm is language-independent: it can work on any language.
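
Here is a minimal sketch of this adaptive strategy - not the actual SPPAS search - assuming the detect_silences() function sketched in the previous section is available: the volume threshold is raised step by step until the segmentation yields exactly the expected number of units.

def fill_in_ipus(samples, rate, n_units, min_sil=0.200):
    """Search for a threshold giving exactly n_units IPUs."""
    for threshold in range(100, 5000, 100):    # candidate rms thresholds
        silences = detect_silences(samples, rate, threshold, min_sil=min_sil)
        # roughly, the units lie between the detected silences
        if len(silences) + 1 == n_units:
            return silences
    return None                                # no threshold gave a match

The real system also adjusts the minimum durations of silences/IPUs, which is omitted here.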

In the transcription file, silent pauses must be indicated using either of the two following means, which can also be combined:

  • with the symbol #;
  • with newlines.

A recorded speech file must strictly correspond to a txt file of the transcription. The annotation provides an annotated file with one tier named Transcription. The silence intervals are labelled with the # symbol, and IPUs are labelled with ipu_ followed by the IPU number, then the corresponding transcription.

The same parameters as those indicated in the previous section must be fixed.

Remark: This annotation was tested on read speech no longer than a few sentences (about 1 minute of speech) and on recordings of very good quality.

Fill in IPUs

Perform Fill in IPUs with the GUI

It is an annotation of STANDALONE type.

Click on the Fill in IPUs activation button and on the Configure… blue text to fix options.

Perform Fill in IPUs with the CLI

fillipus.py is the program to perform this IPUs segmentation, i.e. silence/ipus segmentation, either on a single file (-i and optionally -o) or on a set of files (by using -I and optionally -e).

Usage

fillipus.py [files] [options]
      Fill in IPUs: Search for Inter-Pausal Units and fill in with a transcription.
      Requires an audio file and a .txt file with the transcription.
      optional arguments:
        -h, --help         show this help message and exit
        --quiet            Disable the verbosity
        --log file         File name for a Procedure Outcome Report (default: None)
      Files (manual mode):
        -i file            Input wav file name.
        -t file            Input transcription file name.
        -o file            Annotated file with filled IPUs
      Files (auto mode):
        -I file            Input wav file name (append).
        -e .ext            Output file extension. One of: .xra .TextGrid .eaf .csv
                           .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx .arff .xrff
      Options:
        --outputpattern OUTPUTPATTERN
                              Output file pattern (default: )
        --min_ipu MIN_IPU  Initial minimum duration of an IPU (in seconds) (default:
                           0.300)
        --min_sil MIN_SIL  Initial minimum duration of a silence (in seconds)
                           (default: 0.200)
      This program is part of SPPAS version 3.0. Copyright (C) 2011-2020 Brigitte
      Bigi. Contact the author at: contact@sppas.org

Examples of use

A single input file with its transcription, in manual mode:

python .\sppas\bin\fillipus.py -i .\samples\samples-eng\oriana1.wav -t
     .\samples\samples-eng\oriana1.txt
          2018-12-19 11:03:15,614 [INFO] Logging set up level=15
          2018-12-19 11:03:15,628 [INFO]  ... Information:
          2018-12-19 11:03:15,628 [INFO]  ... ... Threshold volume value:     122
          2018-12-19 11:03:15,630 [INFO]  ... ... Threshold silence duration: 0.200
          2018-12-19 11:03:15,630 [INFO]  ... ... Threshold speech duration:  0.300
          0.000000 1.675000 #
          1.675000 4.570000 the flight was 12 hours long and we really got bored
          4.570000 6.390000 #
          6.390000 9.870000 they only played two movies + which we had both already seen
          9.870000 11.430000 #
          11.430000 14.730000 I never get to sleep on the airplane because it's so uncomfortable
          14.730000 17.792000 #

A single input file in automatic mode:

python .\sppas\bin\fillipus.py -I .\samples\samples-eng\oriana1
      python .\sppas\bin\fillipus.py -I .\samples\samples-eng\oriana1.wav
      python .\sppas\bin\fillipus.py -I .\samples\samples-eng\oriana1.txt

Text normalization

Overview

In principle, any system that deals with unrestricted text needs the text to be normalized. Texts contain a variety of non-standard token types, such as digit sequences, words, acronyms and letter sequences in all capitals, mixed-case words, abbreviations, roman numerals, URLs and e-mail addresses… Normalizing or rewriting such texts using ordinary words is then an important issue. The main steps of the text normalization implemented in SPPAS (Bigi 2011) are:

  • Replace symbols by their written form, thanks to a replacement dictionary located in the folder repl of the resources directory.
  • Word segmentation, based on the content of a lexicon.
  • Convert numbers to their written form.
  • Remove punctuation.
  • Convert the text to lowercase.

Adapt Text normalization

Word segmentation in SPPAS is mainly based on the use of a lexicon. If a segmentation is not as expected, it is up to the user to modify the lexicon: the lexicons of all supported languages are located in the folder vocab of the resources directory. They are in the form of one word per line, with UTF-8 encoding and LF for newlines.
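
As an illustration of why the lexicon content matters, here is a minimal greedy longest-match segmenter - one common approach, not necessarily the exact SPPAS algorithm:

def segment(text, lexicon, max_len=10):
    """Split text into the longest words found in the lexicon."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:   # fall back to one char
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(segment("thisexample", {"this", "example"}))   # ['this', 'example']

Adding or removing an entry in the .vocab file directly changes what this look-up can match.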

Support of a new language

Adding a new language in Text Normalization consists in the following steps:

  1. Create a lexicon. Fix properly its encoding (UTF-8) and its newlines (LF), and fix the name and extension of the file as follows:
    • language name with the iso639-3 standard
    • extension .vocab
  2. Put this lexicon in the resources/vocab folder
  3. Create a replacement dictionary for that language (take a look at the ones of the other languages!)
  4. Optionally, the language can be added into the num2letter.py program

That’s it for most of the languages! If the language requires more steps, simply write to the author to collaborate, find some funding, etc., as was already done for Cantonese (Bigi & Fung 2015) for example.

Perform Text Normalization with the GUI

It is an annotation of STANDALONE type.

The SPPAS Text Normalization system takes as input a file (or a list of files) whose name strictly matches the name of the audio file except for the extension. For example, if a file with name oriana1.wav is given, SPPAS will search for a file with name oriana1.xra at a first stage if .xra is set as the default extension, then it will search for other supported extensions until a file is found.

This file must include a tier with an orthographic transcription. At a first stage, SPPAS tries to find a tier with transcription as name. If such a tier does not exist, the first tier whose name matches one of the following strings is used (case-insensitive search; a sketch of this selection is given after the list):

  1. trans
  2. trs
  3. toe
  4. ortho
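
A minimal sketch of this selection, assuming the tiers are given as a name-to-tier mapping:

def find_transcription_tier(tiers):
    """Return the tier holding the orthographic transcription, or None."""
    for name in tiers:                         # exact name, first
        if name.lower() == "transcription":
            return tiers[name]
    for part in ("trans", "trs", "toe", "ortho"):
        for name in tiers:                     # then partial matches, in order
            if part in name.lower():
                return tiers[name]
    return None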

Text Normalization produces a file with -token appended to its name, i.e., oriana1-token.xra for the previous example. By default, this file includes only one tier with the resulting normalization, with name Tokens. To get other versions of the normalized transcription, click on the Configure… text, then check the expected tiers.

Read the Introduction of this chapter for a better understanding of the difference between standard and faked results.

To perform the text normalization process, click on the Text Normalization activation button, select the language and click on the Configure… blue text to fix options.

Perform Text Normalization with the CLI

normalize.py is the program to perform Text Normalization, i.e., the text normalization of a given file or a raw text.

Usage

normalize.py [files] [options]
      Text Normalization: Text normalization segments the orthographic transcription
      into tokens and remove punctuation, convert numbers, etc. Requires an
      orthographic transcription into IPUs.
      optional arguments:
        -h, --help       show this help message and exit
        --quiet          Disable the verbosity
        --log file       File name for a Procedure Outcome Report (default: None)
      Files (manual mode):
        -i file          Input transcription file name.
        -o file          Annotated file with normalized tokens.
      Files (auto mode):
        -I file          Input transcription file name (append).
        -l lang          Language code (iso639-3). One of: cat cmn deu eng fra hun
                         ita jpn kor nan pcm pol por spa vie yue yue_chars.
        -e .ext          Output file extension. One of: .xra .TextGrid .eaf .csv
                         .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx .arff .xrff
      Resources:
        -r vocab         Vocabulary file name
      Options:
        --inputpattern INPUTPATTERN
                              Input file pattern (orthographic transcription)
                              (default: )
        --outputpattern OUTPUTPATTERN
                              Output file pattern (default: -token)
        --faked FAKED    Create a tier with the faked tokens (required for
                         phonetization) (default: True)
        --std STD        Create a tier with the standard tokens (useful if EOT)
                         (default: False)
        --custom CUSTOM  Create a customized tier (default: False)
        --occ_dur OCC_DUR     Create tiers with number of tokens and duration of
                              each IPU (default: True)
      This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte
      Bigi. Contact the author at: contact@sppas.org

Examples of use

A single input file with a raw transcription input in manual mode:

python .\sppas\bin\normalize.py -r .\resources\vocab\eng.vocab -i .\samples\samples-
     eng\oriana1.txt
          2018-12-19 11:48:34,151 [INFO] Logging set up level=15
          2018-12-19 11:48:34,473 [INFO]  ... ... Intervalle numéro 1.
          2018-12-19 11:48:34,477 [INFO]  ... ... Intervalle numéro 2.
          2018-12-19 11:48:34,480 [INFO]  ... ... Intervalle numéro 3.
          Tokens
          1, the flight was twelve hours long and we really got bored
          2, they only played two movies + which we had both already seen
          3, i never get to sleep on the airplane because it's so uncomfortable

A single input file with a transcription time-aligned into the IPUs, in manual mode and with no logs:

python .\sppas\bin\normalize.py -r .\resources\vocab\eng.vocab
      -i .\samples\samples-eng\oriana1.xra --quiet
          Tokens
          0.000000, 1.675000 #
          1.675000, 4.570000 the flight was twelve hours long and we really got bored
          4.570000, 6.390000 #
          6.390000, 9.870000 they only played two movies + which we had both already seen
          9.870000, 11.430000 #
          11.430000, 14.730000 i never get to sleep on the airplane because it's so uncomfortable
          14.730000, 17.792000 #

The same file in automatic mode can be annotated with one of the following commands:

python .\sppas\bin\normalize.py -I .\samples\samples-eng\oriana1.xra -l eng
      python .\sppas\bin\normalize.py -I .\samples\samples-eng\oriana1.txt -l eng
      python .\sppas\bin\normalize.py -I .\samples\samples-eng\oriana1.wav -l eng
      python .\sppas\bin\normalize.py -I .\samples\samples-eng\oriana1 -l eng

This program can also normalize data from the standard input. Example of use, using stdin/stdout under Windows:

Write-Output "The flight was 12 HOURS {toto} long." |
      python .\sppas\bin\normalize.py -r .\resources\vocab\eng.vocab --quiet
          the
          flight
          was
          twelve
          hours
          long

In that case, the comment mentioned in braces is removed and the number is converted to its written form. The character “_” is used for compound words (it replaces the whitespace).

Phonetization

Overview

Phonetization, also called grapheme-phoneme conversion, is the process of representing sounds with phonetic signs. However, converting written text into actual sounds, for any language, causes several problems that have their origins in the relative lack of correspondence between the spelling of the lexical items and their sound content. As a consequence, SPPAS implements a dictionary-based solution, which consists in storing a maximum of phonological knowledge in a lexicon. This approach is then language-independent. The SPPAS phonetization process is the equivalent of a sequence of dictionary look-ups.

Most of the other systems assume that all words of the speech transcription are mentioned in the pronunciation dictionary. On the contrary, SPPAS includes a language-independent algorithm which is able to phonetize unknown words of any language as long as a (minimum) dictionary is available (Bigi 2013). The Procedure Outcome Report reports on such cases with a WARNING message.

Adapt Phonetization

Since Phonetization is only based on the use of a pronunciation dictionary, the quality of the result only depends on this resource. If a pronunciation is not as expected, it is up to the user to change it in the dictionary: dictionaries are located in the folder dict of the resources directory. They all use UTF-8 encoding and LF for newlines. The format of the dictionaries is HTK-like. As an example, below is a piece of the eng.dict file:

    THE             [THE]           D @
          THE(2)          [THE]           D V
          THE(3)          [THE]           D i:
          THEA            [THEA]          T i: @
          THEALL          [THEALL]        T i: l
          THEANO          [THEANO]        T i: n @U
          THEATER         [THEATER]       T i: @ 4 3:r
          THEATER'S       [THEATER'S]     T i: @ 4 3:r z

The first column indicates the word, followed by the variant number (except for the first one). The second column indicates the word between brackets. The last columns are the succession of phones, separated by whitespace. SPPAS is relatively compliant with the format and accepts empty brackets or missing brackets.

The phonesets of the languages are mainly based on the X-SAMPA international standard. See the chapter Resources of this documentation to know the list of accepted phones for a given language. This list can’t be extended nor modified by users. However, new phones can be added: send an e-mail to the author to collaborate in that way.

Actually, some words can correspond to several entries in the dictionary, with various pronunciations. These pronunciation variants are stored in the phonetization result. By convention, whitespace separates words, the minus character separates phones, and the pipe character separates the phonetic variants of a word. For example, for the following transcribed utterance (a minimal look-up sketch follows the example):

  • Transcription: The flight was 12 hours long.
  • Text Normalization: the flight was twelve hours long
  • Phonetization: D-@|D-V|D-i: f-l-aI-t w-A-z|w-V-z|w-@-z|w-O:-z t-w-E-l-v aU-3:r-z|aU-r-z l-O:-N
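
Here is a minimal sketch of such a dictionary-based phonetizer for the HTK-like format shown above - assuming well-formed entries, whereas SPPAS itself is more permissive: the variants of a word are merged with the | separator, and the phones of a pronunciation are joined with -.

import collections
import re

def load_dict(lines):
    """Build a word -> list of pronunciations mapping."""
    pron = collections.defaultdict(list)
    for line in lines:
        entry, phones = line.split(None, 1)        # 'THE(2)', '[THE] D V'
        word = re.sub(r"\(\d+\)$", "", entry)      # strip the variant number
        phones = re.sub(r"\[[^]]*\]", "", phones)  # strip the bracketed word
        pron[word.lower()].append("-".join(phones.split()))
    return pron

pron = load_dict(["THE [THE] D @", "THE(2) [THE] D V", "THE(3) [THE] D i:"])
print(" ".join("|".join(pron[w]) for w in ["the"]))   # D-@|D-V|D-i: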

Support of a new language

The support of a new language in Phonetization only consists in:

  1. creating the pronunciation dictionary, respecting the following constraints on the file:
    • its format (HTK-like),
    • its encoding (UTF-8),
    • its newlines (LF),
    • its phone set (X-SAMPA),
    • its file name (the iso639-3 code of the language, with the .dict extension);
  2. adding the dictionary in the dict folder of the resources directory.

Perform Phonetization with the GUI

It is an annotation of STANDALONE type.

The Phonetization process takes as input a file whose name strictly matches the audio file name, except that the extension differs and that -token is appended. For example, if the audio file name is oriana1.wav, the expected input file name is oriana1-token.xra if .xra is the default extension for annotations. This file must include a normalized orthographic transcription. The name of such a tier must contain one of the following strings:

  1. tok
  2. trans

The first tier that matches one of these requirements is used (this match is case-insensitive).

Phonetization produces a file with -phon appended to its name, i.e., oriana1-phon.xra for the previous example. This file contains only one tier with the resulting phonetization, with name Phones.

To perform the annotation, click on the Phonetization activation button, select the language and click on the Configure… blue text to fix options.

Perform Phonetization with the CLI

phonetize.py is the program to perform Phonetization on a given file, i.e., the grapheme-to-phoneme conversion of a file or a raw text.

Usage

phonetize.py [files] [options]
      Phonetization: Grapheme to phoneme conversion represents sounds with phonetic
      signs. Requires a Text Normalization.
      optional arguments:
        -h, --help            show this help message and exit
        --quiet               Disable the verbosity
        --log file            File name for a Procedure Outcome Report (default: None)
      Files (manual mode):
        -i file               Input tokenization file name.
        -o file               Annotated file with phonetization.
      Files (auto mode):
        -I file               Input transcription file name (append).
        -l lang               Language code (iso639-3). One of: cat cmn deu eng fra
                              ita jpn kor nan pcm pol por spa yue yue_chars.
        -e .ext               Output file extension. One of: .xra .TextGrid .eaf
                              .csv .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx
                              .arff .xrff
      Resources:
        -r dict               Pronunciation dictionary (HTK-ASCII format).
        -m map_file           Pronunciation mapping table. It is used to generate
                              new pronunciations by mapping phonemes of the
                              dictionary.
      Options:
        --inputpattern INPUTPATTERN
                              Input file pattern (tokenization) (default: -token)
        --outputpattern OUTPUTPATTERN
                              Output file pattern (default: -phon)
        --unk UNK             Try to phonetize unknown words (default: True)
        --usestdtokens USESTDTOKENS
                              Phonetize from standard spelling (default: False)
      This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte
      Bigi. Contact the author at: contact@sppas.org

Examples of use

Obviously, before launching the following commands, you should already have prepared the required file (the result of text normalization segmented into IPUs).

Example of the phonetization of a single input file in manual mode:

python .\sppas\bin\phonetize.py -r .\resources\dict\eng.dict
        -i .\samples\samples-eng\oriana1-token.xra --quiet
          Phones
          0.000000, 1.675000, sil
          1.675000, 4.570000, {D-@|D-i:|D-V} f-l-aI-t {w-@-z|w-V-z|w-O:-z|w-A-z} t-w-E-l-v
          {aU-3:r-z|aU-r\-z} l-O:-N {{-n-d|@-n-d} w-i: {r\-I-l-i:|r\-i:-l-i:} g-A-t b-O:-r\-d
          4.570000, 6.390000, sil
          6.390000, 9.870000, D-eI @U-n-l-i: p-l-eI-d t-u m-u-v-i:-z sil {h-w-I-tS|w-I-tS}
          w-i: h-{-d b-@U-T {O:-l-r\-E-4-i:|O:-r\-E-4-i:} s-i:-n
          9.870000, 11.430000, sil
          11.430000, 14.730000, aI n-E-v-3:r {g-I-t|g-E-t} {t-@|t-i|t-u} s-l-i:-p
          {O:-n|A-n} {D-@|D-i:|D-V} E-r\-p-l-eI-n {b-i-k-O:-z|b-i-k-V-z} {i-t-s|I-t-s}
          s-@U @-n-k-V-m-f-3:r-4-@-b-@-l
          14.730000, 17.792000, sil

The same file in automatic mode can be annotated with one of the following commands:

python .\sppas\bin\phonetize.py -l eng -I .\samples\samples-eng\oriana1-token.xra
      python .\sppas\bin\phonetize.py -l eng -I .\samples\samples-eng\oriana1.xra
      python .\sppas\bin\phonetize.py -l eng -I .\samples\samples-eng\oriana1.txt
      python .\sppas\bin\phonetize.py -l eng -I .\samples\samples-eng\oriana1.wav
      python .\sppas\bin\phonetize.py -l eng -I .\samples\samples-eng\oriana1

This program can also phonetize data from the standard input. Example of use, using stdin/stdout under Windows:

Write-Output "The flight was 12 HOURS {toto} long." |
      python .\sppas\bin\normalize.py -r .\resources\vocab\eng.vocab --quiet |
      python .\sppas\bin\phonetize.py -r .\resources\dict\eng.dict --quiet
          D-@|D-V|D-i:
          f-l-aI-t
          w-A-z|w-V-z|w-@-z|w-O:-z
          t-w-E-l-v
          aU-3:r-z|aU-r\-z
          l-O:-N

Alignment

Overview

Alignment, also called phonetic segmentation, is the process of aligning speech with its corresponding transcription at the phone level. The alignment problem consists in a time-matching between a given speech unit and the phonetic representation of that unit.

SPPAS Alignment does not perform the segmentation itself. It is a wrapper either for the Julius Speech Recognition Engine (SRE) or for the HVite command of HTK-Toolkit. In addition, SPPAS can perform a basic alignment, assigning the same duration to each sound.

Speech Alignment requires an Acoustic Model in order to align speech. An acoustic model is a file that contains a statistical representation of every distinct sound in a given language. Each sound is represented by one of these statistical representations. The quality of the alignment result only depends on both this resource and the aligner. From our experience, we got better results with Julius. See chapter 4, Resources for Automatic Annotations, to get the list of sounds of each language.

Notice that SPPAS allows time-aligning automatically laughter, noises, or filled pauses (depending on the language): no other system is able to achieve this task!

SPPAS alignment output example

Adapt Alignment

The better the acoustic model, the better the alignment results. Any user can append or replace the acoustic models included in the models folder of the resources directory. Be aware that SPPAS only supports HTK-ASCII acoustic models, trained from 16-bit, 16,000 Hz wave files.

The existing models can be improved if they are re-trained with more data. To get a better alignment result, any new data is then welcome: send an e-mail to the author to share your recordings and transcripts.

Support of a new language

The support of a new language in Alignment only consists in adding a new acoustic model of the appropriate format, in the appropriate directory, with the appropriate phone set.

The articulatory representations of phonemes are so similar across languages that phonemes can be considered as units which are independent of the underlying language (Schultz et al. 2001). In the SPPAS package, 9 acoustic models of the same type - i.e., same HMM definitions and acoustic parameters - are already available, so that the phoneme prototypes can be extracted and reused to create an initial model for a new language.

Any new model can also be trained by the author, as soon as enough data is available. It is challenging to estimate exactly the amount of data a given language requires. That said, we can approximate the minimum as follows:

  • 3 minutes altogether of various speakers, manually time-aligned at the phoneme level;
  • 10 minutes altogether of various speakers, time-aligned at the IPUs level with the enriched orthographic transcription;
  • more data is good data.

Perform Alignment with the GUI

It is an annotation of STANDALONE type.

The Alignment process takes as input one or two files whose names strictly match the audio file name, except that the extension differs and that -phon is appended for the first one and -token for the optional second one. For example, if the audio file name is oriana1.wav, the expected input file name is oriana1-phon.xra with the phonetization, and optionally oriana1-token.xra with the text normalization, if .xra is the default extension for annotations.

The speech segmentation process provides one file with -palign appended to its name, i.e., oriana1-palign.xra for the previous example. This file includes one or two tiers:

  • PhonAlign is the segmentation at the phone level;
  • TokensAlign is the segmentation at the word level (if a file with tokenization was found).

The following options are available to configure Alignment:

  • choose the speech segmentation system: it can be either julius, hvite or basic;
  • perform basic alignment if the aligner failed; otherwise, such intervals are left empty;
  • remove working directory will keep only the alignment results: it will remove the working files. The working directory includes one wav file per unit and a set of text files per unit;
  • create the PhnTokAlign will append another tier with intervals representing the phonetization of each word.

To perform the annotation, click on the Alignment activation button, select the language and click on the Configure… blue text to fix options.

Perform Alignment with the CLI

alignment.py is the program to perform automatic speech segmentation of a given phonetized file.

Usage

alignment.py [files] [options]
      Alignment: Time-alignment of speech audio with its corresponding transcription
      at the phone and token levels. Requires a Phonetization.
      optional arguments:
        -h, --help            show this help message and exit
        --quiet               Disable the verbosity
        --log file            File name for a Procedure Outcome Report (default: None)
      Files (manual mode):
        -i file               Input wav file name.
        -p file               Input file name with the phonetization.
        -t file               Input file name with the tokenization.
        -o file               Output file name with estimated alignments.
      Files (auto mode):
        -I file               Input transcription file name (append).
        -l lang               Language code (iso639-3). One of: cat cmn deu eng
                              eng-cd fra ita jpn kor nan pcm pol por spa yue.
        -e .ext               Output file extension. One of: .xra .TextGrid .eaf
                              .csv .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx
                              .arff .xrff
      Resources:
        -r model              Directory of the acoustic model of the language of the
                              text
        -R model              Directory of the acoustic model of the mother language
                              of the speaker (under development)
      Options:
        --inputpattern INPUTPATTERN
                              Input file pattern (phonetization) (default: -phon)
        --inputoptpattern INPUTOPTPATTERN
                              Optional input file pattern (tokenization) (default:
                              -token)
        --outputpattern OUTPUTPATTERN
                              Output file pattern (default: -palign)
        --aligner ALIGNER     Speech automatic aligner system (julius, hvite,
                              basic): (default: julius)
        --basic BASIC         Perform basic alignment if the aligner fails (default:
                              False)
        --clean CLEAN         Remove working directory (default: True)
        --activity ACTIVITY   Create the Activity tier (default: True)
        --activityduration ACTIVITYDURATION
                              Create the ActivityDuration tier (default: False)
      This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte
      Bigi. Contact the author at: contact@sppas.org

Example of use

Obviously, before launching the following command, you should already have prepared the required file (the result of phonetization) and the optional file (the result of text normalization segmented into IPUs).

python .\sppas\bin\alignment.py -I .\samples\samples-eng\oriana1.wav -l eng
          2018-12-19 18:33:38,842 [INFO] Logging set up level=15
          2018-12-19 18:33:38,844 [INFO] Options
          2018-12-19 18:33:38,844 [INFO]  ... activityduration: False
          2018-12-19 18:33:38,845 [INFO]  ... activity: True
          2018-12-19 18:33:38,845 [INFO]  ... aligner: julius
          2018-12-19 18:33:38,845 [INFO]  ... clean: True
          2018-12-19 18:33:38,845 [INFO]  ... basic: False
          2018-12-19 18:33:38,845 [INFO] File oriana1.wav: Valid.
          2018-12-19 18:33:38,845 [INFO] File oriana1-phon.xra: Valid.
          2018-12-19 18:33:38,846 [INFO] File oriana1-token.xra: Valid.
          2018-12-19 18:33:38,846 [WARNING]  ... ... A file with name
     E:\bigi\Projets\sppas\samples\samples-eng\oriana1-palign.xra is already existing. It will be
     overridden.
          2018-12-19 18:33:38,855 [INFO]  ... Découpage en intervalles.
          2018-12-19 18:33:38,901 [INFO]  ... Intervalle numéro 1.
          2018-12-19 18:33:38,904 [INFO]  ... Intervalle numéro 2.
          2018-12-19 18:33:38,908 [INFO]  ... Intervalle numéro 3.
          2018-12-19 18:33:38,913 [INFO]  ... Intervalle numéro 4.
          2018-12-19 18:33:38,917 [INFO]  ... Intervalle numéro 5.
          2018-12-19 18:33:38,921 [INFO]  ... Intervalle numéro 6.
          2018-12-19 18:33:38,926 [INFO]  ... Intervalle numéro 7.
          2018-12-19 18:33:38,928 [INFO]  ... Fusion des alignements des intervalles.
          2018-12-19 18:33:38,969 [INFO]  ... Création de la tier des activités.
          2018-12-19 18:33:38,993 [INFO]  ... E:\bigi\Projets\sppas\samples\samples-
     eng\oriana1-palign.xra

Activity

Overview

The Activity tier represents speech activities, i.e., speech, silences, laughter, noises… It is based on the analysis of the time-aligned tokens.

Perform Activity with the GUI

It is an annotation of STANDALONE type.

The Activity process takes as input a file whose name strictly matches the audio file name, except that the extension is replaced and -palign is appended. For example, if the audio file name is oriana1.wav, the expected input file name is oriana1-palign.xra if .xra is the default extension for annotations. This file must include time-aligned phonemes in a tier with name PhonAlign.

The annotation provides an annotated file with -activity appended to its name, i.e. oriana1-activity.xra for the previous example. This file includes one or two tiers: Activity and ActivityDuration.

To perform the annotation, click on the Activity activation button and click on the Configure… blue text to fix options.

Perform Activity with the CLI

No CLI is available for this annotation.

RMS

Overview

The Root-Mean Square (RMS) is a measure of the power in an audio signal. It is estimated from the amplitude values by: sqrt(sum(S_i^2)/n).
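
To make the formula concrete, here is a minimal Python sketch (not the SPPAS implementation; names are illustrative) estimating the RMS of a list of amplitude values, both globally and every 10 ms as the RMS-values tier does:

    import math

    def rms(samples):
        """Root-mean-square of a sequence of amplitude values."""
        return math.sqrt(sum(s * s for s in samples) / len(samples))

    def rms_every_10ms(samples, framerate=16000):
        """RMS estimated on consecutive 10 ms frames of the signal."""
        step = framerate // 100          # 10 ms worth of samples
        return [rms(samples[i:i + step])
                for i in range(0, len(samples) - step + 1, step)]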

The RMS automatic annotation estimates the RMS value on given intervals of an audio file. Empty intervals, i.e., intervals without labels, are ignored. By default, the RMS is estimated on a tier with name PhonAlign of an annotated file with pattern -palign. Both can be modified by configuring the annotation. The annotation provides an annotated file with -rms appended to its name. This file includes three tiers:

  • RMS: indicates the RMS value estimated on each non-empty interval;
  • RMS-values: indicates RMS values estimated every 10 ms in each interval;
  • RMS-mean: indicates the mean of the previous values.

Perform RMS with the GUI

It is an annotation of STANDALONE type.

To perform the annotation, click on the RMS activation button and click on the Configure… blue text to fix options.

Perform RMS with the CLI

rms.py is the program to perform this annotation, either on a single given file (-i and -t) or on a set of files (-I).

Anonymization

Overview

This automatic annotation allows buzzing intervals by performing the three following actions:

  • replace the content of the annotated interval by buzz, and do the same on the corresponding intervals of all tiers of the given file;
  • make the audio segment incomprehensible – if enabled;
  • blur the detected face or mouth areas of the video – if enabled.

This annotation is useful to anonymize a corpus by buzzing proper names, places, etc. It is also useful to buzz some bad language like coarse expressions.

Many parameters allow buzzing a set of intervals of any tier, in any file. The parameters for the annotated file with the intervals are:

  • inputpattern: The pattern of the file which contains the intervals to be selected. Its default value is -palign, which contains time-aligned tokens and phonemes.
  • buzztier: The name of the tier with intervals to be buzzed can be fixed. By default, TokensAlign is used, but any other tier can be used.
  • buzzname: This is the pattern to find when filtering the intervals of the tier. By default, this pattern is $. It can be any character or string.
  • buzzfilter: This is the name of the filter to be applied to select buzzed intervals. By default, the contains filter is used. It means that the system will select any interval containing the given pattern.

The parameters for the video file corresponding to the given annotated file are:

  • inputpattern2: The pattern of the video file. By default, no pattern is given, but it could be -identid001 for example.
  • buzzvideo: The mode of the video anonymization. It must be one of: none, face or mouth. By default, face is used. When choosing mouth, only the bottom part of the face is blurred. Choosing none turns off the video anonymization.

The parameters for the audio file corresponding to the given annotated file are:

  • inputpattern3: The pattern of the audio file. By default, no pattern is given.
  • buzzaudio: This is to turn on or off the anonymization of the audio file.

Filtering the intervals

The following filters are proposed to select the intervals which are to be anonymized:

  • exact: An annotation is selected if its label strictly corresponds to the given pattern.
  • contains: An annotation is selected if its label contains the given pattern.
  • startswith: An annotation is selected if its label starts with the given pattern.
  • endswith: An annotation is selected if its label ends with the given pattern.

When prepending not_ to the name of a filter, all these matches are reversed, representing respectively: does not exactly match, does not contain, does not start with, or does not end with.

Moreover, the pattern matching is case-sensitive by default; it becomes case-insensitive by adding the letter i before the name of the filter, like iexact or icontains.

Finally, the filter regexp is also available for the pattern to match the annotation label with a regular expression.

Here is the full list of allowed filters:
exact, contains, startswith, endswith, regexp, iexact, icontains, istartswith, iendswith, not_exact, not_contains, not_startswith, not_endswith, not_iexact, not_icontains, not_istartswith, not_iendswith.
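
The following sketch shows one way these filter names could be interpreted; it is a toy re-implementation for clarity, not the SPPAS API:

    import re

    def match(label, filter_name, pattern):
        """Tell whether an annotation label is selected by the filter."""
        if filter_name.startswith("not_"):       # reversed matches
            return not match(label, filter_name[4:], pattern)
        if filter_name.startswith("i"):          # case-insensitive variants
            label, pattern = label.lower(), pattern.lower()
            filter_name = filter_name[1:]
        if filter_name == "exact":
            return label == pattern
        if filter_name == "contains":
            return pattern in label
        if filter_name == "startswith":
            return label.startswith(pattern)
        if filter_name == "endswith":
            return label.endswith(pattern)
        if filter_name == "regexp":
            return re.search(pattern, label) is not None
        raise ValueError("unknown filter: " + filter_name)

For example, match("Bonjour", "icontains", "JOUR") returns True, while match("Bonjour", "not_exact", "Bonjour") returns False.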

Anonymization of the audio

The selected audio segments are anonymized, which means the original content can't be heard and can't be recovered by a reverse process. The implemented algorithm preserves the intensity, but the pitch values are lost. If preserving the F0 is important, the Praat script anonymise_long_sound.praat is a better solution. It was written by Daniel Hirst and is freely available here: https://www.ortolang.fr/market/tools/sldr000526. It can use the tier buzzed by SPPAS as input.
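
As a rough illustration of this idea (a sketch under simple assumptions, not the actual SPPAS algorithm), each 10 ms frame of the selected segment can be replaced by noise scaled to the frame's RMS, so that the intensity contour is preserved while the original F0 is destroyed:

    import math
    import random

    def buzz_segment(samples, start, end, step=160):
        """Replace samples[start:end] by noise with the same 10 ms
        frame-by-frame RMS as the original (16000 Hz assumed)."""
        out = list(samples)
        for i in range(start, end, step):
            frame = samples[i:min(i + step, end)]
            level = math.sqrt(sum(s * s for s in frame) / len(frame))
            for j in range(i, min(i + step, end)):
                # uniform noise in [-1, 1] has an RMS of 1/sqrt(3)
                out[j] = random.uniform(-1.0, 1.0) * level * math.sqrt(3)
        return out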

Perform Anonymization with the GUI

It is an annotation of STANDALONE type.

To perform the annotation, click on the Anonymization activation button and click on the Configure… blue text to fix options.

Perform Anonymization with the CLI

anonymize.py is the program to perform this annotation. For example, the -anonym files of the demo were obtained using the following command-line:

                 
    > .sppaspyenv~/bin/python3 ./sppas/bin/anonymize.py -I demo/demo.mp4
          --buzzname="é" --buzzvideo=mouth -e .TextGrid

The tiers with the tokens and the phonemes will be buzzed in all tokens containing the character é. The audio is buzzed during these intervals, and the detected mouth(s) in the video are blurred.

Interval Values Analysis - IVA

Overview

The Interval Values Analysis (IVA) produces statistical information about a set of values in given intervals. IVA can, for example, estimate the mean/stdev values on given intervals (IPUs, …) of a pitch file. Empty intervals, i.e., unlabelled intervals, are ignored, and a list of tags to be ignored can be fixed.

By default, the IVA is estimated with the values of a PitchTier inside the intervals defined in a tier with name TokensAlign of a file with pattern -palign. If a list of separators is given, the segments are created from them: an IVA segment is a set of consecutive annotations without separators. Default separators are # + @ * dummy, in order to ignore silences, laughter items, noises and untranscribed speech. However, if no separator is given, the IVA segments match the intervals of the given input tier. In the latter case, be aware that some file formats, including TextGrid, do not support holes: they create unlabelled intervals between the labeled ones.
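
A minimal sketch of this segmentation step (illustrative only; annotations are simplified to (begin, end, label) tuples):

    SEPARATORS = {"#", "+", "@", "*", "dummy"}

    def iva_segments(annotations, separators=SEPARATORS):
        """Group consecutive labelled annotations into IVA segments:
        an empty label or a separator closes the current segment."""
        segments, current = [], []
        for begin, end, label in annotations:
            if not label or label in separators:
                if current:
                    segments.append(current)
                current = []
            else:
                current.append((begin, end, label))
        if current:
            segments.append(current)
        return segments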

Both tier names and patterns can be modified by configuring the annotation. The annotation provides an annotated file with the -iva pattern. This file includes the tiers:

  • IVA-Segments: defined intervals;
  • IVA-Values: the values extracted for each segment;
  • IVA-Occurrences: indicates the number of values in each segment;
  • IVA-Total: indicates the sum of values in each segment;
  • IVA-Mean: indicates the mean of values in each segment;
  • IVA-Median: indicates the median of values in each segment;
  • IVA-StdDev: indicates the standard deviation of values in each segment;
  • IVA-Intercept: indicates the intercept value of the linear regression of values of each segment;
  • IVA-Slope: indicates the slope value of the linear regression of values of each segment.

Perform IVA with the GUI

It is an annotation of STANDALONE type.

To perform the annotation, click on the IVA activation button and click on the Configure… blue text to fix options.

Perform IVA with the CLI

iva.py is the program to perform this annotation, either on a single given file (-i and -s) or on a set of files (-I).

Lexical Metric

Overview

The Lexical Metric produces information about the number of occurrences and the rank of each occurrence in annotation labels.

By default, the lexical metrics are estimated on a tier with name TokensAlign of a file with pattern -palign. If a list of separators is given, segments are created to estimate the number of occurrences. Default separators are # + @ * dummy, in order to ignore silences, laughter items, noises and untranscribed speech.

Both the tier name and the pattern can be modified by configuring the annotation. The annotation provides an annotated file with the -lexm pattern. This file includes the tiers:

  • LM-OccAnnInSegments: defined intervals with number of occurrences of annotations;
  • LM-OccLabInSegments: defined intervals with number of occurrences of labels;
  • LM-Occ: the number of occurrences of the label each annotation represents;
  • LM-Rank: the rank of the label each annotation represents.

Perform Lexical Metric with the GUI

It is an annotation of STANDALONE type.

To perform the annotation, click on the Lexical Metric activation button and click on the Configure… blue text to fix options.

Syllabification

Overview

The syllabification of phonemes is performed with a rule-based system from time-aligned phonemes. This phoneme-to-syllable segmentation system is based on two main principles:

  • a syllable contains a vowel, and only one;
  • a pause is a syllable boundary.

These two principles reduce the problem to the task of finding a syllabic boundary between two vowels. Phonemes are grouped into classes, and rules are established to deal with these classes.

Syllabification example

For each language, the automatic syllabification requires a configuration file to fix phonemes, classes and rules.

Adapt Syllabification

Any user can change the set of rules by editing and modifying the configuration file of a given language. Such files are located in the folder syll of the resources directory. They all use UTF-8 encoding and LF for newlines.

At first, the list of phonemes and the class symbol associated with each phoneme are described as, for example:

  • PHONCLASS e V
  • PHONCLASS p P

Each phoneme/class association is made of three columns: the first one is the keyword PHONCLASS, the second is the phoneme symbol (as defined in the tier with the phonemes, commonly X-SAMPA), and the last column is the class symbol. The constraints on this definition are that a class symbol is a single upper-case character, that the character X is forbidden, and that the characters V and W are reserved for vowels.

The second part of the configuration file contains the rules. The first column is a keyword, the second one describes the classes between two vowels and the third column is the boundary location. The first column can be:

  • GENRULE
  • EXCRULE
  • OTHRULE.

In the third column, a 0 means the boundary is just after the first vowel, 1 means the boundary is one phoneme after the first vowel, etc. Here are some examples from the file for the French language (a sketch of how such rules apply follows the examples):

  • GENRULE VXV 0
  • GENRULE VXXV 1
  • EXCRULE VFLV 0
  • EXCRULE VOLGV 0
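
To make these rules concrete, here is a minimal sketch of how the general and exception rules could be applied to the sequence of classes between two vowels. It assumes that X stands for any class in a rule pattern (presumably why X is forbidden as a class symbol); this is an illustration, not the SPPAS implementation:

    def find_boundary(classes, gen_rules, exc_rules):
        """Return the boundary shift for a class sequence such as 'VFLV'.
        Exception rules take precedence over general ones."""
        if classes in exc_rules:                  # e.g. {"VFLV": 0, "VOLGV": 0}
            return exc_rules[classes]
        for pattern, shift in gen_rules.items():  # e.g. {"VXV": 0, "VXXV": 1}
            if len(pattern) == len(classes) and \
                    all(p in ("X", c) for p, c in zip(pattern, classes)):
                return shift
        return 0

With the French rules above, the class sequence VPLV falls under the general rule VXXV, so the boundary is one phoneme after the first vowel.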

Finally, to adapt the rules to specific situations that the previous rules fail to model, some phoneme sequences with their boundary definition were introduced. These specific rules contain only phonemes or the symbol ANY, which means any phoneme. Such a rule consists of seven columns: the first one is the keyword OTHRULE, the five following columns are a phoneme sequence in which the rules placed the boundary before the third one, and the last column is the shift to apply to this boundary, as in the following example:

OTHRULE ANY ANY p s k -2

More information is available in (Bigi et al. 2010).

Support of a new language

The support of a new language in this automatic syllabification only consists in adding a configuration file (see the previous section). Make sure the file uses UTF-8 encoding and LF newlines; then fix the name and extension of the file as follows:

  • syllConfig- followed by the language name in the iso639-3 standard,
  • with extension .txt.

Perform Syllabification with the GUI

It is an annotation of STANDALONE type.

The Syllabification process takes as input a file whose name strictly matches the audio file name, except that the extension is replaced and -palign is appended. For example, if the audio file name is oriana1.wav, the expected input file name is oriana1-palign.xra if .xra is the default extension for annotations. This file must include time-aligned phonemes in a tier with name PhonAlign.

The annotation provides an annotated file with -salign appended to its name, i.e. oriana1-salign.xra for the previous example. This file includes two tiers: SyllAlign and SyllClassAlign. Optionally, the program can add a tier with the syllable structures (V, CV, CCV…).

To perform the annotation, click on the Syllabification activation button, select the language and click on the Configure… blue text to fix options.

Perform Syllabification with the CLI

syllabify.py is the program to perform automatic syllabification of a given file with time-aligned phones.

Usage

syllabify.py [files] [options]
          Syllabification: Syllabification is based on a set of rules to convert
          phonemes into classes and to group them. Requires time-aligned phones.
          optional arguments:
            -h, --help            show this help message and exit
            --quiet               Disable the verbosity
            --log file            File name for a Procedure Outcome Report (default: None)
          Files (manual mode):
            -i file               Input time-aligned phonemes file name.
            -o file               Output file name with syllables.
          Files (auto mode):
            -I file               Input transcription file name (append).
            -l lang               Language code (iso639-3). One of: fra ita pol.
            -e .ext               Output file extension. One of: .xra .TextGrid .eaf
                                  .csv .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx
                                  .arff .xrff
          Resources:
            -r rules              Configuration file with syllabification rules
          Options:
            --inputpattern INPUTPATTERN
                                  Input file pattern (time-aligned phonemes) (default:
                                  -palign)
            --outputpattern OUTPUTPATTERN
                                  Output file pattern (default: -syll)
            --usesphons USESPHONS
                                  Syllabify inside the IPU intervals (default: True)
            --usesintervals USESINTERVALS
                                  Syllabify inside an interval tier (default: False)
            --tiername TIERNAME   Tier name for such interval tier: (default:
                                  TokensAlign)
            --createclasses CREATECLASSES
                                  Create a tier with syllable classes (default: True)
          This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte
          Bigi. Contact the author at: contact@sppas.org

Examples of use


 python .\sppas\bin\syllabify.py -i .\samples\samples-fra\F_F_B003-P8-palign.xra
     -r .\resources\syll\syllConfig-fra.txt --quiet
      SyllAlign
      2.497101 2.717101 j-E-R
      2.717101 2.997101 s-w-A/-R
      ...
      19.412000 19.692000 P-L-V-P
      19.692000 20.010000 P-V-L-P
         

All the following commands will produce the same result:


 python .\sppas\bin\syllabify.py -I .\samples\samples-fra\F_F_B003-P8-palign.xra -l fra
 python .\sppas\bin\syllabify.py -I .\samples\samples-fra\F_F_B003-P8.TextGrid -l fra
 python .\sppas\bin\syllabify.py -I .\samples\samples-fra\F_F_B003-P8.wav -l fra
 python .\sppas\bin\syllabify.py -I .\samples\samples-fra\F_F_B003-P8 -l fra
         

TGA - Time Groups Analyzer

Overview

TGA is originally available at http://wwwhomes.uni-bielefeld.de/gibbon/TGA/. It’s a tool developed by Dafydd Gibbon, emeritus professor of English and General Linguistics at Bielefeld University.

Dafydd Gibbon (2013). TGA: a web tool for Time Group Analysis, Tools and Resources for the Analysis of Speech Prosody, Aix-en-Provence, France, pp. 66-69.

The original TGA is an online batch processing tool that provides a parametrized mapping from time-stamps in speech annotation files in various formats to a detailed analysis report with statistics and visualizations. TGA software calculates, inter alia, mean, median, rPVI, nPVI, slope and intercept functions within inter-pausal groups, provides visualization of timing patterns, as well as correlations between these, and parses inter-pausal groups into hierarchies based on duration relations. Linear regression is selected mainly for the slope function, as a first approximation to examining acceleration and deceleration over large data sets.

The TGA online tool was designed to support phoneticians in basic statistical analysis of annotated speech data. In practice, the tool provides not only rapid analyses but also the ability to handle larger data sets than can be handled manually.

In addition to the original one, a second version of TGA was implemented in the AnnotationPro software:

Katarzyna Klessa, Dafydd Gibbon (2014). Annotation Pro + TGA: automation of speech timing analysis, 9th International conference on Language Resources and Evaluation (LREC), Reykjavik (Iceland). pp. 1499-1505, ISBN: 978-2-9517408-8-4.

The integrated Annotation Pro + TGA tool incorporates some TGA features and is intended to support the development of more robust and versatile timing models for a greater variety of data. The integration of TGA statistical and visualization functions into Annotation Pro+TGA results in a powerful computational enhancement of the
existing AnnotationPro phonetic workbench, for supporting experimental analysis and modeling of speech timing.

So, what is new in the third version, implemented in SPPAS?

First of all, it has to be noticed that TGA is only partly implemented in SPPAS. The statistical analysis tool of SPPAS allows estimating TGA within the SPPAS framework, which results in the following advantages:

  • it can read TextGrid, csv, Elan, or any other file format supported by SPPAS,
  • it can save the TGA results in any of the annotation file formats supported by SPPAS,
  • it estimates the two versions of the linear regression estimators, the original one and the one implemented in AnnotationPro (see the sketch after this list):
    1. in the original TGA, the x-axis is based on the positions of the syllables,
    2. in AnnotationPro+TGA, the x-axis is based on the time-stamps.
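
Both estimators reduce to an ordinary least-squares regression of the syllable durations; only the definition of the x-axis changes. A minimal sketch, with illustrative names:

    def tga_regression(durations, timestamps, original=True):
        """Intercept and slope of syllable durations, where x is the
        syllable position (original TGA) or the syllable time-stamp
        (AnnotationPro+TGA)."""
        x = list(range(len(durations))) if original else timestamps
        n = len(durations)
        mean_x = sum(x) / n
        mean_y = sum(durations) / n
        sxy = sum((xi - mean_x) * (yi - mean_y)
                  for xi, yi in zip(x, durations))
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        slope = sxy / sxx
        return mean_y - slope * mean_x, slope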

Result of TGA into SPPAS

The annotation provides an annotated file with -tga appended to its name, i.e. oriana1-tga.xra for the example. This file includes ten tiers:

  1. TGA-TimeGroups: intervals with the time groups
  2. TGA-TimeSegments: same intervals, indicating the syllables separated by whitespace
  3. TGA-Occurrences: same intervals, indicating the number of syllables
  4. TGA-Total: same intervals, indicating the interval duration
  5. TGA-Mean: same intervals, indicating the mean duration of the syllables
  6. TGA-Median: same intervals, indicating the median duration of the syllables
  7. TGA-Stdev: same intervals, indicating the stdev of the duration of the syllables
  8. TGA-nPVI: same intervals, indicating the nPVI of the syllables
  9. TGA-Intercept: same intervals, indicating the intercept
  10. TGA-Slope: same intervals, indicating the slope

Both tiers 9 and 10 can be estimated in two ways (so two more tiers can be generated).

Perform TGA with the GUI

It is an annotation of STANDALONE type.

The TGA process takes as input a file whose name strictly matches the audio file name, except that the extension is replaced and -salign is appended. For example, if the audio file name is oriana1.wav, the expected input file name is oriana1-salign.xra if .xra is the default extension for annotations. This file must include time-aligned syllables in a tier with name SyllAlign.

To perform the annotation, click on the TGA activation button and click on the Configure… blue text to fix options.

Perform TGA with the CLI

tga.py is the program to perform TGA of a given file with time-aligned syllables.

Usage

tga.py [files] [options]
  TimeGroupAnalysis: Proposed by D. Gibbon, Time Group Analyzer calculates mean,
  median, nPVI, slope and intercept functions within inter-pausal groups.
  Requires time aligned syllables.
  optional arguments:
    -h, --help            show this help message and exit
    --quiet               Disable the verbosity
    --log file            File name for a Procedure Outcome Report (default: None)
  Files (manual mode):
    -i file               An input time-aligned syllables file.
    -o file               Output file name with TGA.
  Files (auto mode):
    -I file               Input time-aligned syllables file (append).
    -e .ext               Output file extension. One of: .xra .TextGrid .eaf
                          .csv .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx
                          .arff .xrff
  Options:
    --original ORIGINAL   Use the original estimation of intercept and slope
                          (default: False)
    --annotationpro ANNOTATIONPRO
                          Use the estimation of intercept and slope proposed in
                          AnnotationPro (default: True)
    --tg_prefix_label TG_PREFIX_LABEL
                          Prefix of each time group label: (default: tg_)
    --with_radius WITH_RADIUS
                          Duration estimation: Use 0 to estimate syllable
                          durations with midpoint values, use -1 for Radius-, or
                          1 for Radius+. (default: 0)
  This program is part of SPPAS version 2.0. Copyright (C) 2011-2019 Brigitte
  Bigi. Contact the author at: contact@sppas.org

Example of use

python .\sppas\bin\tga.py -i .\samples\samples-fra\F_F_B003-P8-syll.xra
      2018-12-20 08:35:21,219 [INFO] Logging set up level=15
      TGA-TimeGroups
      2.497101 5.683888 tg_1
      5.743603 8.460596 tg_2
      9.145000 11.948531 tg_3
      12.494000 13.704000 tg_4
      13.784000 15.036000 tg_5
      16.602000 20.010000 tg_6
      TGA-TimeSegments
      ...
      13.784000 15.036000 -0.03063
      16.602000 20.010000 0.00468

Other commands:

python .\sppas\bin\tga.py -I .\samples\samples-fra\F_F_B003-P8-syll.xra
python .\sppas\bin\tga.py -I .\samples\samples-fra\F_F_B003-P8.TextGrid
python .\sppas\bin\tga.py -I .\samples\samples-fra\F_F_B003-P8.wav

Stop words

Creates a tier with True/False values indicating whether each token is a stop-word or not.

Self-Repetitions

Overview

This automatic detection focuses on word self-repetitions, which can be exact repetitions (named strict echoes) or repetitions with variations (named non-strict echoes). The system is based only on lexical criteria. Notice that the algorithm focuses on the detection of the source.

This system can use a list of stop-words in a given language. This is a list of the most frequent words like adjectives, pronouns, etc. The result of the automatic detection is significantly better if such a list of stopwords is available.

Optionally, SPPAS can add new stop-words to the list: they are deduced from the given data. These new entries in the stop-list are then different for each file (Bigi et al. 2014).

The annotation provides one annotated file with two to four tiers:

  1. TokenStrain: if a replacement file was available, it is the entry used by the system
  2. StopWord: if a stop-list was used, it indicates whether the token is a stop-word (True or False)
  3. SR-Sources: tags of the annotations are prefixed by S followed by an index
  4. SR-Repetitions: tags of the annotations are prefixed by R followed by an index

Adapt to a new language

The list of stop-words of a given language must be located in the vocab folder of the resources directory, with the .stp extension. This file must use UTF-8 encoding and LF for newlines.

Perform Self-Repetitions with the GUI

It is an annotation of STANDALONE type.

The automatic annotation takes as input a file with (at least) one tier containing the time-aligned tokens. The annotation provides one annotated file with two tiers: Sources and Repetitions.

Click on the Self-Repetitions activation button, select the language and click on the Configure… blue text to fix options.

Perform SelfRepetitions with the CLI

selfrepetition.py is the program to perform automatic detection of self-repetitions.

Usage

selfrepetition.py [files] [options]
  Self-repetitions: Self-repetitions searches for sources and echos of a
  speaker. Requires time-aligned tokens.
  optional arguments:
    -h, --help            show this help message and exit
    --quiet               Disable the verbosity
    --log file            File name for a Procedure Outcome Report (default: None)
  Files (manual mode):
    -i file               Input time-aligned tokens file name.
    -o file               Output file name with self-repetitions.
  Files (auto mode):
    -I file               Input transcription file name (append).
    -e .ext               Output file extension. One of: .xra .TextGrid .eaf
                          .csv .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx
                          .arff .xrff
  Resources:
    -r file               List of stop-words
  Options:
    --inputpattern INPUTPATTERN
                          Input file pattern (time-aligned words or lemmas)
                          (default: -palign)
    --outputpattern OUTPUTPATTERN
                          Output file pattern (default: -srepet)
    --span SPAN           Span window length in number of IPUs (default: 3)
    --stopwords STOPWORDS
                          Add stop-words estimated from the given data (default:
                          True)
    --alpha ALPHA         Coefficient to add data-specific stop-words (default:
                          0.5)
  This program is part of SPPAS version 3.0. Copyright (C) 2011-2020 Brigitte
  Bigi. Contact the author at: contact@sppas.org

Examples of use

python .\sppas\bin\selfrepetition.py -i .\samples\samples-fra\F_F_B003-P8-palign.xra
    -r .\resources\vocab\fra.stp
python .\sppas\bin\selfrepetition.py -I .\samples\samples-fra\F_F_B003-P8.wav -l fra

Other-Repetitions

Overview

This automatic detection focuses on other-repetitions, which can be either exact repetitions (named strict echoes) or repetitions with variations (named non-strict echoes). The system is based only on lexical criteria (Bigi et al. 2014). Notice that the algorithm focuses on the detection of the source.

This system can use a list of stopwords in a given language. This is a list of frequent words like adjectives, pronouns, etc. The result of the automatic detection is significantly better if such a list of stopwords is available.

Optionally, SPPAS can add new stop-words to the list: they are deduced from the given data. These new entries in the stop-list are then different for each file (see Bigi et al. 2014).

The detection of the ORs is performed in a span window of N IPUs; by default, N is fixed to 5. This means that if a repetition occurs after these N IPUs, it won't be detected. Technically, it also means that SPPAS needs to identify the boundaries of the IPUs from the time-aligned tokens: the tier must indicate the silences with the # symbol.

A file with the following tiers will be created:

  • OR-Source: intervals with the number of the sources
  • OR-SrcStrain: intervals with the tokens of the sources
  • OR-SrcLen: intervals with the number of tokens in the source
  • OR-SrcType: intervals with the type of the echoes of the sources
  • OR-Echo: intervals with the number of the echos

Adapt to a language and support of a new one

This system can use a list of stopwords in a given language. It must be located in the vocab folder of the resources directory, with the .stp extension. This file must use UTF-8 encoding and LF for newlines.

Perform Other-Repetitions with the GUI

It is an annotation of INTERACTION type.

The automatic annotation takes as input a file with (at least) one tier containing the time-aligned tokens of the main speaker, and another file/tier with tokens of the interlocutor. The annotation provides one annotated file with two tiers: Sources and Repetitions.

Click on the Other-Repetitions activation button, select the language and click on the Configure… blue text to fix options.

Perform Other-Repetitions with the CLI

usage: otherrepetition.py -r stopwords [files] [options]

Files:

  -i file               Input file name with time-aligned tokens of the main
                        speaker.
  -s file               Input file name with time-aligned tokens of the
                        echoing speaker.
  -o file               Output file name with ORs.

Options:

  --inputpattern INPUTPATTERN
                        Input file pattern (time-aligned words or lemmas)
                        (default: -palign)
  --outputpattern OUTPUTPATTERN
                        Output file pattern (default: -orepet)
  --span SPAN           Span window length in number of IPUs (default: 3)
  --stopwords STOPWORDS Add stop-words estimated from the given data
                        (default: True)
  --alpha ALPHA         Coefficient to add data-specific stop-words
                        (default: 0.5)

Re-Occurrences

This annotation searches for re-occurrences of an annotation of a speaker in the next N annotations of the interlocutor. It was originally used for gestures in (Karpinski et al. 2018).

Maciej Karpinski, Katarzyna Klessa (2018). Methods, Tools and Techniques for Multimodal Analysis of Accommodation in Intercultural Communication. CMST 24(1), 29–41. DOI: 10.12921/cmst.2018.0000006

Perform Re-Occurrences with the GUI

The automatic annotation takes as input any annotated file with (at least) one tier, and another file+tier of the interlocutor. The annotation provides one annotated file with two tiers: Sources and Repetitions.

Click on the Re-Occurrences activation button, and click on the Configure… blue text to fix options.

Perform Re-Occurrences with the CLI

usage: reoccurrences.py [files] [options]

Files:

  -i file               Input file name with time-aligned annotations of
                        the main speaker.
  -s file               Input file name with time-aligned annotations of
                        the interlocutor.
  -o file               Output file name with re-occurrences.

Options:

  --inputpattern INPUTPATTERN
                        Input file pattern (default: )
  --outputpattern OUTPUTPATTERN
                        Output file pattern (default: -reocc)
  --tiername TIERNAME   Tier to search for re-occurrences (default: )
  --span SPAN           Span window length in number of annotations (default:
                        10)

This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte Bigi. Contact the author at: contact@sppas.org

Momel (modelling melody)

Momel is an algorithm for the automatic modelling of fundamental frequency (F0) curves, using a technique called asymmetric modal quadratic regression.

This technique makes it possible by an appropriate choice of parameters to factor an F0 curve into two components:

  • a macro-prosodic component represented by a quadratic spline function defined by a sequence of target points <ms, hz>.
  • a micro-prosodic component represented by the ratio of each point on the F0 curve to the corresponding point on the quadratic spline function.

For details, see the following reference:

Daniel Hirst and Robert Espesser (1993). Automatic modelling of fundamental frequency using a quadratic spline function. Travaux de l’Institut de Phonétique d’Aix. vol. 15, pages 71-85.

The SPPAS implementation of Momel requires a file with the F0 values sampled at 10 ms. Two file formats are supported:

  • .PitchTier, from Praat.
  • .hz, from any tool. It is a file with one F0 value per line.

The following options can be fixed:

  • Window length used in the cible method
  • F0 threshold: Maximum F0 value
  • F0 ceiling: Minimum F0 value
  • Maximum error: Acceptable ratio between two F0 values
  • Window length used in the reduc method
  • Minimal distance
  • Minimal frequency ratio
  • Eliminate glitch option: Filter f0 values before cible

Perform Momel with the GUI

It is an annotation of STANDALONE type.

Click on the Momel activation button then click on the Configure… blue text to fix options.

Perform Momel with the CLI

momel.py is the program to perform Momel annotation of a given file with F0 values sampled at 10ms.

Usage

momel.py [files] [options]
  Momel: Proposed by D. Hirst and R. Espesser, Momel - Modelling of fundamental
  frequency (F0) curves is using a technique called asymmetric modal quadratic
  regression. Requires pitch values.
  optional arguments:
    -h, --help       show this help message and exit
    --quiet          Disable the verbosity
    --log file       File name for a Procedure Outcome Report (default: None)
  Files (manual mode):
    -i file          Input file name (extension: .hz or .PitchTier)
    -o file          Output file name (default: stdout)
  Files (auto mode):
    -I file          Input file name with pitch (append).
    -e .ext          Output file extension. One of: .xra .TextGrid .eaf .csv
                     .mrk .txt .stm .ctm .lab .mlf .sub .srt .antx .arff .xrff
  Options:
    --outputpattern OUTPUTPATTERN
                          Output file pattern (default: -momel)
    --win1 WIN1      Target window length (default: 30)
    --lo LO          F0 threshold (default: 50)
    --hi HI          F0 ceiling (default: 600)
    --maxerr MAXERR  Maximum error (default: 1.04)
    --win2 WIN2      Reduce window length (default: 20)
    --mind MIND      Minimal distance (default: 5)
    --minr MINR      Minimal frequency ratio (default: 0.05)
  This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte
  Bigi. Contact the author at: contact@sppas.org

Examples of use

python .\sppas\bin\momel.py -i .\samples\samples-eng\ENG_M15_ENG_T02.PitchTier
      2018-12-19 15:44:00,437 [INFO] Logging set up level=15
      2018-12-19 15:44:00,674 [INFO]  ... ... 41 anchors found.
      1.301629 109.285503
      1.534887 126.157058
      1.639614 143.657446
      1.969234 102.911464
      2.155284 98.550759
      2.354162 108.250869
      2.595364 87.005994
      2.749773 83.577924
      2.933222 90.218382
      3.356651 119.709142
      3.502254 104.104568
      3.707747 132.055286
      4.000578 96.262109
      4.141915 93.741407
      4.383332 123.996736
      4.702203 89.152708
      4.987086 101.561180
      5.283864 87.499710
      5.538984 92.399690
      5.707147 95.411586
      5.906895 87.081095
      6.705373 121.396919
      7.052992 130.821479
      7.218415 120.917642
      7.670083 101.867028
      7.841935 109.094053
      8.124574 90.763267
      8.455182 114.261067
      8.746016 93.704705
      9.575359 101.108444
      9.996245 122.488120
      10.265663 105.244429
      10.576394 94.875460
      11.730570 99.698799
      12.083323 124.002313
      12.411790 108.563104
      12.707442 101.928297
      12.963805 113.980850
      13.443483 90.782781
      13.921939 90.824376
      14.377324 60.126506

Apply Momel on all files of a given folder:

python .\sppas\bin\momel.py -I .\samples\samples-eng

INTSINT: Encoding of F0 anchor points

INTSINT assumes that pitch patterns can be adequately described using a limited set of tonal symbols, T, M, B, H, S, L, U, D (standing for: Top, Mid, Bottom, Higher, Same, Lower, Up-stepped, Down-stepped respectively), each one of which characterises a point on the fundamental frequency curve.

The rationale behind the INTSINT system is that the F0 values of pitch targets are programmed in one of two ways: either as absolute tones T, M, B, which are assumed to refer to the speaker’s overall pitch range (within the current Intonation Unit), or as relative tones H, S, L, U, D, assumed to refer only to the value of the preceding target point.

INTSINT example


A distinction is made between non-iterative H, S, L and iterative U, D relative tones since in a number of descriptions it appears that iterative raising or lowering uses a smaller F0 interval than non-iterative raising or lowering. It is further assumed that the tone S has no iterative equivalent since there would be no means of deciding where intermediate tones are located.

D.-J. Hirst (2011). The analysis by synthesis of speech melody: from data to models, Journal of Speech Sciences, vol. 1(1), pages 55-83.

Perform INTSINT with the GUI

It is an annotation of STANDALONE type.

Click on the INTSINT activation button and click on the Configure… blue text to fix options.

Perform INTSINT with the CLI

intsint.py is the program to perform INTSINT annotation of a given file with momel anchors.

Usage

intsint.py [files] [options]
  INTSINT: INternational Transcription System for INTonation codes the
  intonation of an utterance by means of an alphabet of 8 discrete symbols.
  Requires Momel targets.
  optional arguments:
    -h, --help  show this help message and exit
    --quiet     Disable the verbosity
    --log file  File name for a Procedure Outcome Report (default: None)
  Files (manual mode):
    -i file     Input file name with anchors.
    -o file     Output file name (default: stdout)
  Files (auto mode):
    -I file     Input file name with anchors (append).
    -e .ext     Output file extension. One of: .xra .TextGrid .eaf .csv .mrk
                .txt .stm .ctm .lab .mlf .sub .srt .antx .arff .xrff
  Options:
    --inputpattern INPUTPATTERN
                          Input file pattern (momel anchors) (default: -momel)
    --outputpattern OUTPUTPATTERN
                          Output file pattern (default: -intsint)
  This program is part of SPPAS version 2.4. Copyright (C) 2011-2019 Brigitte
  Bigi. Contact the author at: contact@sppas.org

Examples of use

Apply INTSINT on a single file and print the result on the standard output:

python .\sppas\bin\intsint.py -i .\samples\samples-eng\ENG_M15_ENG_T02-momel.xra --quiet
      1.301629 M
      1.534887 U
      1.639614 H
      1.969234 L
      2.155284 S
      2.354162 U
      2.595364 L
      2.749773 S
      2.933222 S
      3.356651 H
      3.502254 D
      3.707747 H
      4.000578 L
      4.141915 S
      4.383332 H
      4.702203 L
      4.987086 U
      5.283864 L
      5.538984 U
      5.707147 D
      5.906895 S
      6.705373 M
      7.052992 U
      7.218415 S
      7.670083 D
      7.841935 S
      8.124574 D
      8.455182 U
      8.746016 D
      9.575359 M
      9.996245 U
      10.265663 D
      10.576394 D
      11.730570 M
      12.083323 U
      12.411790 D
      12.707442 S
      12.963805 U
      13.443483 L
      13.921939 S
      14.377324 B

Apply INTSINT in auto mode:

python .\sppas\bin\intsint.py -I .\samples\samples-eng\ENG_M15_ENG_T02.wav
python .\sppas\bin\intsint.py -I .\samples\samples-eng\ENG_M15_ENG_T02.PitchTier
python .\sppas\bin\intsint.py -I .\samples\samples-eng\ENG_M15_ENG_T02-momel.xra

Hand&Pose

SPPAS is a wrapper for the MediaPipe Hands detection and the MediaPipe Pose detection. It also proposes a custom solution to detect the right and left hands of a person.

Overview

MediaPipe Hands is a high-fidelity hand and finger tracking solution. It employs machine learning (ML) to infer 21 3D landmarks of a hand from just a single frame. MediaPipe Hands utilizes an ML pipeline consisting of multiple models working together: a palm detection model that operates on the full image and returns an oriented hand bounding box, and a hand landmark model that operates on the cropped image region defined by the palm detector and returns high-fidelity 3D hand key points. For details about hand detection, see: https://google.github.io/mediapipe/solutions/hands.html

MediaPipe Pose is a Machine Learning solution for high-fidelity body pose tracking, inferring 33 3D landmarks. For details about pose detection, see: https://google.github.io/mediapipe/solutions/pose.html

Important:

  • SPPAS only exports the (x,y) coordinates,
  • SPPAS can export the results in either XRA or CSV formats.

Three detection modes are available:

  1. hands only: when enabling hands and disabling pose, SPPAS is a simple wrapper for MediaPipe Hand detection. The result of the system is a list of 21 sights for each detected hand.
  2. pose only: when disabling hand detection and enabling pose detection, SPPAS is a simple wrapper for MediaPipe Pose detection. Notice that if several persons are on the image, it will detect only one human body. The result of the system is a list with 33 sights of the human body.
  3. hand&pose: when enabling both hand and pose, SPPAS combines the pose detections and the hand detections. It results in two lists of sights: the first list for the right hand and the second list for the left hand. Each list is made either of 21 sights, if the hand was properly detected by the hand prediction system, or of the four sights of the pose detection system.

Here is the match between the indexes of the eight hand-related sights of the pose detection and the sight indexes of the hand detection (encoded in the sketch after this list):

  • right hand: pose 16 is hand 0, pose 18 is hand 8, pose 20 is hand 20 and pose 22 is hand 4.
  • left hand: pose 15 is hand 0, pose 17 is hand 8, pose 19 is hand 20 and pose 21 is hand 4.
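
This mapping can be written down as two small tables; the sketch below also shows how the four-sight fallback hand mentioned above could be built from the 33 pose landmarks (names are illustrative, not the SPPAS API):

    # pose-landmark index -> hand-sight index
    RIGHT_POSE_TO_HAND = {16: 0, 18: 8, 20: 20, 22: 4}
    LEFT_POSE_TO_HAND = {15: 0, 17: 8, 19: 20, 21: 4}

    def fallback_hand(pose_points, mapping):
        """Four (x, y) sights taken from the pose landmarks, used when
        the hand prediction system did not detect the hand."""
        return [pose_points[i] for i in sorted(mapping)]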

Perform annotation with the GUI

It is a STANDALONE annotation.

The annotation process takes as input an image or a video file. To perform the annotation, click on its activation button and click on the Configure… blue text to fix options.

Perform with the CLI

The CLI does not work on images but only on video recordings. To perform this annotation on an image, use the script annotation.py instead.

usage: handpose.py [files] [options]
  optional arguments:
    -h, --help            show this help message and exit
    --quiet               Disable the verbosity
    --log file            File name for a Procedure Outcome Report
                          (default: None)
  Files:
    -i file               Input video.
    -o file               Output base name.
    -I file               Input file name (append).
    -e .ext               Output file extension. One of: .mp4 .avi .mkv .mov
  Options:
    --inputpattern1 INPUTPATTERN1
                          Input pattern of the video file. (default: )
    --outputpattern OUTPUTPATTERN
                          Output file pattern with the sights. (default: -hands)
    --hand HAND           Enable hands detection. (default: True)
    --pose POSE           Enable pose detection -- for only one human body. (default: True)
    --csv CSV             Save points in a CSV file instead of XRA (default: False)
    --tag TAG             Draw points on the video (default: True)
    --folder FOLDER       Save result in a folder with image files -- if
                          video input only (default: False)

Face Detection

This is a state-of-the-art implementation of face detection performed with freely available models and tools. We introduced an original method in order to use any or all of them and to combine their results. This combined result was not evaluated.

Overview

The FaceDetection annotation searches for the coordinates of faces in an image or in all images of a video. It requires enabling the video feature in the setup, in order to install the external libraries numpy and opencv-contrib-python; optionally, mediapipe can also be installed. It also requires checking facedetect in the list of annotations to be installed at the time of the setup.

On the basis of the detection methods implemented in the opencv library, SPPAS is able to use several systems and to combine their results. These systems are based on two different methods:

  1. an Artificial Neural Network (DNN);
  2. a Haar Cascade Classifier (HCC).

The linguistic resources of this annotation include two DNN models and two models for HCC (a frontal-face model and a profile-face one). SPPAS can also launch the MediaPipe face detection system, which is much faster than the previous ones, but its results seem worse.

By default, SPPAS launches two of these detectors, 1 DNN and 1 HCC, and combines their results. This annotation runs at about 2.5x real time. Even if they can increase the quality of the final result, the other models are not used by default because the detection becomes very slow: 15x real time when using all five models. The options allow choosing the models to be used.

Result of Face Detection

There are several output files that can be created:

  • a copy of the image/video with all the detected faces surrounded by a square indicating a confidence score;
  • as many cropped image files as the number of detected faces;
  • an XRA or a CSV file with coordinates and confidence score of each detected face.

It is also possible to consider the selfie (portrait size) instead of the face.

Perform Face Detection with the GUI

It is a STANDALONE annotation.

The Face Detection process takes as input an image file and/or a video. To perform the annotation, click on the FaceDetection activation button and click on the Configure… blue text to fix options.

Perform Face Detection with the CLI

facedetection.py is the program to perform Face Detection annotation of a given media file.

Usage

usage: facedetection.py [files] [options]
  optional arguments:
    -h, --help            show this help message and exit
    --quiet               Disable the verbosity
    --log file            File name for a Procedure Outcome Report (default: None)
  Files:
    -i file               Input image.
    -o file               Output base name.
    -I file               Input file name (append).
    -r model              Model base name (.caffemodel or .xml models as wished)
    -e .ext               Output file extension (image or video)
  Options:
    --inputpattern INPUTPATTERN
                          Input file pattern (default: )
    --nbest NBEST         Number of faces to select among those
                          detected (0=auto) (default: 0)
    --score SCORE         Minimum confidence score to select detected
                          faces (default: 0.2)
    --portrait PORTRAIT   Consider the portrait instead of the face in
                          outputs (default: False)
    --csv CSV             Save coordinates of detected faces in a CSV
                          file instead of XRA (default: False)
    --folder FOLDER       Save result in a folder with image files --
                          if video input only (default: False)
    --tag TAG             Surround the detected faces in the output
                          image (default: True)
    --crop CROP           Save detected faces in cropped images
                          (default: False)
    --width WIDTH         Resize all the cropped images to a fixed
                          width (0=no) (default: 0)
    --height HEIGHT       Resize all the cropped images to a fixed
                          height (0=no) (default: 0)
    --model:opencv_face_detector_uint8.pb MODEL
                          Enable the opencv's ANN TensorFlow model.
                          (default: True)
    --model:haarcascade_frontalface_alt.xml MODEL
                          Enable the opencv's HaarCascade Frontal face model.
                          (default: True)
    --model:res10_300x300_ssd_iter_140000_fp16.caffemodel MODEL
                          Enable the opencv's ANN Caffe model.
                          (default: False)
    --model:haarcascade_profileface.xml MODEL
                          Enable the opencv's HaarCascade Profile face model.
                          (default: False)
    --model:mediapipe MODEL:MEDIAPIPE
                          Enable the MediaPipe Face Detection system.
                          (default: False)
  This program is part of SPPAS version 4.2. Copyright (C) 2011-2021
  Brigitte Bigi. Contact the author at: contact@sppas.org

Examples of use

python3 ./sppas/bin/facedetection.py -I ./samples/faces/BrigitteBigi_Aix2020.png
 --tag=True --crop=True --csv=True --portrait=True
  [INFO] Logging redirected to StreamHandler (level=0).
  [INFO] SPPAS version 3.5
  [INFO] Copyright (C) 2011-2021 Brigitte Bigi
  [INFO] Web site: https://sppas.org/
  [INFO] Contact: Brigitte Bigi (contact@sppas.org)
  [INFO]  * * * Annotation step 0 * * *
  [INFO] Number of files to process: 1
  [INFO] Options:
  [INFO]  ... inputpattern:
  [INFO]  ... outputpattern: -face
  [INFO]  ... nbest: 0
  [INFO]  ... score: 0.2
  [INFO]  ... portrait: True
  [INFO]  ... csv: True
  [INFO]  ... tag: True
  [INFO]  ... crop: True
  [INFO]  ... width: 0
  [INFO]  ... height: 0
  [INFO] File BrigitteBigi_Aix2020.png: Valid.
  [INFO]  ...  ... 3 faces found.
  [INFO]  ... ./samples/faces/BrigitteBigi_Aix2020-face.jpg

It creates the following 5 files in the samples/faces folder:

  • BrigitteBigi_Aix2020-face.jpg with the 3 detected faces surrounded by a square indicating the detection score
  • BrigitteBigi_Aix2020_1-face.jpg: image of the face with the highest score
  • BrigitteBigi_Aix2020_2-face.jpg: image of the face with the middle score
  • BrigitteBigi_Aix2020_3-face.jpg: image of the face with the worst score
  • BrigitteBigi_Aix2020-face.csv: contains three lines, with the coordinates (x,y) and (w,h) in columns 3-7, then the confidence score in column 8, ranging in [0., 1.]

Notice that the image contains three faces and their positions are properly found.

Face Identity

This is a new and original automatic annotation, but it’s still in progress. It has to be evaluated.

Overview

The Face Identity automatic annotation assigns a person identity to the detected faces of a video. It takes as input a video and a CSV file with the coordinates of the detected faces. It produces a CSV file with the coordinates of the identified faces. Assigned person names are of the form id-00x. Obviously, the CSV file can be edited, and such names can be changed a posteriori.

This annotation requires enabling the video feature in the setup, because it requires the external Python libraries numpy and opencv-contrib.

No external resources are needed.

Perform annotation with the GUI

It is a STANDALONE annotation.

The Face Identity process takes as input a video file. To perform the annotation, click on the Face Identity activation button and click on the Configure… blue text to fix options.

Perform with the CLI

faceidentity.py is the program to perform the Face Identity annotation of a given video file, if the corresponding CSV file with the detected faces exists.

Usage

usage: faceidentity.py [files] [options]
  optional arguments:
    -h, --help            show this help message and exit
    --quiet               Disable the verbosity
    --log file            File name for a Procedure Outcome Report (default: None)
  Files:
    -i file               Input video.
    -c file               Input CSV file with face coordinates and sights.
    -o file               Output base name.
    -I file               Input file name (append).
    -e .ext               Output file extension. One of: .mp4 .avi .mkv
  Options:
    --inputpattern INPUTPATTERN
    --inputoptpattern INPUTOPTPATTERN (default: -face)
    --outputpattern OUTPUTPATTERN (default: -ident)

Face Landmarks

This is a state-of-the-art implementation of face landmark detection performed with freely available models and tools. We introduced a solution to combine the results when several methods are used. The combined result was not evaluated.

Overview

SPPAS uses both the MediaPipe Face Mesh and the OpenCV facial landmark API called Facemark. The latter includes three different implementations of landmark detection, based on three different papers:

  • FacemarkKazemi: This implementation is based on a paper titled One Millisecond Face Alignment with an Ensemble of Regression Trees by V. Kazemi and J. Sullivan, published in CVPR 2014.
  • FacemarkAAM: This implementation uses an Active Appearance Model and is based on the paper titled Optimization problems for fast AAM fitting in-the-wild by G. Tzimiropoulos and M. Pantic, published in ICCV 2013.
  • FacemarkLBF: This implementation is based on a paper titled Face alignment at 3000 fps via regressing local binary features by S. Ren, published in CVPR 2014.

The fundamental concept is that any person has 68 particular points on the face (called sights). SPPAS is able to launch several of these methods and to combine their results into a single and hopefully better one. Actually, SPPAS launches MediaPipe Face Mesh and extracts the 68 sights among the 468 that are detected; then this result is combined (weight=6) with the 68 sights of the LBF detection method (weight=1).
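
A minimal sketch of this weighted combination, assuming both methods return their 68 (x, y) sights in the same order (not the actual SPPAS code):

    def combine_sights(mesh_sights, lbf_sights, w_mesh=6, w_lbf=1):
        """Point-by-point weighted average of two lists of 68 sights."""
        total = w_mesh + w_lbf
        return [((w_mesh * x1 + w_lbf * x2) / total,
                 (w_mesh * y1 + w_lbf * y2) / total)
                for (x1, y1), (x2, y2) in zip(mesh_sights, lbf_sights)]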

This annotation requires both enabling the video feature in the setup, to install the external libraries numpy and opencv-contrib, and checking facemark in the list of annotations to be installed. Two different models will be downloaded and used: a Kazemi one and an LBF one.

Perform annotation with the GUI

It is a STANDALONE annotation.

The Face Sights process takes as input an image file and/or a video. To perform the annotation, click on the Face Sights activation button and click on the Configure… blue text to fix options.

Perform with the CLI

usage: facesights.py [files] [options]
  optional arguments:
    -h, --help            show this help message and exit
    --quiet               Disable the verbosity
    --log file            File name for a Procedure Outcome Report
  Files:
    -i file               Input image.
    -o file               Output base name.
    -I file               Input file name (append).
    -r model              Landmark model name (Kazemi, LBF or AAM)
    -R model              FaceDetection model name
    -e .ext               Output file extension.
  Options:
    --inputpattern INPUTPATTERN
    --inputoptpattern INPUTOPTPATTERN (default: -face)
    --outputpattern OUTPUTPATTERN (default: -sights)
  

Cued speech - LfPC

This automatic annotation is currently under development.

It is an in-progress project and currently only a Proof of Concept is distributed. It must not be used for any final application or evaluation.

The Cued Speech annotation can only be used in order to test it and to contribute to the project.

Overview

Speech reading, or lip-reading, requires watching the lips of a speaker and is used for the understanding of the spoken sounds. However, various sounds have the same lip movement, which implies a lot of ambiguities. In 1966, R. Orin Cornett invented Cued Speech, a visual system of communication. It adds information about the pronounced sounds that is not visible on the lips.

Thanks to this code, speech reading is made easier since the Cued Speech (CS) keys match all the spoken phonemes, while phonemes with the same lip movement have different keys. Actually, from both the hand position on the face (representing vowels) and the hand shapes, known as cues (representing consonants), CV syllables can be represented. So, a single CV syllable will be generated or decoded through both the lips position and the key of the hand.

LfPC is the French acronym for Langue française Parlée Complétée.

The conversion of phonemes into keys of CS is performed using a rule-based system. This RBS phoneme-to-key segmentation system is based on the single principle that a key is always of the form CV.
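
As an illustration of this single principle only (the vowel set, the rule set and the treatment of isolated consonants or vowels are assumptions here, not the distributed rules), a phoneme-to-key grouping could look like this:

    VOWELS = {"a", "e", "i", "o", "u", "y", "E", "O", "2", "9", "@"}  # simplified

    def to_cv_keys(phonemes):
        """Group phonemes into CV keys: a consonant is paired with the
        vowel that follows it; a lone consonant or vowel still forms a
        key, the missing part being left empty (rendered by a neutral
        shape or position)."""
        keys, i = [], 0
        while i < len(phonemes):
            p = phonemes[i]
            if p not in VOWELS and i + 1 < len(phonemes) \
                    and phonemes[i + 1] in VOWELS:
                keys.append((p, phonemes[i + 1]))   # CV key
                i += 2
            elif p in VOWELS:
                keys.append(("", p))                # V alone
                i += 1
            else:
                keys.append((p, ""))                # C alone
                i += 1
        return keys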

This annotation requires both enabling the video feature in the setup, to install the external libraries numpy and opencv-contrib, and checking cuedspeech in the list of annotations.

Perform annotation with the GUI

It is a STANDALONE annotation.

The annotation process takes as input a -palign file and optionally a video. To perform the annotation, click on its activation button and click on the Configure… blue text to fix options.

Perform with the CLI

usage: cuedspeech.py [files] [options]
  optional arguments:
    -h, --help            show this help message and exit
    --quiet               Disable the verbosity
    --log file            File name for a Procedure Outcome Report
                          (default: None)
  Files:
    -i file               Input time-aligned phonemes file name.
    -v file               Input video file name.
    -o file               Output file name with Cued Speech key codes.
    -r rules              File with Cued Speech keys description
    -I file               Input file name (append).
    -l lang               Language code (iso639-3). One of: fra.
    -e .ext               Output file extension. One of: .xra .TextGrid
                          .eaf .ant .antx .mrk .lab .srt .sub .vtt .ctm
                          .stm .csv .txt
  Options:
    --inputpattern1 INPUTPATTERN1
                          Pattern of the file with time-aligned phonemes
                          (default: -palign)
    --inputpattern2 INPUTPATTERN2
                          Pattern of the video file (default: )
    --inputpattern3 INPUTPATTERN3
                          Pattern of the file with sights of the face of
                          the person (default: -sights)
    --outputpattern OUTPUTPATTERN
                          Pattern of the output files (default: -cuedsp)
    --createvideo CREATEVIDEO
                          Tag the video with the code of the key
                          (needs video+csv) (default: False)