Automated Annotation of
Phonetics, Syntax and Prosody in a Spoken Corpus

Brigitte Bigi, CRHC


ACORPO Workshop, 12 June 2025

Scientific Background of SPPAS

Speech and Language: interdisciplinary research

Experimental data collection, annotation, and analysis

Centre d'Expérimentation de la Parole
https://www.lpl-aix.fr
  1. [1h] SPPAS: why, what for, and for whom?
    • Overview to get to know and understand SPPAS
    • Hands-on: align phonemes and words to the audio signal
  2. [15min] MarsaTag: syntactic analysis of spoken language
    • Overview
    • Hands-on: align syntax analysis to the audio signal
  3. [15min] Other phonetic and prosodic annotations
    • Overview
  4. [30min] Other features
    • Overview
    • Hands-on: file conversion and tier filtering

PART 1

SPPAS: why, what for, and for whom?

Getting to Know SPPAS

SPPAS: The automatic annotation and analysis of speech

SPPAS-4.25 in a few numbers

30, 170, 550, 1400, 110k, 90k and 1.

Open Science award - 2022

SPPAS: honourable mention (accessit), special jury prize, open science competition for free software

Getting SPPAS

SPPAS Citations

Google Scholar

SPPAS is only cited for its original phonetic segmentation function
... and almost exclusively by the initially targeted users!

With one exception: SPPAS was used for lip sync in Final Fantasy VII Remake

Getting to Know Speech Segmentation

Speech Segmentation

Speech segmentation is the process of taking the orthographic transcription of a speech segment, such as an Inter-Pausal Unit (IPU), and determining where each phoneme and word occurs in that segment.
Example of a speech segmentation result (CLeLfPC corpus)
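
A segmentation result can be pictured as a list of labelled time intervals. The sketch below only illustrates that idea: the `Interval` class, the times, and the SAMPA-like labels are hypothetical, and none of them comes from the SPPAS API or from the CLeLfPC corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A labelled stretch of audio, in seconds (hypothetical structure)."""
    start: float
    end: float
    label: str

    @property
    def duration(self) -> float:
        return self.end - self.start


# Invented example: the word "oui" and its phonemes inside one IPU.
words = [Interval(0.50, 0.74, "oui")]
phonemes = [Interval(0.50, 0.58, "w"), Interval(0.58, 0.74, "i")]


def phonemes_of(word: Interval, tier: list[Interval]) -> list[str]:
    """Return the labels of the phonemes lying within the word boundaries."""
    return [p.label for p in tier if p.start >= word.start and p.end <= word.end]


print(phonemes_of(words[0], phonemes))
```

Keeping words and phonemes as separate lists of intervals mirrors the idea of one annotation tier per level.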

In 2011... Speech Segmentation

Reality Check: Two Disciplines, Two Practices

Two communities studying speech
Slide from the first SPPAS presentation at LPL in 2012

It's only one purpose but... multiple needs!

Getting SPPAS: Your Data, Your Way

SPPAS allows users to adapt the software to their needs

Theoretical position: favor methods that are as language- and task-independent as possible.

Solution example:
Resources can be modified

SPPAS resources are under open licenses and in readable file formats.

Example of open and editable resource:

Solution example:
Adapting to a New Language Means Creating 4 Resources

Put all of this in a .zip file, write the documentation, and that’s it!

Solution example:
a specific item for the hesitation ("fp")

si on regarde euh
euh la CSCE etc vous avez à jamais à aucun moment justement euh
euh au n- niveau des délimitations euh étatiques vous n'avez justement euh
euh cette superposition
euh donc euh
il y a euh
euh de ce point de vue là euh un flou et euh bon moi je suis assez euh

14 "euh" in 18 seconds...
In this corpus, they represent 6% of the tokens.

Solution example:
a specific item for laughter

j'ai emprunté des livres à la b.u. j'ai déjà reçu le mail comme quoi qu'il faut que je les rende je les ai même pas ouverts @ @ @ c’est clair * je te jure @ c’est ça @ moi aussi @ c’est pareil j'ai reçu genre mais en plus j'en ai commandé euh quoi peut être huit quoi rien de (en)fin ridicule quoi @ @ d'où j'ai le temps de lire déjà rien que un livre @ j'en ai commandé huit quoi rien à voir (en)fin n’importe quoi du coup euh du coup ouais
9 laughs in 19 seconds...
In this corpus, they represent 4% of the tokens.
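
Counts and shares like the ones above can be reproduced with a few lines of code. This is a minimal sketch, assuming that tokens are separated by whitespace; the function name `item_share` is hypothetical, and the two inputs are short fragments of the excerpts, not the full corpus behind the reported percentages.

```python
def item_share(transcript: str, item: str) -> tuple[int, float]:
    """Count the tokens equal to `item` and their share of all tokens."""
    tokens = transcript.split()  # assumption: whitespace tokenization
    count = sum(1 for token in tokens if token == item)
    share = count / len(tokens) if tokens else 0.0
    return count, share


# Two short fragments quoted from the excerpts above.
hesitations = item_share("euh donc euh", "euh")
laughs = item_share("je te jure @ c'est ça @", "@")
print(hesitations, laughs)
```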

Getting Ready for SPPAS: Preparing Your Data

Step 1: Recording speech

Step 2: Search for Inter-Pausal Units
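
To give an idea of what an IPU search involves, here is a simplified energy-threshold sketch. It is not the SPPAS algorithm: the function name `find_ipus`, the window length, the threshold, and the minimum pause duration are all assumptions chosen for illustration.

```python
import math


def find_ipus(samples, rate, win=0.02, threshold=0.05, min_sil=0.2):
    """Return (start, end) times, in seconds, of the sounding stretches."""
    size = max(1, round(win * rate))            # samples per analysis window
    min_windows = max(1, round(min_sil / win))  # minimum pause, in windows
    # Root-mean-square energy of each consecutive window.
    frames = [samples[i:i + size] for i in range(0, len(samples), size)]
    rms = [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]

    ipus, start, silence = [], None, 0
    for k, value in enumerate(rms):
        if value >= threshold:
            if start is None:
                start = k                       # a new unit begins here
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_windows:          # pause long enough: close unit
                ipus.append((start * win, (k - silence + 1) * win))
                start, silence = None, 0
    if start is not None:                       # unit still open at the end
        ipus.append((start * win, (len(rms) - silence) * win))
    return ipus


def tone(n):
    """Synthetic 'speech': alternating samples with an RMS of 0.5."""
    return [0.5 if i % 2 == 0 else -0.5 for i in range(n)]


# At 1000 Hz: 0.5 s silence, 1 s sound, 0.5 s silence, 0.5 s sound.
signal = [0.0] * 500 + tone(1000) + [0.0] * 500 + tone(500)
ipus = find_ipus(signal, rate=1000)
print(ipus)
```

Pauses shorter than `min_sil` do not split a unit, which is why the result of an automatic search still deserves a manual check.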

Step 3: Orthographic Transcription

Step 3: Enriched Orthographic Transcription

A transcription convention that allows SPPAS to process speech events.
Hesitation:
Laughter:
Unknown or regional words:
Hypo-articulation:
Repairs, repetitions, truncated words:
Others: elisions, noises, etc.
non mais @ je sais pas tu ne tu te vois nous parler + on- moi je nous par- je n- @ je nous parlais
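
The example above can be read token by token. The sketch below is a hypothetical classifier, not SPPAS code: "@" for laughter, "euh" for hesitation, and a trailing hyphen for truncated words follow the slides, while reading "+" as a short pause is an assumption.

```python
from collections import Counter


def classify(token: str) -> str:
    """Map one transcription token to an event type."""
    if token == "@":
        return "laughter"
    if token == "euh":
        return "hesitation"
    if token == "+":
        return "short pause"      # assumption: "+" marks a short pause
    if token.endswith("-"):
        return "truncated word"
    return "word"


example = ("non mais @ je sais pas tu ne tu te vois nous parler + "
           "on- moi je nous par- je n- @ je nous parlais")
events = [(token, classify(token)) for token in example.split()]
counts = Counter(kind for _, kind in events)
print(counts)
```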

Result example

et ouais mais de toute façon et en plus c'est euh tu euh
Audio representation

Getting the Data Extract for Practice

Cheese! Corpus

Bibliographic reference for Cheese!

Step 1: MA-PC Dialog Extract

Audio waveform

Step 2: Search for IPUs

Audio waveform and IPUs

Step 3: IPUs Manually corrected and Transcribed (with Praat)

Audio waveform and Transcript

Getting started with SPPAS

Communicate with SPPAS

  1. The most powerful solution is Python and the SPPAS API.
  2. Another reproducible solution is the embedded Python programs:
    • See the files in sppas/bin and sppas/scripts
  3. The most commonly used solution is the Graphical User Interface (GUI):
    • Double-click the executable file: sppas.bat under Windows, sppas.command under macOS/Linux

SPPAS GUI: the Log window

SPPAS displays its messages here instead of opening multiple modal dialogs each time you perform an action!

SPPAS GUI - Log window
A red or orange message should alert you.

SPPAS GUI: the Main window is a notebook

SPPAS GUI - Welcome tab
SPPAS GUI - Main window - Welcome tab

List of ALL Tasks to Perform any Automatic Annotation
Speech segmentation included

  1. In the "Files" tab:
    • add the files to be annotated;
    • check them;
    SPPAS GUI - Files tab
  2. In the "Annotate" tab:
    • follow the steps;
    • click “Let’s go!”.
    SPPAS GUI - Annotate tab
  3. Wait and read the report (yes, that’s really it: there's nothing more!)

STANDALONE Annotations Required for Speech Segmentation

  • Required tasks:
    1. Text normalization
    2. Phonetization
    3. Alignment


In the interface, check one or more of these annotations, then click “Let’s go!”.
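
The way the three required tasks feed each other can be sketched as a toy pipeline. Everything here is hypothetical and much simpler than SPPAS: the mini-lexicon, the function names, and above all the alignment step, which merely gives every phoneme the same duration where a real aligner relies on acoustic models.

```python
# Hypothetical mini-lexicon (SAMPA-like); real resources are far larger.
LEXICON = {"et": "e", "ouais": "w E", "mais": "m E"}


def normalize(text: str) -> list[str]:
    """Toy text normalization: lowercase and split on whitespace."""
    return text.lower().split()


def phonetize(tokens: list[str]) -> list[list[str]]:
    """Toy phonetization: dictionary lookup, "<UNK>" for unknown words."""
    return [LEXICON.get(token, "<UNK>").split() for token in tokens]


def align(phones: list[list[str]], start: float, end: float):
    """Placeholder alignment: every phoneme gets the same duration."""
    flat = [p for word in phones for p in word]
    if not flat:
        return []
    step = (end - start) / len(flat)
    return [(start + i * step, start + (i + 1) * step, p)
            for i, p in enumerate(flat)]


tokens = normalize("Et ouais mais")
phones = phonetize(tokens)
aligned = align(phones, start=0.0, end=1.0)
print(aligned)
```

Each step consumes the output of the previous one, which is why alignment cannot run without normalization and phonetization.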

Files after Speech Segmentation

SPPAS GUI - Files tab

PART 2

MarsaTag: syntactic analysis of spoken language

POS-Tagging

MarsaTag: a tool for French POS-Tagging


Stéphane Rauzy
Reference MarsaTag - French POS-Tagging

MarsaTag plugin for SPPAS

  1. Download and install MarsaTag from Ortolang:
    https://www.ortolang.fr/market/tools/sldr000841

  2. Download and install MarsaTag-Plugin:
    https://sppas.org/download.html
MarsaTag download plugin

Use MarsaTag plugin

  1. In the "Files" tab:
    • add the files to be annotated;
    • check them;
    SPPAS GUI - Files tab - check files with palign patterns
  2. In the "Plugins" tab:
    • click the plugin logo and follow the instructions;
    • click “Let’s go!”.
    SPPAS GUI - Plugins tab
  3. Wait and read the report

MarsaTag annotation result

SPPAS GUI - Edit tab - POS-tagging annotations

PART 3

Other phonetic and prosodic annotations

Momel and INTSINT

Proposed by Daniel Hirst, Emeritus Research Director at CNRS, Laboratoire Parole et Langage.

Photo Daniel Hirst

Momel (modelling melody)

F0
F0 + Momel anchors

INTSINT: an INternational Transcription System for INTonation

INTSINT

Data preparation

Praat To Pitch
Praat Down To PitchTier

Momel and INTSINT annotation process

  1. In the "Files" tab:
    • add the files to be annotated;
    • check them;
  2. In the "Annotate" tab:
    • follow the steps, including checking the "Momel" and "INTSINT" STANDALONE annotations;
    • click “Let’s go!”.
  3. Wait and read the report

Momel annotation result

Praat Momel result

INTSINT annotation result

Praat INTSINT result

Syllables

Description

Syllables example

Syllables annotation process

  1. In the "Files" tab:
    • add the files to be annotated;
    • check them;
  2. In the "Annotate" tab:
    • follow the steps, including checking the "Syllabification" STANDALONE annotation;
    • click “Let’s go!”.
  3. Wait and read the report

Syllables annotation result

SPPAS GUI - Edit Tab - Syllabification result

Time Group Analyzer

Proposed by Dafydd Gibbon, Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Germany.

Photo Dafydd Gibbon

Time Group Analyzer

Analysis of speech rhythm based on syllable durations
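
As an illustration of rhythm measures computed from syllable durations, the sketch below implements two classic ones: the normalized Pairwise Variability Index (nPVI) and the slope of a least-squares line fitted to the durations. Treating these as representative of the TGA output is an assumption, and the function names and duration values are hypothetical.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index of consecutive durations."""
    if len(durations) < 2:
        return 0.0
    terms = [abs(a - b) / ((a + b) / 2)
             for a, b in zip(durations, durations[1:])]
    return 100 * sum(terms) / len(terms)


def slope(durations):
    """Least-squares slope of duration against syllable position."""
    n = len(durations)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(durations) / n
    num = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(durations))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Invented syllable durations (seconds) inside one time group.
group = [0.12, 0.15, 0.14, 0.22, 0.31]
print(npvi(group), slope(group))
```

A positive slope indicates that syllables get longer towards the end of the group (deceleration); a negative slope indicates acceleration.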

TGA annotation process

  1. In the "Files" tab:
    • add the files to be annotated;
    • check them;
  2. In the "Annotate" tab:
    • follow the steps, including checking the "TGA" STANDALONE annotation;
    • click “Let’s go!”.
  3. Wait and read the report

TGA annotation result

SPPAS GUI - Edit Tab - TGA result

The full SPPAS annotation workflow

SPPAS workflow

PART 4

Other features

File conversion

SPPAS is interoperable by design

SPPAS formats

Convert files

  1. In the "Files" tab:
    • add the files to be converted;
    • check them;
  2. In the "Convert" tab:
    • Check the expected file format;
    • click “Perform the conversion”.
  3. Wait and read the report at the bottom of the tab
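
To show what a conversion does under the hood, here is a minimal sketch that serializes a list of labelled intervals as a one-tier Praat TextGrid in the long text format. It is not the SPPAS converter: the function name `to_textgrid` and the sample rows are hypothetical, and the intervals are assumed to be contiguous and sorted.

```python
def to_textgrid(tier_name, intervals):
    """Serialize (start, end, label) tuples as a one-tier TextGrid string."""
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (start, end, label) in enumerate(intervals, start=1):
        text = label.replace('"', '""')  # Praat doubles quotes inside strings
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{text}"',
        ]
    return "\n".join(lines) + "\n"


# Invented rows, e.g. read from a CSV export.
rows = [(0.0, 0.4, "et"), (0.4, 0.9, "ouais")]
textgrid = to_textgrid("Tokens", rows)
print(textgrid)
```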

Convert tab

SPPAS Convert tab

Conversion result

SPPAS Convert tab - report

Filtering annotated data

Filtering annotated data

Filtering annotated data: example (1)

Filtering annotated data: example (2)
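
The idea behind filtering can be sketched in a few lines: keep only the annotations whose label and duration match given criteria. This is a hypothetical illustration, not the SPPAS filter system; the function name `filter_tier` and the sample tier are invented.

```python
def filter_tier(tier, labels=None, min_dur=0.0):
    """Keep the (start, end, label) intervals matching both criteria."""
    return [
        (start, end, label)
        for start, end, label in tier
        if (labels is None or label in labels) and (end - start) >= min_dur
    ]


# Invented token tier, times in seconds.
tier = [(0.0, 0.3, "euh"), (0.3, 0.5, "donc"), (0.5, 1.1, "euh"), (1.1, 1.3, "@")]

# Hesitations lasting at least half a second.
long_hesitations = filter_tier(tier, labels={"euh"}, min_dur=0.5)
print(long_hesitations)
```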

Annexes

References

About

License

CC BY-NC-ND license logo

This document is a creative work, the exclusive property of LPL, protected by French and international intellectual property law, and licensed under CC BY-NC-ND (Attribution / Non-Commercial / No Derivatives).

This license permits any distribution (sharing, copying, reproducing, distributing, communicating), except for commercial purposes, by any means and in any format, provided that the work is distributed without modification and in its entirety.

You are free to copy, distribute, and transmit this document, provided that you credit the SPPAS project.