Introduction
About the author
- SPPAS is a scientific software package written and maintained by Brigitte Bigi
- CNRS researcher at the Laboratoire Parole et Langage, in Aix-en-Provence, France

- Research topics are related to:
- Multimodal corpora: collect, annotate, analyze
- Multilinguality
Corpus and annotation
- Corpus linguistics is the study of language as expressed in samples (corpora) of “real world” language.
- Corpus annotation is a path to greater linguistic understanding.
Corpus annotation “can be defined as the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data. ‘Annotation’ can also refer to the end-product of this process” (Leech, 1997).
Annotations
- Annotation is not an end in itself: it is a basis for further analyses
- Annotations should be time-synchronized. Temporal information makes it possible to describe simultaneous behaviors:
- of different levels in an utterance (e.g., prosody and locution)
- of different modalities (e.g. speech and gesture)
- of different speakers or extralinguistic events

Annotation software
- Manual annotation
- Audacity, Praat, Elan, Anvil, Winpitch, AnnotationPro…
- Automatic annotation:
- The current state-of-the-art in Computational Linguistics allows many annotation tasks to be semi- or fully automated:
- MarsaTag (morpho-syntax for French)
- MediaPipe
- …
- Each annotation that can be done automatically must be done automatically!
- Because revising is faster and easier than annotating… if the automatic system is “good enough”.
Before using any automatic annotation tool/software, it is important to consider its error rate (where applicable) and to estimate how those errors will affect the intended use of the annotated corpus.
SPPAS is an awarded Research Software
- Research on global approaches:
- methods as language-independent as possible;
- the possibility to adapt technologies to low-resourced languages.
- Award “Accessit” Special Jury Prize - 2022:
- The Ministry of Higher Education, Research and Innovation presented the 1st Open Science Awards for Open Source Research Software.
Multi-Lingual approaches to the automatic annotation of speech
- SPPAS is designed and developed to handle multiple language corpora and/or tasks with the same algorithms in the same software environment.
- SPPAS emphasizes new practices in the methodology of tool development:
- considering problems with a generic multilingual aspect,
- sharing resources,
- putting the end-users in control of their own computing.
- Only the resources are language-specific, and the approach is based on the simplest resources possible.
Extending the resources
- Phoneticians are of crucial importance for resource development
- they can contribute to improving the resources used by automatic systems.
- New versions are systematically released to the public, to the benefit of the whole community.
Extending the resources (continued)
- Resources are distributed under the terms of a public license, so that SPPAS users:
- have free access to the application source code and the resources of the software they use,
- are free to share the software and resources with other people,
- are free to modify the software and resources,
- are free to publish their modified versions of the software and resources.
SPPAS: Main reference to cite
Brigitte Bigi (2015).
SPPAS - Multi-lingual Approaches to the Automatic Annotation of Speech.
In "the Phonetician" - International Society of Phonetic Sciences,
ISSN 0741-6164, Number 111-112 / 2015-I-II, pages 54-69.

You reached the end of this tutorial!
Corpus creation methodology
The context
- SPPAS was created for the automatic annotation and exploration of large multi-modal corpora
- A workflow was defined while creating CID - Corpus of Interactional Data
- It was then applied to create new corpora

The corpus creation workflow

Step 1: Recording speech
- One channel per speaker
- Anechoic room, or an environment with no/low noise
- Audio: for automatic annotation tools:
- Any uncompressed file format, commonly WAVE
- 16,000 Hz / 16 bits is enough
- Audio: for manual annotation tools:
- Any uncompressed file format
- 48,000 Hz / 16 bits is of high quality
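A quick way to verify these recommendations before launching any automatic annotation is to read the WAV header. The sketch below, using Python's standard wave module, is only illustrative; the file name is a placeholder.

```python
# Minimal sketch: check that a recording is mono, 16-bit, and at least
# 16,000 Hz before using it with automatic annotation tools.
import wave

def check_recording(path):
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()      # 1 = mono, 2 = stereo
        width = w.getsampwidth()         # bytes per sample (2 = 16 bits)
        rate = w.getframerate()          # samples per second
    print(f"{path}: {channels} channel(s), {8 * width} bits, {rate} Hz")
    return channels == 1 and width == 2 and rate >= 16000

check_recording("speaker1.wav")          # "speaker1.wav" is a placeholder name
```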
Step 2: Search for Inter-Pausal Units
- Parameters to define manually:
- fix the minimum silence duration
- fix the minimum speech duration
- As a result:
- speech and silences are time-aligned automatically
- a manual verification is highly recommended
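To make the two parameters concrete, here is a toy, self-contained sketch of the underlying idea: classify short windows as speech or silence from their RMS energy, then apply a minimum silence duration and a minimum speech duration. The window size, threshold and default durations are illustrative assumptions, not the values or the algorithm used by SPPAS.

```python
# Toy IPU search on a mono 16-bit WAV file; values are illustrative only.
import array
import wave

def detect_ipus(path, win=0.02, rms_threshold=300.0,
                min_sil_dur=0.25, min_ipu_dur=0.30):
    """Return a list of (start, end) speech segments, in seconds."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))

    step = max(1, int(win * rate))
    # 1) one speech/silence decision per window, from RMS energy
    flags = []
    for i in range(0, len(samples), step):
        chunk = samples[i:i + step]
        rms = (sum(s * s for s in chunk) / len(chunk)) ** 0.5
        flags.append(rms > rms_threshold)

    # 2) turn consecutive speech windows into segments
    segments, start = [], None
    for i, is_speech in enumerate(flags + [False]):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * win, i * win))
            start = None

    # 3) merge segments separated by a silence shorter than min_sil_dur,
    #    then drop segments shorter than min_ipu_dur
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_sil_dur:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return [s for s in merged if s[1] - s[0] >= min_ipu_dur]

print(detect_ipus("speaker1.wav"))       # placeholder file name
```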
Step 3: Orthographic Transcription
- Any transcription must follow a convention
- In speech (particularly in spontaneous speech), many phonetic variations occur:
- some of these phonologically known variants are predictable
- but many others are still unpredictable (especially invented words, regional words or words borrowed from another language)
- The orthographic transcription must be enriched:
- it must be a representation of what is “perceived” in the signal.
Step 4: Speech segmentation
- In SPPAS, this problem is divided into three different tasks:
- Text normalization
- Phonetization
- Alignment
- Allows a fully automatic annotation, or a semi-automatic one
- See the tutorial ‘Phonemes and words segmentation’ for details
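For intuition only, here is a toy, self-contained sketch of the three tasks on a single IPU. The mini pronunciation dictionary and the even time distribution in the alignment step are illustrative assumptions; SPPAS relies on language-specific resources and an acoustic model for the real thing (see the dedicated tutorial).

```python
# Toy illustration of normalization -> phonetization -> alignment on one IPU.
PRON_DICT = {"hello": "h-@-l-oU", "world": "w-3:-l-d"}   # hypothetical mini-dictionary

def normalize(utterance):
    # strip punctuation and case, keep word tokens
    return [w.strip(".,?!").lower() for w in utterance.split() if w.strip(".,?!")]

def phonetize(tokens):
    # look up each token; a real system also needs rules for unknown words
    return [PRON_DICT.get(t, "UNK") for t in tokens]

def align(tokens, phones, ipu_start, ipu_end):
    # placeholder "alignment": spread the tokens evenly over the IPU duration
    dur = (ipu_end - ipu_start) / len(tokens)
    return [(round(ipu_start + i * dur, 3), round(ipu_start + (i + 1) * dur, 3), t, p)
            for i, (t, p) in enumerate(zip(tokens, phones))]

tokens = normalize("Hello, world!")
print(align(tokens, phonetize(tokens), ipu_start=0.0, ipu_end=1.2))
```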
Other steps:
- On the basis of the time-aligned phonemes/tokens, other automatic annotations can be produced:
- Syllables
- Repetitions
- TGA
- …
- Analysis:
- Filter data
- Statistics
- …
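As a small illustration of the "filter data / statistics" step, the sketch below reads a CSV export of a time-aligned tier and computes a mean token duration and the most frequent labels. The column layout (tier name, start, end, label), the tier name and the file name are assumptions made for the example, not a documented format.

```python
# Illustrative analysis of a CSV export of time-aligned annotations.
import csv
from collections import Counter

def token_stats(csv_path, tier="TokensAlign"):           # tier name is hypothetical
    durations, labels = [], Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for tier_name, start, end, label in csv.reader(f):
            # skip empty labels and (assumed) silence/pause labels
            if tier_name == tier and label.strip() not in ("", "#", "+"):
                durations.append(float(end) - float(start))
                labels[label] += 1
    mean = sum(durations) / len(durations) if durations else 0.0
    return mean, labels.most_common(10)

print(token_stats("speaker1.csv"))                       # placeholder file name
```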
Interested in knowing more?
- If you are interested in knowing more, you can see a full presentation here. It was created in 2015 and has not been updated since; however, it is general enough and can still be of great help:
You reached the end of this tutorial!
Data preparation for automatic annotations
(Step 1) Recording Speech
Recording audio
- The resolution of the capture devices (microphones, framerate, file format, software) has a determining influence on the quality of the corpus, and therefore on the annotations.
- Lack of standardization means that fewer researchers will be able to work with those signals.



Recording audio: software tools
- A short list of software tools we have already tested and checked:


Recording speech: SPPAS requirements
- audio file formats: wav, au
- an audio file is MONO: only one channel
- good recording quality is expected
- Example:
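If a recording was captured in stereo with one speaker per channel, it has to be split into mono files before being used. Below is a minimal sketch with Python's standard wave module, assuming a 16-bit stereo WAV; the file names are placeholders.

```python
# Split a 16-bit stereo WAV into two mono files, one per channel/speaker.
import array
import wave

def split_channels(path, out_left="speaker_L.wav", out_right="speaker_R.wav"):
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 2 and w.getsampwidth() == 2
        rate = w.getframerate()
        frames = array.array("h", w.readframes(w.getnframes()))
    # interleaved samples: even indices = left channel, odd indices = right
    for out, channel in ((out_left, frames[0::2]), (out_right, frames[1::2])):
        with wave.open(out, "wb") as o:
            o.setnchannels(1)
            o.setsampwidth(2)
            o.setframerate(rate)
            o.writeframes(channel.tobytes())

split_channels("dialogue_stereo.wav")     # placeholder file name
```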

(Step 2) Inter-Pausal Units
IPUs = sounding segments
- The orthographic transcription must be pre-segmented

How to do it?
- SPPAS can automatically perform the IPUs segmentation
- then manual verification is recommended
- see the related tutorial and/or the documentation for details
- but IPUs must be segmented manually:
- if the audio signal is of poor quality;
- if more than one speaker is recorded in the same channel.
(Step 3) Transcribing Speech
Orthographic transcription:
- must include:
- Filled pauses
- Short pauses
- Repeats
- Truncated words
- Noises
- Laughter
- should include:
- irregular elisions
- irregular liaisons
- specific pronunciations
- may include:
- all elisions
Orthographic transcription: SPPAS convention
- truncated words, noted as a '-' at the end of the token string (an ex- example)
- noises, noted by a ’*’
- laughter, noted by a ’@’
- short pauses, noted by a ’+’
- elisions, mentioned in parentheses
- specific pronunciations, noted with brackets [example,eczap]
- comments are noted inside braces or brackets without using a comma {this} or [this and this]
- liaisons, noted between ’=’ (an =n= example)
- morphological variants with <like,lie ok>
- proper name annotation, like $John S. Doe$
Transcription example 1 (Conversational speech)
- Manual orthographic transcription:
donc + i- i(l) prend la è- recette et tout bon i(l) vé- i(l) dit bon [okay, k]
- Automatically extracted standard orthography:
- donc il prend la recette et tout bon il dit bon okay
- Automatically extracted tokens:
- donc + i i prend la è recette et tout bon i vé i dit bon k
Transcription example 2 (Conversational speech)
- Manual orthographic transcription:
ah mais justement c’était pour vous vendre bla bla bla bl(a) le mec i(l) te l’a emboucané + en plus i(l) lu(i) a [acheté,acheuté] le truc et le mec il est parti j(e) dis putain le mec i(l) voulait
- Automatically extracted standard orthography:
- ah mais justement c’était pour vous vendre bla bla bla bla le mec il te l’a emboucané en plus il lui a acheté le truc et le mec il est parti je dis putain le mec il voulait
- Automatically extracted tokens:
- ah mais justement c’était pour vous vendre bla bla bla bl le mec i te l’a emboucané + en plus i lu a acheuté le truc et le mec il est parti j dis putain le mec i voulait
Transcription example 3 (GrenelleII)
- Manual orthographic transcription:
euh les apiculteurs + et notamment b- on ne sait pas très bien + quelle est la cause de mortalité des abeilles m(ais) enfin il y a quand même + euh peut-êt(r)e des attaques systémiques
- Automatically extracted standard orthography:
- les apiculteurs et notamment on ne sait pas très bien quelle est la cause de mortalité des abeilles mais enfin il y a quand même peut-être des attaques systémiques
- Automatically extracted faked orthography:
- euh les apiculteurs + et notamment b on ne sait pas très bien + quelle est la cause de mortalité des abeilles m enfin il y a quand même + euh peut-ête des attaques systémiques
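To show how the "standard orthography" and "tokens" versions can be derived from the enriched transcription, here is a simplified, self-contained sketch. It covers only part of the convention (elisions, [standard,faked] pronunciations, truncated words, pauses/noises/laughter); filled pauses such as "euh", repetition repairs and comments are not handled. SPPAS itself performs this derivation during text normalization; the demo string is a shortened version of example 1.

```python
# Simplified derivation of the two transcriptions from the enriched one.
import re

def standard_orthography(trs):
    trs = re.sub(r"\[([^,\]]+),[^\]]*\]", r"\1", trs)  # [okay, k] -> okay
    trs = re.sub(r"\(([^)]*)\)", r"\1", trs)           # i(l) -> il
    trs = re.sub(r"\S+-(?=\s|$)", "", trs)             # drop truncated words (ex-)
    trs = re.sub(r"[+*@]", "", trs)                    # drop pauses, noises, laughter
    return " ".join(trs.split())

def token_string(trs):
    trs = re.sub(r"\[[^,\]]+,\s*([^\]]*)\]", r"\1", trs)  # [okay, k] -> k
    trs = re.sub(r"\([^)]*\)", "", trs)                   # i(l) -> i
    trs = re.sub(r"-(?=\s|$)", "", trs)                   # ex- -> ex (keep hyphenated words)
    return " ".join(trs.split())

enriched = "donc + i- i(l) prend la è- recette et tout bon i(l) dit bon [okay, k]"
print(standard_orthography(enriched))
# -> donc il prend la recette et tout bon il dit bon okay
print(token_string(enriched))
# -> donc + i i prend la è recette et tout bon i dit bon k
```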
Annotated files: recommendations
- UTF-8 encoding only
- only US-ASCII characters in file names (and in the full path as well)
- Supported file formats to open/save (software, extension):
- SPPAS: xra
- Praat: TextGrid, PitchTier, IntensityTier
- Elan: eaf
- AnnotationPro: antx
- HTK: lab, mlf
- Sclite: ctm, stm
- Phonedit: mrk
- Excel/OpenOffice/R/…: csv
- Subtitles: sub, srt
- Supported file formats for import only (software, extension):
- Transcriber: trs
- Anvil: anvil
- Xtrans: tdf
- Audacity: txt
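Two of these recommendations are easy to check automatically before opening the files in any tool. The sketch below verifies that a path contains only US-ASCII characters and that the annotated file is valid UTF-8; the file name is a placeholder.

```python
# Check the path/encoding recommendations for an annotated file.
from pathlib import Path

def check_annotated_file(path):
    ok = True
    if not path.isascii():                       # str.isascii(), Python 3.7+
        print(f"Non-ASCII characters in the path: {path}")
        ok = False
    try:
        Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError:
        print(f"Not a valid UTF-8 file: {path}")
        ok = False
    return ok

check_annotated_file("corpus/speaker1.TextGrid")   # placeholder path
```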
Supported file formats

Always remember…
- Each step, from collecting the recordings to analyzing the annotations, depends on the previous one, and:
Garbage in, garbage out