The annotation workflow

Information

Limitation

Removing the human from the process... a nod to those who know!

Which annotations (in general)?

A very large number of dimensions have been annotated in the past on mono and multimodal corpora. To quote only a few, some frequent speech or language based annotations are speech transcript, segmentation into words, utterances, turns, or topical episodes, labeling of dialogue acts, and summaries; among video-based ones are gesture, posture, facial expression [...]. (Popescu-Belis, 2010)

Which annotations (in this tutorial)?

In this tutorial, we will report on:

  1. IPUs segmentation (automatic)
  2. Speech transcript (manual)
  3. Phonemes and words segmentation (automatic)
  4. Syllables segmentation (automatic)
  5. Repetitions detection (automatic)
  6. Morpho-syntax (automatic)
  7. Momel and INTSINT (automatic)
  8. Gestures (manual)

The annotation workflow: legend

Legend of the annotation workflow

The annotation workflow

The annotation workflow

The main principle is...

Garbage in, Garbage out.

Record

Capturing and recording multimodal data

The capture of multimodal corpora requires complex settings such as instrumented lecture and meeting rooms, containing capture devices for each of the modalities that are intended to be recorded, but also, most challengingly, requiring hardware and software for digitizing and synchronizing the acquired signals. (Popescu-Belis, 2010)

Recording Audio and Video

Recording Audio: some advice

Of course, record at 44100 Hz.
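A quick way to check that an existing recording matches this advice is to read the WAV header, for instance with Python's standard wave module. A minimal sketch (the file name is hypothetical):

    import wave

    # Hypothetical file name; adapt to your own recording.
    with wave.open("speaker1_track.wav", "rb") as w:
        print("channels:    ", w.getnchannels())              # e.g. one channel per speaker
        print("sample rate: ", w.getframerate(), "Hz")        # 44100 Hz recommended
        print("sample width:", 8 * w.getsampwidth(), "bits")  # 16 bits is a common choice
        print("duration:    ", w.getnframes() / float(w.getframerate()), "s")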

Recording Video: some advice

Synchronizing

A short list of software we already tested and checked:

IPUs Segmentation

IPUs Segmentation: definition

IPUs Segmentation: software

Example of IPUs segmentation: silences are annotated with # and speech intervals are filled with the IPU number
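The principle behind IPU segmentation can be sketched as energy-based silence detection: frames whose energy stays under a threshold for at least a minimum duration become a silence ('#'), and the remaining stretches of speech are numbered as IPUs. The following is only an illustration of that idea, not the SPPAS implementation; threshold and duration values are arbitrary:

    def ipu_segmentation(energies, threshold=0.01, min_sil_dur=0.2, frame_dur=0.01):
        """Toy IPU segmentation from per-frame RMS energies (one value per 10 ms frame).

        Returns (start, end, label) tuples: silences longer than min_sil_dur are
        labelled '#' and speech stretches are numbered ipu_1, ipu_2, ...
        """
        # 1) classify every frame as silent or not
        is_sil = [e < threshold for e in energies]

        # 2) group consecutive identical frames into runs (start_frame, end_frame, silent?)
        runs, start = [], 0
        for i in range(1, len(is_sil) + 1):
            if i == len(is_sil) or is_sil[i] != is_sil[start]:
                runs.append((start, i, is_sil[start]))
                start = i

        # 3) only silences longer than min_sil_dur are kept as '#' boundaries;
        #    shorter pauses are merged into the surrounding IPU
        min_frames = int(min_sil_dur / frame_dur)
        intervals, n = [], 0
        for begin, end, silent in runs:
            if silent and (end - begin) >= min_frames:
                intervals.append((begin * frame_dur, end * frame_dur, "#"))
            elif intervals and intervals[-1][2] != "#":
                s, _, lab = intervals[-1]
                intervals[-1] = (s, end * frame_dur, lab)     # extend the current IPU
            else:
                n += 1
                intervals.append((begin * frame_dur, end * frame_dur, "ipu_%d" % n))
        return intervals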

Orthographic Transcription


Orthographic Transcription for spontaneous speech

Enriched Orthographic Transcription


Enriched Orthographic Transcription: convention

Train yourself first to transcribe and to use the annotation software!

SPPAS transcription convention

Transcription example 1 (Conversational speech)

donc + i- i(l) prend la è- recette et tout bon i(l) vé- i(l) dit bon [okay, k]

Transcription example 2 (Conversational speech)

ah mais justement c'était pour vous vendre bla bla bla bl(a) le mec i(l) te l'a emboucané + en plus i(l) lu(i) a [acheté,acheuté] le truc et le mec il est parti j(e) dis putain le mec i(l) voulait

Transcription example 3 (GrenelleII)

euh les apiculteurs + et notamment b- on n(e) sait pas très bien + quelle est la cause de mortalité des abeilles m(ais) enfin il y a quand même + euh peut-êt(r)e des attaques systémiques

Enriched Orthographic Transcription of 3 corpora

http://sldr.org/sldr000786

Orthographic Transcription... to sum up

The automatic systems must be adapted to deal with EOT

Phonemes/Tokens time-alignment

Phonemes and Tokens time-alignment

Tokenization

Tokenization in SPPAS

The main steps of the text normalization proposed in SPPAS are:

Tokenization in SPPAS

This is + hum... an enrich(ed) transcription {loud} number 1!

(Bigi 2011)
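To give an idea of what such a normalization has to chain together, here is a minimal sketch (not the SPPAS tokenizer) applied to the example above: comments in braces and the short-pause symbol '+' are dropped, the elided part in parentheses is removed to keep the form actually pronounced, punctuation is stripped and the digit is converted to a word:

    import re

    def normalize_eot(utterance):
        """Toy normalization of an enriched orthographic transcription (EOT).

        Produces the 'as pronounced' token sequence; this is only a sketch of the
        kind of operations a text normalizer must chain, not the SPPAS tokenizer.
        """
        numbers = {"1": "one", "2": "two", "3": "three"}     # tiny demo lexicon
        text = re.sub(r"\{[^}]*\}", " ", utterance)          # drop {comments}
        text = text.replace("+", " ")                        # drop short-pause marks
        text = re.sub(r"\([^)]*\)", "", text)                # drop elided parts: enrich(ed) -> enrich
        text = re.sub(r"[!?.,;:]", " ", text)                # strip punctuation
        return [numbers.get(t, t.lower()) for t in text.split()]

    print(normalize_eot("This is + hum... an enrich(ed) transcription {loud} number 1!"))
    # ['this', 'is', 'hum', 'an', 'enrich', 'transcription', 'number', 'one']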

Phonetization

Converting written text into actual sounds, for any language, causes several problems that have their origins in the relative lack of correspondence between the spelling of lexical items and their sound contents.

Phonetization in SPPAS

(Bigi 2013)

Phonetization in SPPAS

By convention, spaces separate words, dots separate the phones of a word, and pipes separate the phonetic variants of a word.
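A hypothetical illustration of this convention (the words and SAMPA-like symbols below are invented for the example, not taken from the slides), parsed into words, variants and phones:

    # Hypothetical phonetization string following the convention
    # (spaces = word boundaries, dots = phone boundaries, pipes = pronunciation variants):
    phonetization = "D.@|D.i: b.I.g k.{.t"                      # "the big cat"

    words = []
    for word in phonetization.split():                          # words are space-separated
        variants = [v.split(".") for v in word.split("|")]      # variants on '|', phones on '.'
        words.append(variants)

    print(words)
    # [[['D', '@'], ['D', 'i:']], [['b', 'I', 'g']], [['k', '{', 't']]]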

Impact of the Orthographic Transcription on automatic phonetization

Alignment

Time-alignment process

Manual alignment has been reported to take between 11 and 30 seconds per phoneme. (Leung and Zue, 1984)

How to perform Speech Segmentation?

  1. Many freely available toolboxes, i.e. Speech Recognition Engines that can perform Speech Segmentation
    • HTK - Hidden Markov Model Toolkit
    • CMU Sphinx
    • Open Source Large Vocabulary CSR Engine Julius
    • ...

How to perform Speech Segmentation?

  1. Wrappers for such toolboxes:
    • Prosodylab-Aligner: python+HTK
    • P2FA: python+HTK
    • ...
  2. Web-services:
    • WebMAUS
    • Train&Align
    • ...

How to perform Speech Segmentation?

SPPAS (python+Julius), available for English, French, Italian, Spanish, Catalan, Polish, Japanese, Mandarin Chinese, Taiwanese, Cantonese
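Whatever engine is used, forced alignment essentially returns a list of labelled time intervals for phonemes and tokens. A minimal sketch (assuming the aligner already produced (start, end, label) tuples; the time values and tier name below are invented) that writes such a result as a Praat TextGrid interval tier:

    def write_textgrid(intervals, tier_name, path):
        """Write a single IntervalTier TextGrid from (start, end, label) tuples."""
        xmin, xmax = intervals[0][0], intervals[-1][1]
        lines = [
            'File type = "ooTextFile"',
            'Object class = "TextGrid"',
            '',
            'xmin = %f' % xmin,
            'xmax = %f' % xmax,
            'tiers? <exists>',
            'size = 1',
            'item []:',
            '    item [1]:',
            '        class = "IntervalTier"',
            '        name = "%s"' % tier_name,
            '        xmin = %f' % xmin,
            '        xmax = %f' % xmax,
            '        intervals: size = %d' % len(intervals),
        ]
        for i, (start, end, label) in enumerate(intervals, 1):
            lines += [
                '        intervals [%d]:' % i,
                '            xmin = %f' % start,
                '            xmax = %f' % end,
                '            text = "%s"' % label,
            ]
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")

    # Invented alignment of a silence followed by French "chat" /S a/:
    write_textgrid([(0.00, 0.35, "#"), (0.35, 0.47, "S"), (0.47, 0.61, "a")],
                   "PhonAlign", "chat-palign.TextGrid")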

Alignment results in SPPAS

Results on vowels of French conversational speech

Syllables segmentation

Syllabification by SPPAS

(Bigi et al. 2010)

Syllabification by SPPAS
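The syllabifier is rule-based: every vowel is taken as a syllable nucleus and the consonant cluster between two nuclei is split according to rules that depend on the classes of its consonants (Bigi et al. 2010). A minimal sketch of that principle only, with a simplistic splitting rule instead of the actual class-based rules:

    VOWELS = {"a", "e", "i", "o", "u", "y", "@", "E", "O", "2", "9"}    # toy vowel set

    def syllabify(phonemes):
        """Toy syllabification: each vowel is a nucleus; a single consonant between
        two nuclei starts the next syllable, and longer clusters are cut after the
        first consonant. The real SPPAS rules depend on consonant classes."""
        nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
        if not nuclei:
            return [phonemes]
        cuts = [0]
        for left, right in zip(nuclei, nuclei[1:]):
            n_cons = right - left - 1                 # consonants between the two nuclei
            if n_cons == 0:
                cuts.append(right)                    # V.V
            elif n_cons == 1:
                cuts.append(right - 1)                # V.CV
            else:
                cuts.append(left + 2)                 # VC.C...V
        cuts.append(len(phonemes))
        return [phonemes[b:e] for b, e in zip(cuts, cuts[1:])]

    print(syllabify(["m", "E", "R", "s", "i"]))       # "merci" -> [['m', 'E', 'R'], ['s', 'i']]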

Repetitions detection

Repetitions

(Bigi et al. 2014)
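The core of such a detection can be sketched as a search for token sequences of a source utterance that occur again in the following speech; the actual method adds selection rules (e.g. on stop-words and relevance) that are ignored in this minimal sketch:

    def detect_repetitions(source, echo, min_len=1, stopwords=frozenset()):
        """Toy detection of repeated token sequences.

        Returns the longest contiguous sequences of `source` tokens that occur
        again in `echo` and contain at least one non-stopword token.
        """
        found, i = [], 0
        while i < len(source):
            # longest source n-gram starting at i that also occurs somewhere in echo
            best = 0
            for j in range(i + min_len, len(source) + 1):
                ngram = source[i:j]
                if any(echo[k:k + len(ngram)] == ngram for k in range(len(echo))):
                    best = j
            if best and any(t not in stopwords for t in source[i:best]):
                found.append(source[i:best])
                i = best
            else:
                i += 1
        return found

    src = "le petit chat dort".split()
    eko = "ah le petit chat oui le chat dort".split()
    print(detect_repetitions(src, eko, stopwords={"le"}))
    # [['le', 'petit', 'chat'], ['dort']]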

Repetitions

Morpho-syntax

Morpho-syntax

Morpho-syntax: conversational speech vs map-task

CID (conversational speech) versus Map-task speech

Example of Morpho-syntax in CID

Example of time-aligned morpho-syntax on conversational speech

Momel and INTSINT

Momel and INTSINT

INTSINT

Example of Momel and INTSINT

Momel and INTSINT: software

(Hirst and Espesser, 1993)
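As a rough picture of the principle: Momel reduces the raw F0 curve to a few target points (time, Hz), and INTSINT re-codes each target with a symbol that is either absolute (T, M, B) or relative to the previous target (H, U, S, D, L). The sketch below only illustrates this coding idea with fixed, arbitrary thresholds; the actual algorithm optimises two speaker-specific parameters (key and range):

    import math

    def intsint_code(targets, key=190.0):
        """Toy INTSINT-style coding of Momel targets (time, f0_hz).

        T/M/B are absolute symbols around the speaker's key; H/U/S/D/L describe
        the step from the previous target. Thresholds here are arbitrary.
        """
        codes, prev = [], None
        for t, f0 in targets:
            if prev is None:
                dist = math.log2(f0 / key)            # first target: absolute symbol only
                sym = "M" if abs(dist) < 0.15 else ("T" if dist > 0 else "B")
            else:
                step = math.log2(f0 / prev)           # interval from previous target, in octaves
                if abs(step) < 0.02:
                    sym = "S"                         # same level
                elif 0 < step < 0.25:
                    sym = "U" if step < 0.1 else "H"  # small / large rise
                elif -0.25 < step < 0:
                    sym = "D" if step > -0.1 else "L" # small / large fall
                else:
                    sym = "T" if step > 0 else "B"    # extreme values: absolute symbols
            codes.append((t, sym))
            prev = f0
        return codes

    # Invented Momel targets (time in seconds, F0 in Hz) for a short utterance:
    print(intsint_code([(0.12, 185.0), (0.48, 230.0), (0.85, 205.0), (1.20, 150.0)]))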

Gestures

Gestures: Annotation methodology

(Tellier 2014)

Summary