The annotation workflow

Information

Limitation

Removing the human from the process... a nod to those who know!

Which annotations (in general)?

A very large number of dimensions have been annotated in the past on mono and multimodal corpora. To quote only a few, some frequent speech or language based annotations are speech transcript, segmentation into words, utterances, turns, or topical episodes, labeling of dialogue acts, and summaries; among video-based ones are gesture, posture, facial expression [...]. (Popescu-Belis, 2010)

Which annotations (in this tutorial)?

In this tutorial, we will report on:

  1. IPUs segmentation (automatic)
  2. Speech transcript (manual)
  3. Phonemes and words segmentation (automatic)
  4. Syllables segmentation (automatic)
  5. Repetitions detection (automatic)
  6. Morpho-syntax (automatic)
  7. Momel and INTSINT (automatic)
  8. Gestures (manual)

The annotation workflow: legend

Legend of the annotation workflow

The annotation workflow

The annotation workflow

The main principle is...

Garbage in, Garbage out.

Record

Capturing and recording multimodal data

The capture of multimodal corpora requires complex settings such as instrumented lecture and meeting rooms, containing capture devices for each of the modalities that are intended to be recorded, but also, most challengingly, requiring hardware and software for digitizing and synchronizing the acquired signals. (Popescu-Belis, 2010)

Recording Audio and Video

Recording Audio: some advice

Of course, record at 44100 Hz.
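A quick way to check that an existing recording matches this advice is to read the WAV header, for instance with Python's standard wave module. A minimal sketch (the file name is hypothetical):

    import wave

    # Hypothetical file name; adapt to your own recording.
    with wave.open("speaker1_track.wav", "rb") as w:
        print("channels:    ", w.getnchannels())              # e.g. one channel per speaker
        print("sample rate: ", w.getframerate(), "Hz")        # 44100 Hz recommended
        print("sample width:", 8 * w.getsampwidth(), "bits")  # 16 bits is a common choice
        print("duration:    ", w.getnframes() / float(w.getframerate()), "s")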

Recording Video: some advice

Synchronizing

A short list of software we already tested and checked:

IPUs Segmentation

IPUs Segmentation: definition

IPUs Segmentation: software

Example of IPUs segmentation: silences are annotated with # and speech intervals are filled with the IPU number
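The principle behind IPU segmentation can be sketched as energy-based silence detection: frames whose energy stays under a threshold for at least a minimum duration become a silence ('#'), and the remaining stretches of speech are numbered as IPUs. The following is only an illustration of that idea, not the SPPAS implementation; threshold and duration values are arbitrary:

    def ipu_segmentation(energies, threshold=0.01, min_sil_dur=0.2, frame_dur=0.01):
        """Toy IPU segmentation from per-frame RMS energies (one value per 10 ms frame).

        Returns (start, end, label) tuples: silences longer than min_sil_dur are
        labelled '#' and speech stretches are numbered ipu_1, ipu_2, ...
        """
        # 1) classify every frame as silent or not
        is_sil = [e < threshold for e in energies]

        # 2) group consecutive identical frames into runs (start_frame, end_frame, silent?)
        runs, start = [], 0
        for i in range(1, len(is_sil) + 1):
            if i == len(is_sil) or is_sil[i] != is_sil[start]:
                runs.append((start, i, is_sil[start]))
                start = i

        # 3) only silences longer than min_sil_dur are kept as '#' boundaries;
        #    shorter pauses are merged into the surrounding IPU
        min_frames = int(min_sil_dur / frame_dur)
        intervals, n = [], 0
        for begin, end, silent in runs:
            if silent and (end - begin) >= min_frames:
                intervals.append((begin * frame_dur, end * frame_dur, "#"))
            elif intervals and intervals[-1][2] != "#":
                s, _, lab = intervals[-1]
                intervals[-1] = (s, end * frame_dur, lab)     # extend the current IPU
            else:
                n += 1
                intervals.append((begin * frame_dur, end * frame_dur, "ipu_%d" % n))
        return intervals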

Orthographic Transcription


Orthographic Transcription for spontaneous speech

Enriched Orthographic Transcription


Enriched Orthographic Transcription: convention

Train yourself first to transcribe and to use the annotation software!

SPPAS transcription convention

Transcription example 1 (Conversational speech)

donc + i- i(l) prend la è- recette et tout bon i(l) vé- i(l) dit bon [okay, k]

Transcription example 2 (Conversational speech)

ah mais justement c'était pour vous vendre bla bla bla bl(a) le mec i(l) te l'a emboucané + en plus i(l) lu(i) a [acheté,acheuté] le truc et le mec il est parti j(e) dis putain le mec i(l) voulait

Transcription example 3 (GrenelleII)

euh les apiculteurs + et notamment b- on n(e) sait pas très bien + quelle est la cause de mortalité des abeilles m(ais) enfin il y a quand même + euh peut-êt(r)e des attaques systémiques

Enriched Orthographic Transcription of 3 corpora

http://sldr.org/sldr000786

Orthographic Transcription... to sum up

The automatic systems must be adapted to deal with EOT

Phonemes/Tokens time-alignment

Phonemes and Tokens time-alignment

Tokenization

Tokenization in SPPAS

The main steps of the text normalization proposed in SPPAS are:

Tokenization in SPPAS

This is + hum... an enrich(ed) transcription {loud} number 1!

(Bigi 2011)
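To give an idea of what such a normalization has to chain together, here is a minimal sketch (not the SPPAS tokenizer) applied to the example above: comments in braces and the short-pause symbol '+' are dropped, the elided part in parentheses is removed to keep the form actually pronounced, punctuation is stripped and the digit is converted to a word:

    import re

    def normalize_eot(utterance):
        """Toy normalization of an enriched orthographic transcription (EOT).

        Produces the 'as pronounced' token sequence; this is only a sketch of the
        kind of operations a text normalizer must chain, not the SPPAS tokenizer.
        """
        numbers = {"1": "one", "2": "two", "3": "three"}     # tiny demo lexicon
        text = re.sub(r"\{[^}]*\}", " ", utterance)          # drop {comments}
        text = text.replace("+", " ")                        # drop short-pause marks
        text = re.sub(r"\([^)]*\)", "", text)                # drop elided parts: enrich(ed) -> enrich
        text = re.sub(r"[!?.,;:]", " ", text)                # strip punctuation
        return [numbers.get(t, t.lower()) for t in text.split()]

    print(normalize_eot("This is + hum... an enrich(ed) transcription {loud} number 1!"))
    # ['this', 'is', 'hum', 'an', 'enrich', 'transcription', 'number', 'one']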

Phonetization

Converting written text into actual sounds, for any language, causes several problems that have their origins in the relative lack of correspondence between the spelling of lexical items and their sound contents.

Phonetization in SPPAS

(Bigi 2013)

Phonetization in SPPAS

By convention, spaces separate words, dots separate the phones of a word, and pipes separate the phonetic variants of a word.
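A hypothetical illustration of this convention (the words and SAMPA-like symbols below are invented for the example, not taken from the slides), parsed into words, variants and phones:

    # Hypothetical phonetization string following the convention
    # (spaces = word boundaries, dots = phone boundaries, pipes = pronunciation variants):
    phonetization = "D.@|D.i: b.I.g k.{.t"                      # "the big cat"

    words = []
    for word in phonetization.split():                          # words are space-separated
        variants = [v.split(".") for v in word.split("|")]      # variants on '|', phones on '.'
        words.append(variants)

    print(words)
    # [[['D', '@'], ['D', 'i:']], [['b', 'I', 'g']], [['k', '{', 't']]]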

Impact of the Orthographic Transcription on automatic phonetization

Alignment

Time-alignment process

Manual alignment has been reported to take between 11 and 30 seconds per phoneme. (Leung and Zue, 1984)

How to perform Speech Segmentation?

  1. Many freely available toolboxes, i.e. Speech Recognition Engines that can perform Speech Segmentation
    • HTK - Hidden Markov Model Toolkit
    • CMU Sphinx
    • Open Source Large Vocabulary CSR Engine Julius
    • ...

How to perform Speech Segmentation?

  1. Wrappers for such toolboxes:
    • Prosodylab-Aligner: python+HTK
    • P2FA: python+HTK
    • ...
  2. Web-services:
    • WebMAUS
    • Train&Align
    • ...

How to perform Speech Segmentation?

SPPAS (python+Julius), available for English, French, Italian, Spanish, Catalan, Polish, Japanese, Mandarin Chinese, Taiwanese, Cantonese
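Whatever engine is used, forced alignment essentially returns a list of labelled time intervals for phonemes and tokens. A minimal sketch (assuming the aligner already produced (start, end, label) tuples; the time values and tier name below are invented) that writes such a result as a Praat TextGrid interval tier:

    def write_textgrid(intervals, tier_name, path):
        """Write a single IntervalTier TextGrid from (start, end, label) tuples."""
        xmin, xmax = intervals[0][0], intervals[-1][1]
        lines = [
            'File type = "ooTextFile"',
            'Object class = "TextGrid"',
            '',
            'xmin = %f' % xmin,
            'xmax = %f' % xmax,
            'tiers? <exists>',
            'size = 1',
            'item []:',
            '    item [1]:',
            '        class = "IntervalTier"',
            '        name = "%s"' % tier_name,
            '        xmin = %f' % xmin,
            '        xmax = %f' % xmax,
            '        intervals: size = %d' % len(intervals),
        ]
        for i, (start, end, label) in enumerate(intervals, 1):
            lines += [
                '        intervals [%d]:' % i,
                '            xmin = %f' % start,
                '            xmax = %f' % end,
                '            text = "%s"' % label,
            ]
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")

    # Invented alignment of a silence followed by French "chat" /S a/:
    write_textgrid([(0.00, 0.35, "#"), (0.35, 0.47, "S"), (0.47, 0.61, "a")],
                   "PhonAlign", "chat-palign.TextGrid")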

Alignment results in SPPAS

Results on vowels of French conversational speech

Syllables segmentation

Syllabification by SPPAS

(Bigi et al. 2010)

Syllabification by SPPAS
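The syllabifier is rule-based: every vowel is taken as a syllable nucleus and the consonant cluster between two nuclei is split according to rules that depend on the classes of its consonants (Bigi et al. 2010). A minimal sketch of that principle only, with a simplistic splitting rule instead of the actual class-based rules:

    VOWELS = {"a", "e", "i", "o", "u", "y", "@", "E", "O", "2", "9"}    # toy vowel set

    def syllabify(phonemes):
        """Toy syllabification: each vowel is a nucleus; a single consonant between
        two nuclei starts the next syllable, and longer clusters are cut after the
        first consonant. The real SPPAS rules depend on consonant classes."""
        nuclei = [i for i, p in enumerate(phonemes) if p in VOWELS]
        if not nuclei:
            return [phonemes]
        cuts = [0]
        for left, right in zip(nuclei, nuclei[1:]):
            n_cons = right - left - 1                 # consonants between the two nuclei
            if n_cons == 0:
                cuts.append(right)                    # V.V
            elif n_cons == 1:
                cuts.append(right - 1)                # V.CV
            else:
                cuts.append(left + 2)                 # VC.C...V
        cuts.append(len(phonemes))
        return [phonemes[b:e] for b, e in zip(cuts, cuts[1:])]

    print(syllabify(["m", "E", "R", "s", "i"]))       # "merci" -> [['m', 'E', 'R'], ['s', 'i']]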

Repetitions detection

Repetitions

(Bigi et al. 2014)
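The core of such a detection can be sketched as a search for token sequences of a source utterance that occur again in the following speech; the actual method adds selection rules (e.g. on stop-words and relevance) that are ignored in this minimal sketch:

    def detect_repetitions(source, echo, min_len=1, stopwords=frozenset()):
        """Toy detection of repeated token sequences.

        Returns the longest contiguous sequences of `source` tokens that occur
        again in `echo` and contain at least one non-stopword token.
        """
        found, i = [], 0
        while i < len(source):
            # longest source n-gram starting at i that also occurs somewhere in echo
            best = 0
            for j in range(i + min_len, len(source) + 1):
                ngram = source[i:j]
                if any(echo[k:k + len(ngram)] == ngram for k in range(len(echo))):
                    best = j
            if best and any(t not in stopwords for t in source[i:best]):
                found.append(source[i:best])
                i = best
            else:
                i += 1
        return found

    src = "le petit chat dort".split()
    eko = "ah le petit chat oui le chat dort".split()
    print(detect_repetitions(src, eko, stopwords={"le"}))
    # [['le', 'petit', 'chat'], ['dort']]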

Repetitions

Morpho-syntax

Morpho-syntax

Morpho-syntax: conversational speech vs map-task

CID (conversational speech) versus Map-task speech

Example of Morpho-syntax in CID

Example of time-aligned morpho-syntax on conversational speech

Momel and INTSINT

Momel and INTSINT

INTSINT

Example of Momel and INTSINT

Momel and INTSINT: software

(Hirst and Espesser, 1993)
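As a rough picture of the principle: Momel reduces the raw F0 curve to a few target points (time, Hz), and INTSINT re-codes each target with a symbol that is either absolute (T, M, B) or relative to the previous target (H, U, S, D, L). The sketch below only illustrates this coding idea with fixed, arbitrary thresholds; the actual algorithm optimises two speaker-specific parameters (key and range):

    import math

    def intsint_code(targets, key=190.0):
        """Toy INTSINT-style coding of Momel targets (time, f0_hz).

        T/M/B are absolute symbols around the speaker's key; H/U/S/D/L describe
        the step from the previous target. Thresholds here are arbitrary.
        """
        codes, prev = [], None
        for t, f0 in targets:
            if prev is None:
                dist = math.log2(f0 / key)            # first target: absolute symbol only
                sym = "M" if abs(dist) < 0.15 else ("T" if dist > 0 else "B")
            else:
                step = math.log2(f0 / prev)           # interval from previous target, in octaves
                if abs(step) < 0.02:
                    sym = "S"                         # same level
                elif 0 < step < 0.25:
                    sym = "U" if step < 0.1 else "H"  # small / large rise
                elif -0.25 < step < 0:
                    sym = "D" if step > -0.1 else "L" # small / large fall
                else:
                    sym = "T" if step > 0 else "B"    # extreme values: absolute symbols
            codes.append((t, sym))
            prev = f0
        return codes

    # Invented Momel targets (time in seconds, F0 in Hz) for a short utterance:
    print(intsint_code([(0.12, 185.0), (0.48, 230.0), (0.85, 205.0), (1.20, 150.0)]))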

Gestures

Gestures: Annotation methodology

(Tellier 2014)

Summary