Corpus and annotation

Corpus linguistics is the study of language as expressed in samples (corpora) of "real world" language use.
Corpus annotation is a path to greater linguistic understanding and rigour:
- The annotation of recordings is practised by many Linguistics sub-fields, such as Phonetics, Prosody, Gesture or Discourse...
- Corpora are annotated with detailed information at various linguistic levels by means of annotation software.
- New requirements are emerging for very large multimodal corpora where manual analysis is impractical.

Multi-domain annotations

Must be time-synchronized:
- annotations need to be time-aligned in order to be useful for purposes such as qualitative or quantitative analyses
Temporal information makes it possible to describe simultaneous behaviours:
- of different levels in an utterance (e.g. prosody and locution)
- of different modalities (e.g. speech and gesture)
- of different speakers or extralinguistic events
Time-analysis of multi-level annotations can reveal linguistic structures
Annotation requires software
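
To illustrate why time-alignment matters, here is a minimal sketch of how simultaneous behaviours in two tiers could be found once every annotation carries a start and an end time. The `Interval` class, the `overlaps` function and the example labels are hypothetical and do not come from any particular annotation tool; both tiers are assumed to be sorted by start time, with no overlap inside a tier.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """One time-aligned annotation (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    label: str


def overlaps(tier_a: List[Interval],
             tier_b: List[Interval]) -> List[Tuple[Interval, Interval, float]]:
    """Return (a, b, shared duration) for every pair that overlaps in time.

    Assumes each tier is sorted by start time and has no internal overlap.
    """
    result = []
    j = 0
    for a in tier_a:
        # Intervals of tier_b that end before `a` starts can never match again.
        while j < len(tier_b) and tier_b[j].end <= a.start:
            j += 1
        k = j
        while k < len(tier_b) and tier_b[k].start < a.end:
            b = tier_b[k]
            shared = min(a.end, b.end) - max(a.start, b.start)
            if shared > 0:
                result.append((a, b, shared))
            k += 1
    return result


# Usage: which tokens co-occur with a gesture? (invented example data)
speech = [Interval(0.0, 1.5, "bonjour"), Interval(2.0, 3.0, "oui")]
gesture = [Interval(1.0, 2.5, "nod")]
for a, b, shared in overlaps(speech, gesture):
    print(a.label, b.label, shared)
```

Without the time stamps, none of these pairings could be computed; with them, the same function applies to any two tiers (levels, modalities or speakers).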

Annotation software

Manual annotation
Automatic annotation
- The current state of the art in Computational Linguistics allows many annotation tasks to be semi- or fully automated.
But...
1. Despite the advances achieved in annotating and analysing language, many annotation frameworks and models for the construction and analysis of multimodal data continue to rely on "low-tech" or manual technologies.
2. Interoperability: when such multi-layer corpora are created with existing task-specific annotation tools, a new problem arises: the output formats of these tools can differ considerably.
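
The interoperability problem can be sketched as follows. The two input formats below are invented for illustration (they are not the formats of any real tool): one stores `start,end,label` in seconds, the other stores `label<TAB>start_ms<TAB>duration_ms`. A small reader per format maps both onto one shared representation, a list of `(start, end, label)` tuples in seconds, so that the tiers can then be analysed together.

```python
import csv
import io
from typing import List, Tuple

Annotation = Tuple[float, float, str]  # (start in s, end in s, label)


def read_tool_a(text: str) -> List[Annotation]:
    """Hypothetical format A: comma-separated 'start,end,label', in seconds."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if row:  # skip empty lines
            start, end, label = row
            rows.append((float(start), float(end), label))
    return rows


def read_tool_b(text: str) -> List[Annotation]:
    """Hypothetical format B: 'label<TAB>start_ms<TAB>duration_ms'."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        label, start_ms, duration_ms = line.split("\t")
        start = int(start_ms) / 1000
        rows.append((start, start + int(duration_ms) / 1000, label))
    return rows


def merge(*tiers: List[Annotation]) -> List[Annotation]:
    """Put all annotations on one timeline, ordered by start time."""
    return sorted(a for tier in tiers for a in tier)


# Usage with invented data
tokens = read_tool_a("0.0,1.5,bonjour\n2.0,3.0,oui\n")
gestures = read_tool_b("nod\t1000\t1500\n")
print(merge(tokens, gestures))
```

The design choice is to convert each tool's output once, at the boundary, rather than teaching every analysis script about every format.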

A methodology for annotation...

Annotation is not an end in itself - it is a basis for further analysis
Handling of 'Big data', consisting of large quantities of audio, audio-visual and other multimodal recordings, is beyond the capabilities of purely manual annotation and traditional manual statistical analysis and plotting
Two phases of automation are needed:
- The Automatic Annotator
- The Automatic Analyzer

Corpus annotation: Manual vs. Automatic

The wide range of annotations, from aligned transcripts to gaze to reference to gestural form, is costly to collect and to annotate, both in terms of time and money.
Each annotation that can be done automatically must be done automatically!
Why? Because revising is faster and easier than annotating... if the automatic system is "good enough".

Example of automatic time-alignment vs manual time-alignment

The Automatic Annotator (an example)

The Automatic Annotator time-aligns descriptive data for Tiers such as Phonetics, Prosody, Syntax, Discourse with the recorded signal:

Example of multi-level annotations: only the orthographic transcription is manual

The Automatic Analyzer (an example)

The output of the Automatic Annotator is usually manually post-edited before being input to the Automatic Analyzer
The Automatic Analyzer inputs time-aligned data and outputs a report
- about annotation Labels
- sequences of annotation Labels in annotation Tiers
- relations between Labels in sets of annotation Tiers
- with statistics
- with visualisations
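
As a rough sketch of what such a report could contain, the function below computes three simple statistics on one tier: how often each Label occurs, how much time each Label covers, and which Labels follow each other. The function name `label_report` and the tuple layout `(start, end, label)` are assumptions made for this example; they are not the interface of an existing analyzer.

```python
from collections import Counter
from typing import Dict, List, Tuple

Annotation = Tuple[float, float, str]  # (start in s, end in s, label)


def label_report(tier: List[Annotation]) -> Dict[str, Counter]:
    """Summarise one tier (assumed sorted by start time)."""
    labels = [label for _, _, label in tier]

    # Number of occurrences of each label
    counts = Counter(labels)

    # Total duration covered by each label, in seconds
    durations: Counter = Counter()
    for start, end, label in tier:
        durations[label] += end - start

    # Sequences of two consecutive labels
    bigrams = Counter(zip(labels, labels[1:]))

    return {"counts": counts, "durations": durations, "bigrams": bigrams}


# Usage with invented data
tier = [(0.0, 0.5, "a"), (0.5, 1.0, "b"), (1.0, 1.5, "a"), (1.5, 2.0, "b")]
report = label_report(tier)
print(report["counts"].most_common())
print(report["bigrams"].most_common())
```

Relations between Labels in several Tiers, and visualisations, would build on the same time-aligned input.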

Getting/Sharing a corpus

Maybe there is already a corpus you can use?
Data repositories: depending on the research discipline, data can often be deposited in one or more data centers (or repositories) that will provide access to the data. These repositories may have specific requirements:
- subject/research domain
- data re-use and access
- file format and data structure, and
- metadata.
SLDR:
- Speech and Language Data Repository
- http://sldr.org
- gathering and sharing language data
- long-term preservation by CINES, an institutional archive site.

Corpora - Examples (created at LPL)

CID - Corpus of Interactional Data
GrenelleII corpus:
- http://sldr.org/sldr000744
Aix MapTask:
- http://sldr.org/sldr000732
- http://sldr.org/sldr000875
DVD corpus:
- http://sldr.org/sldr000891

Screenshots of 4 corpora (left to right): CID, GrenelleII, Aix MapTask, DVD

CID - Corpus of Interactional Data

Face-to-face conversations in French
Created by Roxane Bertrand and Béatrice Priego-Valverde
8 semi-guided dialogs (110,000 words)
Recorded in 2003 and 2005
Available at:
- http://sldr.org/sldr000027/
- http://sldr.org/sldr000720/
Corpus description: (Bertrand et al. 2008)
Multimodal annotations: (Blache et al. 2010)

CID - a pioneer

Neither an annotation framework nor tools were available
Too much data to annotate manually at all levels!

Then...

an annotation scheme was developed for each annotation level
the framework presented here was worked out
automatic tools were adapted or designed
a multi-level request system was designed

... annotated either by LPL, LLING or LIMSI.

CID - Current annotations (1)

Enriched orthographic transcription (manual)
- time-aligned at the IPU level (automatic)
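
For readers unfamiliar with the term, IPUs (inter-pausal units) are stretches of speech bounded by silent pauses. The sketch below shows one simple way such units could be located, assuming a speech/silence decision is already available for each frame of the signal. The function name, the frame duration and the 200 ms minimum pause are assumptions chosen for illustration; they are not the settings used for CID.

```python
from typing import List, Sequence, Tuple


def find_ipus(is_speech: Sequence[bool],
              frame_dur: float = 0.01,
              min_pause: float = 0.2) -> List[Tuple[float, float]]:
    """Group speech frames into IPUs, returned as (start, end) in seconds.

    A silence shorter than `min_pause` stays inside the current IPU;
    a silence of at least `min_pause` closes it.
    """
    min_pause_frames = round(min_pause / frame_dur)
    ipus = []
    start = None        # first frame of the IPU being built
    last_speech = None  # most recent speech frame seen

    for i, speech in enumerate(is_speech):
        if not speech:
            continue
        if start is None:
            start = i
        elif i - last_speech - 1 >= min_pause_frames:
            # The silence just crossed was long enough: close the IPU.
            ipus.append((start, last_speech + 1))
            start = i
        last_speech = i

    if start is not None:
        ipus.append((start, last_speech + 1))

    return [(s * frame_dur, e * frame_dur) for s, e in ipus]


# Usage: 10 frames of 0.25 s; a pause needs at least 2 silent frames
frames = [False, True, True, False, True, False, False, False, True, True]
print(find_ipus(frames, frame_dur=0.25, min_pause=0.5))
```

In practice the frame-level speech/silence decision would itself come from the audio signal (for example from its energy), and the resulting boundaries would still be checked by hand.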

CID - Current annotations (2)

Time-aligned phonemes and tokens and events like noises, laughter (automatic)
Time-aligned syllables (automatic)

CID - Current annotations (3)

Prosodic contours (manual)
Momel - Modelization of melody (automatic)
INTSINT - INternational Transcription System for INTonation (automatic)

CID - Current annotations (4)

Morpho-syntax and syntax time-aligned at the token level (automatic)
Time-aligned lemmas (automatic)

CID - Current annotations (5)

Dysfluencies (manual)
Discourse and interaction (manual)
Other- and Self- Repetitions (semi-automatic)

Figure: examples labelled AB, CM and AB-CM

CID - Current annotations (6)

Gestures: postural, face, hands (manual)

CID - to summarize

8 face-to-face conversations
A very (very very) large number of time-aligned annotations
An annotation methodology and annotation tools/software
More than 80 publications in 2013

GrenelleII

Video downloaded from an FTP server (after authorization): a poor-quality FLV file
Audio extracted from the video

GrenelleII: annotations

Enriched orthographic transcription (manual)
- time-aligned at the utterance level (automatic)
Time-aligned phonemes, tokens and events (automatic)
Time-aligned syllables (automatic)
Prosodic contours and intonation (manual)
Morpho-syntax time-aligned at the token level (automatic)
Self-repetitions (semi-automatic)
Interruptions (manual)

Aix Map-Task

A French Map-Task
Available at:
- http://sldr.org/sldr000732
- http://sldr.org/sldr000875
8 maps for each pair of speakers
2 recording sessions:
- 2002: Remote condition, 4 dialogs, audio
- 2013: Face-to-face condition, 5 dialogs, audio + video
- the same maps for both sessions
(Bard et al. 2013), (Gorisch et al. 2014)

Aix Map-Task: Screenshot

Aix Map-Task: Annotations

Enriched orthographic transcription (manual)
- time-aligned at the utterance level (manual in 2002 / automatic in 2013)
Time-aligned phonemes and tokens and events (automatic)
Time-aligned syllables (automatic)
Feedback (semi-automatic)

Why a rigorous methodology?

Quick and dirty annotation is possible, unless you expect to:
1. use automatic annotation software or tools
2. produce reliable annotations
3. perform complex analysis
4. re-use annotations
5. share the corpus and its annotations
