Corpus and annotation

Corpus linguistics is the study of language as expressed in samples (corpora) of "real world" text and speech.
Corpus annotation is a path to greater linguistic understanding and rigour:
- The annotation of recordings is practised by many Linguistics sub-fields, such as Phonetics, Prosody, Gesture or Discourse...
- Corpora are annotated with detailed information at various linguistic levels using annotation software.
- New requirements are emerging for very large multimodal corpora where manual analysis is impractical.

Getting a corpus

Maybe there is already a corpus you can use?
Data repositories: depending on the research discipline, data can often be deposited in one or more data centers (or repositories) that provide access to the data. These repositories may have specific requirements regarding:
- subject/research domain
- data re-use and access
- file format and data structure, and
- metadata.
SLDR:
- http://sldr.org
- Speech and Language Data Repository
- gathering and sharing language data
- long-term preservation by CINES, an institutional archive site.

Corpus annotation: Manual vs. Automatic

The wide range of annotations, from aligned transcripts to gaze to reference to gestural form, is costly to collect and to annotate, both in terms of time and money.
Each annotation that can be done automatically must be done automatically!
Why? Because revising is faster and easier than annotating... if the automatic system is "good enough".

Example of automatic time-alignment vs manual time-alignment

Annotation procedure

Emerging requirements include (Bigbee et al. 2001):
- handling time-based media other than audio/video,
- methods for empirically evaluating tools,
- tool interfaces which are themselves multi-modal,
- support for automated aspects of annotation and time-tagging individual words from speech.

Example of automatic time-alignment, from a speech file and its orthographic transcription

Multi-domain annotations

Must be time-synchronized:
- annotations need to be time-aligned in order to be useful for purposes such as qualitative or quantitative analyses
Temporal information makes it possible to describe simultaneous behaviours:
- of different levels in an utterance (e.g. prosody and locution)
- of different modalities (e.g. speech and gesture)
- of different speakers or extralinguistic events
Time-analysis of multi-level annotations can reveal linguistic structures

Annotation tools

Manual annotation
Automatic annotation
- The current state of the art in Computational Linguistics allows many annotation tasks to be semi- or fully automated.
- Despite the advances achieved in annotating and analysing language, many annotation frameworks and/or models for the construction and analysis of multimodal data continue to rely on "low-tech" and/or manual technologies.

Annotation: data and tool relations

It is unfortunate that there is still today an enormous gap between the community of linguists and phoneticians on the one hand and that of engineers and computer scientists on the other. Each community needs the other and, in an ideal world, linguists would provide theoretical frameworks and data which are useful to engineers, while engineers would provide tools which are useful to linguists. The exchange between the two communities, however, is in practice very slow. (D.J. Hirst 2006: 198)

Annotation tools limitation

Interoperability (Chiarcos et al. 2008):

When multiple annotations are integrated into a single data set, inter-relationships between the annotations can be explored both qualitatively (by using database queries that combine levels) and quantitatively (by running statistical analyses or machine learning algorithms).

However, when such multi-layer corpora are to be created with existing task-specific annotation tools, a new problem arises: the output formats of the annotation tools can differ considerably.

Tools for the analysis of annotations

Inter-labeller agreements
Extraction of only the annotations the linguist is interested in
Descriptive statistics
Graphics
...
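As an illustration of the first item, agreement between two labellers can be quantified with Cohen's kappa. The slides do not name a particular measure, so the choice of kappa, the function name `cohen_kappa` and the tone labels below are assumptions made for this sketch only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two labellers who annotated the same items.

    labels_a and labels_b are equal-length sequences of category labels,
    one label per item, in the same item order.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both labellers must annotate the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the two labellers agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each labeller's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both labellers used one single identical category throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: two labellers assigning High/Low tones to six syllables.
labeller_1 = ["H", "L", "H", "H", "L", "L"]
labeller_2 = ["H", "L", "L", "H", "L", "H"]
kappa = cohen_kappa(labeller_1, labeller_2)
```

Kappa corrects raw agreement for chance, so it is only meaningful when both labellers annotated the same units; for time-aligned tiers this means the intervals have to be matched first.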

With the help of multimodal corpora searches, the investigation of the temporal alignment (synchronized co-occurrence, overlap or consecutivity) of gesture and talk has become possible. (Abuczki and Baiat Ghazaleh, 2013)
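The three temporal relations named in this quotation can be tested directly on time-stamps. The sketch below is only one possible reading: the function name `temporal_relation`, the `tolerance` margin and the example intervals are assumptions, not taken from the cited work.

```python
def temporal_relation(a, b, tolerance=0.02):
    """Classify how interval a = (start, end) relates in time to interval b.

    Times are in seconds. tolerance is an assumed margin below which two
    time-stamps are treated as equal.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    # Both boundaries coincide: synchronized co-occurrence.
    if abs(a_start - b_start) <= tolerance and abs(a_end - b_end) <= tolerance:
        return "co-occurrence"
    # Length of the shared stretch; negative when there is a gap.
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared > tolerance:
        return "overlap"
    if shared >= -tolerance:
        # One interval ends where the other begins.
        return "consecutive"
    return "disjoint"


# Invented example: a gesture stroke and a word on two different tiers.
gesture = (1.20, 1.80)
word = (1.50, 2.10)
relation = temporal_relation(gesture, word)
```

Intervals shorter than the tolerance cannot be classified reliably with this scheme, so the margin should be chosen well below the shortest unit of interest.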

Automatic annotation analysis

Label: an Event triple < Start, End, Text >, where start and end are time-stamps denoting an Interval
- Analysis: positions; durations (end - start); text extraction; identification of Label text (e.g. pause vs. interpausal sound)
Tier: an ordered time sequence of contiguous Labels < l1, ..., ln >, perhaps with gaps
- Analysis: min/max/mean/median/sd of interval durations; nPVI; slope (acceleration/deceleration)
- Note: relevance for metrical phonology
Tier set: a set of Tiers < t1, ..., tm >, whose intervals may coincide or overlap
- Analysis: relations between different tiers, e.g. between syllables and tones
- Note: relevance for autosegmental phonology
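The Label and Tier definitions above map naturally onto a small data structure. The following is a minimal sketch, not the implementation of any particular tool: the class and function names, the "#" pause symbol and the example syllables are all assumptions. The nPVI is computed with its usual definition, 100 times the mean of |d_k - d_k+1| / ((d_k + d_k+1) / 2) over successive durations.

```python
from dataclasses import dataclass
from statistics import mean, median, stdev


@dataclass(frozen=True)
class Label:
    """An event triple <start, end, text>; times are in seconds."""
    start: float
    end: float
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def duration_stats(tier):
    """Descriptive statistics over the interval durations of one tier."""
    d = [label.duration for label in tier]
    return {"min": min(d), "max": max(d), "mean": mean(d),
            "median": median(d), "sd": stdev(d)}


def npvi(durations):
    """Normalised Pairwise Variability Index over successive durations."""
    pairs = list(zip(durations, durations[1:]))
    if not pairs:
        raise ValueError("nPVI needs at least two durations")
    return 100 * mean([abs(a - b) / ((a + b) / 2) for a, b in pairs])


# A tier is modelled here as a time-ordered list of Labels (invented data);
# "#" is an assumed symbol for a pause.
tier = [
    Label(0.0, 0.2, "pa"),
    Label(0.2, 0.6, "ta"),
    Label(0.6, 0.8, "#"),
    Label(0.8, 1.0, "ka"),
]
# Identify Labels by their text, then analyse only the non-pause intervals.
speech = [label for label in tier if label.text != "#"]
stats = duration_stats(speech)
rhythm = npvi([label.duration for label in speech])
```

Note that dropping the pause makes "ta" and "ka" adjacent in the nPVI computation; whether pairs should be formed across pauses is an analysis decision that this sketch does not settle.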

Automatic pairwise duration difference plot

Wagner Quadrants plot (Genre G: Fiction (General))

Automatic item duration comparison

A methodology for annotation...

Annotation may be manual, automatic, or semi-automatic (i.e. automatic with manual corrections)
Annotation is not an end in itself - it is a basis for further analysis
Handling of 'Big data', consisting of large quantities of audio, audio-visual and other multimodal recordings, is beyond the capabilities of purely manual annotation and traditional manual statistical analysis and plotting
Two phases of automation are needed:
- The Automatic Annotator
- The Automatic Analyzer

The Automatic Annotator

The Automatic Annotator time-aligns descriptive data for Tiers such as Phonetics, Prosody, Syntax, Discourse with the recorded signal:

Example of multi-level annotations: only the orthographic transcription is manual

The Automatic Analyzer

The output of the Automatic Annotator is usually manually post-edited before being input to the Automatic Analyzer
The Automatic Analyzer inputs time-aligned data and outputs a report
- about annotation Labels
- sequences of annotation Labels in annotation Tiers
- relations between Labels in sets of annotation Tiers
- with statistics
- with visualisations

A methodology for annotation...

The expected result is time-aligned data for all annotated levels, such as Phonetics, Prosody, Gestures, Syntax, Discourse, ...

Example of multi-level annotations of GrenelleII corpus

Why a rigorous methodology?

Quick and dirty annotation is possible, unless you expect to:
1. produce reliable annotations
2. perform complex analysis
3. re-use annotations
4. share the corpus and its annotations

Summary

Introduction
Selection of annotation software
Corpus development methodology
Momel and INTSINT
SPPAS
Time Group Analyzer
Conclusion and references
