Uncertainty-tolerant framework for multimodal corpus annotation

Brigitte Bigi

LPL - February 6th, 2015

Summary

Corpus annotation "can be defined as the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data. 'Annotation' can also refer to the end-product of this process" (Leech, 1997).

Introduction

About annotation

Manual annotation (1)

Manual annotation (2)

Automatic annotation

About Uncertainty

Uncertainty: The lack of certainty. A state of having limited knowledge where it is impossible to exactly describe the existing state, a future outcome, or more than one possible outcome.

Measurements

In metrology, physics, and engineering, the uncertainty or margin of error of a measurement is stated by giving a range of values likely to enclose the true value. This may be denoted by error bars on a graph, or by notations such as "measured value ± uncertainty" (e.g. 3.4 cm ± 0.1 cm).

Citations

"You cannot be certain about uncertainty" (Frank Knight, economist at University of Chicago)

"You are uncertain, to varying degrees, about everything in the future; much of the past is hidden from you; and there is a lot of the present about which you do not have full information. Uncertainty is everywhere and you cannot escape from it." (Dennis Lindley, Understanding Uncertainty, 2006)

Why is the marking up of uncertain annotations required?

Manual annotation. Motivation (1)

  1. Until the annotator reaches a decision, he/she is likely to leave the data un-annotated:
    • the annotator may postpone the annotation until a decision is reached, or may annotate using a question mark;
    • consequently, the data are neither fully analyzed and exploited, nor shared and archived;
    • obviously, the way in which a dataset was annotated needs to be described carefully for the data to be useful: if an annotator can label with a "?" and revisit later, the resulting labels may differ from those produced without such an option.

Manual annotation. Motivation (2 and 3)

  1. To allow other annotators to share experience about this annotation: perhaps a similar hard-to-annotate concept has already been encountered by another annotator, and if the problem is truly new, then:
    • at least this uncertain concept can be studied!
  2. In some cases, sharing experience about uncertain annotations can lead to certainty. In other cases, annotating uncertainty will highlight the fact that a concept is inherently indeterminate.

Annotation: the most common practices

Annotation in software tools

UML representation

Annotation representation (Praat / Phonedit)

Praat: Short TextGrid
Praat: Long TextGrid
Phonedit: mrk

Annotation representation (Elan)

Elan: eaf

Annotation representation: discrepancy

The suggestion that a concept can be adequately represented by this information alone is false, as is well known to anyone who has had to model both time and label.

Annotation representation: common practice (1)

Annotation representation: common practice (2)

What about explicitly modelling uncertainties of annotations?

Issues addressed: imprecision and indeterminacy

Three main issues

  1. To represent the precision/imprecision of a time value (segmenting)
  2. To assign a location to the annotation (segmenting)
  3. To assign a text to the annotation (labelling)

Representing the precision of a time value

  1. the human decision (the aims of the work, the time dedicated to the annotation, etc.);

vs

  2. with an audio or a video medium, the best precision corresponds to the duration of one frame of the media:
    • video: the duration of one picture is often 0.04 s
    • audio sampled at 20000 Hz: one frame is 0.00005 s
  3. the graphical user interface of the tool: in the best case, the precision of the "time value" corresponds to 1 pixel;
  4. the file format of the annotation tool.
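The frame durations quoted above follow directly from the media properties; a small illustrative sketch (the helper names are invented for this example, not part of any tool's API):

```python
# Sketch: the finest meaningful time precision depends on the media.

def video_frame_duration(fps: float) -> float:
    """Duration of one video picture, in seconds."""
    return 1.0 / fps

def audio_frame_duration(sample_rate: int) -> float:
    """Duration of one audio sample, in seconds."""
    return 1.0 / sample_rate

print(video_frame_duration(25))      # 0.04 s: one picture at 25 fps
print(audio_frame_duration(20000))   # 5e-05 s: one sample at 20000 Hz
```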

Praat: Long TextGrid
Elan: eaf

Proposed representation of a point in Time

Benefits

Graphical representation

UML representation
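A point in time represented as a midpoint M with a radius R can be sketched minimally as follows. The class name follows the talk's terminology, but the equality semantics shown here (two points are equal when their uncertainty ranges overlap) is an assumption about the intended behaviour, not a reproduction of any actual implementation:

```python
class TimePoint:
    """A time value M with a radius R: the true time lies in [M-R, M+R]."""

    def __init__(self, midpoint: float, radius: float = 0.0):
        if radius < 0:
            raise ValueError("radius must be non-negative")
        self.midpoint = midpoint
        self.radius = radius

    def __eq__(self, other) -> bool:
        # Assumed semantics: two uncertain points compare equal when
        # their midpoints differ by at most the sum of their radii.
        return abs(self.midpoint - other.midpoint) <= self.radius + other.radius

    def __repr__(self):
        return f"({self.midpoint}, {self.radius})"

# One video picture (0.04 s) of imprecision around 1.00 s:
p = TimePoint(1.00, 0.04)
q = TimePoint(1.03, 0.0)
print(p == q)   # True: 1.03 lies within [0.96, 1.04]
```

With a radius of 0, this degrades gracefully to the usual exact-midpoint comparison used by existing annotation formats.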

Fixing the location of an annotation

Example, from (Rohde, 2007)

  1. John(Source) handed a book to Bob(Goal). He _________.
  2. John(Source) was handing a book to Bob(Goal). He _____.

The context sentences in (1) and (2) contain two possible referents for the pronoun 'He', one that appears in subject position and fills the Source thematic role, and one that appears as the object of a prepositional phrase and fills the Goal thematic role.

UML representation
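Ambiguities like the pronoun reference above motivate keeping several candidate localizations rather than forcing a single choice. A hypothetical sketch, under the assumption that each candidate carries a score (all class and method names here are illustrative):

```python
class TimeInterval:
    """A proper interval: begin strictly before end."""
    def __init__(self, begin: float, end: float):
        if begin >= end:
            raise ValueError("a TimeInterval must be a proper interval")
        self.begin = begin
        self.end = end

class Location:
    """An uncertain location: a list of (localization, score) couples."""
    def __init__(self):
        self.candidates = []

    def append(self, localization, score: float):
        self.candidates.append((localization, score))

    def best(self):
        # Return the localization with the highest score.
        return max(self.candidates, key=lambda c: c[1])[0]

loc = Location()
loc.append(TimeInterval(0.5, 1.2), score=0.7)
loc.append(TimeInterval(0.5, 1.4), score=0.3)
print(loc.best().end)   # 1.2
```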

Assigning a label
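An uncertain label can likewise be a list of text/score couples. A compact sketch (names are illustrative, not an actual published API), using the pronoun-resolution example above:

```python
class Label:
    """An uncertain label: a list of (text, score) couples."""
    def __init__(self):
        self.texts = []

    def append(self, text: str, score: float):
        self.texts.append((text, score))

    def best_text(self) -> str:
        # Return the text with the highest score.
        return max(self.texts, key=lambda t: t[1])[0]

lab = Label()
lab.append("Bob", 0.6)    # the pronoun refers to the Goal...
lab.append("John", 0.4)   # ...or to the Source
print(lab.best_text())    # Bob
```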

Final UML Diagram

XML: an (un-realistic) example

An ambiguous label, at an ambiguous location with imprecise time localizations:
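A hypothetical XML fragment along these lines could encode such an annotation; the element and attribute names below are invented for illustration and do not reproduce any actual schema:

```xml
<annotation>
  <location>
    <interval score="0.7">
      <begin midpoint="0.50" radius="0.02"/>
      <end   midpoint="1.20" radius="0.04"/>
    </interval>
    <interval score="0.3">
      <begin midpoint="0.50" radius="0.02"/>
      <end   midpoint="1.40" radius="0.04"/>
    </interval>
  </location>
  <label>
    <text score="0.6">Bob</text>
    <text score="0.4">John</text>
  </label>
</annotation>
```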

Implementation in an Application Programming Interface

Implementing in an API
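How the pieces could fit together in an API can be sketched as follows: an Annotation couples one uncertain location with one uncertain label. This is a minimal assumed design, not a reproduction of the SPPAS implementation:

```python
class TimePoint:
    """A time value M with a radius R."""
    def __init__(self, midpoint: float, radius: float = 0.0):
        self.midpoint, self.radius = midpoint, radius

class TimeInterval:
    """An interval bounded by two TimePoints."""
    def __init__(self, begin: TimePoint, end: TimePoint):
        self.begin, self.end = begin, end

class Annotation:
    def __init__(self, location, label):
        # location: list of (localization, score) couples
        # label:    list of (text, score) couples
        self.location = location
        self.label = label

    def best_text(self) -> str:
        return max(self.label, key=lambda t: t[1])[0]

ann = Annotation(
    location=[(TimeInterval(TimePoint(0.5, 0.02), TimePoint(1.2, 0.04)), 1.0)],
    label=[("hello", 0.8), ("hollow", 0.2)],
)
print(ann.best_text())   # hello
```

A certain annotation is just the special case of one localization with radius 0 and one text with score 1.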

SPPAS Screenshot

The proposed framework into SPPAS

Future works

Conclusion, then open discussion

To sum up...

  1. at least one uncertain time localization, i.e. one of:
    • a TimePoint instance \(X=(M_X,R_X)\);
    • a TimeInterval \(X=[X^-,X^+]\), where \(X^-=(M_{X^-},R_{X^-})\) and \(X^+=(M_{X^+},R_{X^+})\) are TimePoints and \(X^- \neq X^+\); this means that a TimeInterval is a proper interval, i.e. one that is neither empty nor degenerate.
  2. an uncertain label, represented by a list of couples text/score.

Integration

Open discussion...