Summary

Introduction, context and motivations
Issues addressed: imprecision and indeterminacy
Implementation in an API
Conclusion then open discussion

Corpus annotation "can be defined as the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data. 'Annotation' can also refer to the end-product of this process" (Leech, 1997).

About annotation

Tasks:
- Segmenting
- Labelling
It is commonly accepted in Corpus Linguistics that annotating is an inherently ongoing process.
Manual vs Automatic

Manual annotation (1)

Linguistic data are annotated several times by one or several annotators; each one annotates according to his/her knowledge, beliefs and uncertainty.

Manual annotation (2)

On the one hand, some annotations can be assigned without any doubt, which clearly indicates that some properties of the field do exist.
On the other hand, it is the fundamental nature of research to deal with hard-to-annotate phenomena that cause indeterminacy during the annotation process.
- Hard to annotate?

Automatic annotation

Most automatic annotation systems include a decision-making procedure to deliver a final result, based on scores assigned to a set of possible annotations.
The score often corresponds to the reliability of a solution according to a model.
Close scores could be interpreted as an uncertainty of the decision-making procedure according to a given model.
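
As an illustration of the "close scores" idea, here is a minimal sketch that measures the gap between the two best-scored candidates. The function names, the example scores and the 0.1 threshold are assumptions chosen for this example; they do not come from any particular annotation system.

```python
def decision_margin(scores):
    """Return (best_label, margin), where margin is the gap between the two best scores."""
    if len(scores) < 2:
        raise ValueError("at least two candidate annotations are needed")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    return best_label, best_score - second_score


def is_uncertain(scores, threshold=0.1):
    """Flag a decision as uncertain when the two best scores are close.

    The threshold is a hypothetical value; it would have to be tuned per model.
    """
    _, margin = decision_margin(scores)
    return margin < threshold


# Hypothetical scores assigned by a model to three candidate labels.
confident = {"a": 0.90, "e": 0.07, "o": 0.03}
close = {"a": 0.48, "e": 0.45, "o": 0.07}

print(decision_margin(confident), is_uncertain(confident))
print(decision_margin(close), is_uncertain(close))
```

Both decisions return the same label, but only the second one would be marked as uncertain, which is exactly the information lost when a system delivers only its final result.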

About Uncertainty

Uncertainty: The lack of certainty. A state of having limited knowledge where it is impossible to exactly describe the existing state, a future outcome, or more than one possible outcome.

Uncertainty is also related to:
- the measurement of uncertainty
- the risk: a state of uncertainty where some possible outcomes have an undesired effect or significant loss
- the measurement of the risk
Quantitative uses of the terms uncertainty and risk are fairly consistent across fields such as probability theory, information theory...

Measurements

In metrology, physics, and engineering, the uncertainty or margin of error of a measurement is stated by giving a range of values likely to enclose the true value. This may be denoted by error bars on a graph, or by a notation such as "measured value ± uncertainty".

Citations

"You cannot be certain about uncertainty" (Frank Knight, economist at University of Chicago)

"You are uncertain, to varying degrees, about everything in the future; much of the past is hidden from you; and there is a lot of the present about which you do not have full information. Uncertainty is everywhere and you cannot escape from it." (Dennis Lindley, Understanding Uncertainty, 2006)

Why is the marking up of uncertain annotations required?

If some annotation does not easily fit into an existing theory, it is likely to be something linguistically interesting and worthy of attention.
For automatic annotations, modelling variances and indeterminacy of an automatic system will reduce the ambiguity in interpreting and applying the resulting annotations.
For manual annotations, keeping information about the annotator's uncertainty is motivated by three important factors:

Manual annotation. Motivation (1)

Until the annotator reaches a decision, he/she is likely to leave the data un-annotated:
- the annotator may postpone the annotation until a decision is reached or may annotate using a question mark.
- consequently the annotated data are neither fully analyzed or exploited, nor shared or archived.
- The way in which a dataset was annotated needs to be described carefully for the data to be useful: if an annotator can label with a "?" and revisit it later, the labels may differ from those he/she would have assigned without that option.

Manual annotation. Motivation (2 and 3)

To allow other annotators to share experience about this annotation. Perhaps a similar hard-to-annotate concept has been encountered by another annotator, and if this problem is truly new, then:
- at least this uncertain concept can be studied!
In some cases, sharing experiences about uncertain annotations can lead to certainty. In some other cases, annotating uncertainty will highlight the fact that a concept is inherently indeterminate.

Annotation: the most common practices

The annotation of recordings concerns many sub-fields of Linguistics, such as Phonetics, Prosody, Gestures or Discourse...
Corpora are annotated with detailed information at various linguistic levels thanks to annotation software.
None of the existing audio/video annotation tools allows the annotator's uncertainty to be represented.

Annotation in software

There has been a tendency to ignore phenomena other than those that can be easily represented as time intervals for the temporal relation and a simple string for the label of the annotation.

UML representation

Annotation representation (Praat / Phonedit)

Praat: Short TextGrid

Praat: Long TextGrid

Phonedit: mrk

Annotation representation (Elan)

Elan: eaf

Annotation representation: discrepancy

The suggestion that a concept can be adequately represented by this information alone is false, as is well known to anyone who has had to model both time and label.

By using current annotation frameworks and tools, how do annotators formulate their uncertainty?

Annotation representation: common practice (1)

Creating a new tier, then adding each comment as an annotation in its own right.

Annotation representation: common practice (2)

Adding comments between brackets, braces or parentheses, and/or adding a question mark inside the label of the annotation itself.

What about explicitly modelling uncertainties of annotations?

After all, if they can be identified, they can be modelled.
A general speech annotation framework therefore needs to allow the representation of uncertainty, for these annotations to become part of the framework itself.

Issues addressed: imprecision and indeterminacy

Three main issues

To represent the precision/imprecision of a time value (segmenting)
To assign a location to the annotation (segmenting)
To assign a text to the annotation (labelling)

Representing the precision of a time value

Several factors impact the precision

the human decision (the aims of the work, the time dedicated to annotating, etc.);

with audio or video media, the best possible precision corresponds to the duration of one frame of the media:
- video: the duration of one picture is often 0.04 s
- audio sampled at 20000Hz: one frame is 0.00005 s

the graphical user interface of the tool: in the best case, the precision of the "time value" corresponds to 1 pixel.

1 sec. on 1000px implies 1px = 1ms:
- if sound at 20000Hz: 1px => 20 frames
- if video at 25fps: 1px => 0.025 frame

the file format of the annotation tool.

Praat: Long TextGrid Elan: eaf
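
The frame and pixel figures above can be checked with a few lines of arithmetic. This is only a worked example; the function names are chosen for illustration.

```python
def frame_duration(rate_hz):
    """Duration in seconds of one frame (one audio sample or one video picture)."""
    return 1.0 / rate_hz


def frames_per_pixel(displayed_seconds, width_px, rate_hz):
    """Number of media frames covered by one pixel of the display."""
    return displayed_seconds * rate_hz / width_px


# Best possible precision allowed by the media itself.
print("video, 25 fps:", frame_duration(25), "s per frame")
print("audio, 20000 Hz:", frame_duration(20000), "s per frame")

# Precision allowed by the interface: 1 s displayed over 1000 px, i.e. 1 px = 1 ms.
print("audio frames per pixel:", frames_per_pixel(1.0, 1000, 20000))
print("video frames per pixel:", frames_per_pixel(1.0, 1000, 25))
```

With sound, one pixel hides 20 frames, so a boundary placed with the mouse cannot be more precise than that; with video, one pixel is a small fraction of a frame, so the media itself is the limiting factor.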

Proposed representation of a point in Time

Our proposal is to define a TimePoint \(X\) as an imprecise value ranging from a time value \(x^-\) to a time value \(x^+\).
The questions that then arise are:
- how to represent such intervals and
- how to maintain compatibility with current systems.

Model a TimePoint \(X\) as \(X \in \mathbb{R}_+, X=(M_X,R_X)\) where:
- \(M_X\) is the midpoint (center) of the TimePoint and
- \(R_X\) is the radius representing the vagueness of \(X\).
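
The midpoint/radius definition above can be sketched as a small Python class. This is an illustrative sketch, not the SPPAS API: the class layout, the method names `from_bounds` and `may_equal`, and the validation rules are assumptions. It relies only on \(x^- = M_X - R_X\) and \(x^+ = M_X + R_X\), which follow from the definitions of midpoint and radius.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimePoint:
    """Imprecise time value X = (M_X, R_X), in seconds."""
    midpoint: float
    radius: float = 0.0  # radius 0 gives an ordinary, precise time value

    def __post_init__(self):
        # Assumption: time values and radii are never negative.
        if self.midpoint < 0 or self.radius < 0:
            raise ValueError("midpoint and radius must be non-negative")

    @property
    def lower(self):
        """x^-, the earliest possible time value."""
        return self.midpoint - self.radius

    @property
    def upper(self):
        """x^+, the latest possible time value."""
        return self.midpoint + self.radius

    @classmethod
    def from_bounds(cls, lower, upper):
        """Build a TimePoint from the range [x^-, x^+]."""
        if upper < lower:
            raise ValueError("upper bound must not be before lower bound")
        return cls((lower + upper) / 2.0, (upper - lower) / 2.0)

    def may_equal(self, other):
        """True if both imprecise values share at least one possible time value."""
        return abs(self.midpoint - other.midpoint) <= self.radius + other.radius


about_here = TimePoint.from_bounds(1.0, 1.5)
print(about_here, about_here.lower, about_here.upper)
print(TimePoint(1.3).may_equal(about_here))
```

Keeping the radius as an optional field is one plausible way to stay compatible with current systems: a file that stores only precise time values maps to points with a radius of 0, and exporting to such a format can simply write the midpoint.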

Benefits

This proposal solves the four problems mentioned previously.
It allows the annotator to annotate the localization of an interval as "the annotation starts about here, and ends about there".
It may in some cases prevent the analyst from drawing wrong conclusions unsupported by the original data.

Graphical representation

Fixing the location of an annotation

This issue consists in representing the following assumption: "the annotation either is here, or there".

Example, from (Rohde, 2007)

(1) John(Source) handed a book to Bob(Goal). He _________.
(2) John(Source) was handing a book to Bob(Goal). He _____.

The context sentences in (1) and (2) contain two possible referents for the pronoun 'He', one that appears in subject position and fills the Source thematic role, and one that appears as the object of a prepositional phrase and fills the Goal thematic role.

Assigning a label

Allow multiple labels to be selected and assign a score to each possible label.
Note that the label is then a list of text/score pairs rather than plain 2-tuples: two labels are equal if their texts are equal, regardless of their scores.
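
The difference between a text/score pair and a plain 2-tuple can be sketched as follows. The class names `ScoredText` and `Label` are hypothetical and not taken from SPPAS; comparing whole labels by their set of texts, ignoring order, is an assumption made for this example.

```python
class ScoredText:
    """One possible text for a label, with its score."""

    def __init__(self, text, score=1.0):
        self.text = text
        self.score = score

    def __eq__(self, other):
        # Unlike a 2-tuple, equality ignores the score.
        if not isinstance(other, ScoredText):
            return NotImplemented
        return self.text == other.text

    def __hash__(self):
        return hash(self.text)

    def __repr__(self):
        return f"ScoredText({self.text!r}, {self.score})"


class Label:
    """An ambiguous label: a list of text/score pairs."""

    def __init__(self, pairs):
        self.candidates = [ScoredText(text, score) for text, score in pairs]
        if not self.candidates:
            raise ValueError("a label needs at least one text")

    def best(self):
        """Return the candidate with the highest score."""
        return max(self.candidates, key=lambda c: c.score)

    def __eq__(self, other):
        # Assumption: labels are equal when they propose the same texts.
        if not isinstance(other, Label):
            return NotImplemented
        return set(self.candidates) == set(other.candidates)


print(("a", 0.8) == ("a", 0.2))                      # plain tuples compare the score too
print(ScoredText("a", 0.8) == ScoredText("a", 0.2))  # scored texts compare the text only
print(Label([("a", 0.6), ("e", 0.4)]).best())
```

With this choice, a tool that knows nothing about scores can still read the best candidate as a simple string label.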

XML: an (un-realistic) example

An ambiguous label, at an ambiguous location with imprecise time localizations:

Implementation in an Application Programming Interface

Implementing in an API

Several programs were created in the scope of this work, particularly to enable the integration of annotation/analysis, and to support resource conversion and exploitation.
The proposed framework is implemented in Python.
The API is included in SPPAS.

The proposed framework in SPPAS

It is quite easy to read existing annotation file formats and to instantiate their content in the proposed framework.
- Among others, SPPAS can import files from Praat, Transcriber and Elan.
The automatic annotations included in SPPAS currently use the MidPoint/Radius representation.
A prototype query system is available.

Future work

Improve existing import/export
Add other import/export: Glozz, ...
Extend the query system
Automatic annotations extended:
- to use the ambiguous labels
- to use the ambiguous location

Conclusion, then open discussion

To sum up...

The main proposal of our framework is to represent each annotation by:

at least one uncertain time localization, i.e. one of:
- a TimePoint instance \(X=(M_X,R_X)\);
- a TimeInterval \(X=[X^-,X^+]\), where \(X^-=(M_{X^-},R_{X^-})\) and \(X^+=(M_{X^+},R_{X^+})\) are TimePoint instances and \(X^- \neq X^+\). This means that a TimeInterval is a proper interval, that is, neither empty nor degenerate.
an uncertain label, represented by a list of text/score pairs.
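
To make the summary concrete, here is a minimal sketch of an annotation built from these pieces. All class names and fields are chosen for illustration and are not the SPPAS API. Two assumptions are made: the "proper interval" condition is checked on midpoints, and the imprecision of a duration is taken as the sum of the radii of both bounds.

```python
from typing import NamedTuple


class TimePoint(NamedTuple):
    """Imprecise time value X = (M_X, R_X), in seconds."""
    midpoint: float
    radius: float = 0.0


class TimeInterval:
    """Proper interval [X-, X+] bounded by two TimePoint instances."""

    def __init__(self, begin, end):
        # Assumption: "neither empty nor degenerate" is checked on midpoints.
        if not begin.midpoint < end.midpoint:
            raise ValueError("begin must be strictly before end")
        self.begin = begin
        self.end = end

    def duration(self):
        """Duration as an imprecise value: the radii of both bounds add up."""
        return TimePoint(self.end.midpoint - self.begin.midpoint,
                         self.begin.radius + self.end.radius)


class Annotation(NamedTuple):
    """An uncertain time localization plus an uncertain label."""
    location: object  # a TimePoint or a TimeInterval
    label: list       # list of (text, score) pairs


# "The annotation starts about here, and ends about there."
ann = Annotation(
    location=TimeInterval(TimePoint(1.0, 0.25), TimePoint(2.5, 0.5)),
    label=[("a", 0.75), ("e", 0.25)],
)
print(ann.location.duration())
```

Propagating the radii into derived values such as durations is one way to keep an analyst from drawing conclusions that are more precise than the original annotation.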

Integration

Our proposal relies only on the representation of an Annotation, and consequently it could be an extension of many existing annotation tools or schemes.
It is assumed that the existence of corpora annotated with rich information about annotator hesitations or difficulties would support the development and evaluation of NLP systems that exploit such information:
- estimating a Kappa
- exploring/extracting/requesting annotated data

Uncertainty-tolerant framework for multimodal corpus annotation

Brigitte Bigi

LPL - February, 6th, 2015
