Data Analyses

Introduction

Analyzing annotated files includes computing descriptive statistics, filtering the annotated data to keep only the annotations you are interested in, viewing or editing the information about files, etc.

Like the other features of SPPAS, analyzing data can be performed in three different ways:

the Application Programming Interface, in Python language;
the Command-Line User Interface;
the Graphical User Interface.

Most of the features implemented in the API are available in the GUI, but only a few can be performed with the CLI. This chapter therefore describes only the Analyze page of the GUI.

The page Analyze of the GUI

The page Analyze of the GUI is divided into two main areas: a toolbar and a content area that displays the files.

Displayed files content

Each opened file is displayed in a panel in the content area. The next figure shows what the panels of four different files look like. Panels of annotated files have a yellow background, panels of audio files are blue and panels of unknown files are pinkish-red.

Some actions can be performed individually on the panel of a file. A mouse click on the filename or on the arrow at left will show or hide the content of the file.

For transcription files, the icons at top-right of the panel allow:

editing the metadata of the file - notice that only the XRA file format allows saving the metadata;
selecting the file (for example, as the target into which tiers are pasted);
saving the file - only if it has changed;
closing the file and unlocking it in the workspace.

For audio files, the icons at top-right of the panel allow:

viewing the audio content and managing channels - available only for files shorter than a few minutes;
closing the file.

The toolbar of the page is made of three different parts:

Files: click on these buttons to perform an action on the checked filenames of the page Files;
Tiers: click on these buttons to perform an action on the checked tiers of the opened files;
Annotations: click on these buttons to analyze the annotations of the checked tiers of the opened files.

Files: Open files

Open, load and display a panel for each checked file of the current workspace. Some files may take a long time to load (for example, TextGrid files with a lot of annotations), so a scrollbar indicates the progress. As soon as a file is opened, it is locked and no other page can perform an action on it.

To open new files, check new files in the current workspace and click again on the Open files button. The panels of the newly opened files will be appended to the existing ones.

Files: New file

Click on this button to create a new annotated file. A dialog box will ask for a path/filename and for a file extension. The extension defines the file format; any of the supported file formats can be used. The file will be created on disk when it is saved for the first time.

Files: Save all

Save all the files that have been changed, without asking for confirmation.

Files: Close all

Close all the opened files. If some files were changed and not saved, a dialog will ask for confirmation.

Tiers: Metadata

Open a dialog to edit the metadata of the checked tiers. Notice that most file formats do not allow saving metadata, or allow only some specific ones; only the XRA format can save any metadata.

Tiers: Check

A click on the Check button will open a dialog to enter a tier name, and it will check all tiers matching it. The entry is a regular expression.
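
As an illustration of how a regular expression can select tier names, here is a minimal sketch in plain Python. It does not use the SPPAS API: the function name check_tiers and the tier names are hypothetical, and matching anywhere in the name (re.search) is an assumption, since the dialog may anchor the expression differently.

```python
import re


def check_tiers(tier_names, pattern):
    """Return the tier names matched by the regular expression `pattern`.

    Assumption: the pattern may match anywhere in the name (re.search).
    """
    regex = re.compile(pattern)
    return [name for name in tier_names if regex.search(name)]


# Hypothetical tier names of an opened file.
tiers = ["PhonAlign", "TokensAlign", "Tokens", "IPUs"]
print(check_tiers(tiers, r"Align$"))  # names ending with "Align"
print(check_tiers(tiers, r"^Tok"))    # names starting with "Tok"
```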

Tiers: Uncheck

A click on the Uncheck button will un-check all tiers.

Tiers: Rename

A mouse click on the Rename button will open a dialog to enter a tier name, and all the checked tiers are then renamed with it. If a file already has a tier with the given name, an index number will be appended to the new name.

Tiers: Delete

A mouse click on the Delete button will delete all checked tiers. This process is irreversible. The only way to recover a deleted tier is to close the file and re-open it, but all the unsaved changes are then lost.

Tiers: Cut/Copy/Paste

The Cut/Copy/Paste buttons make use of a clipboard to manage tiers. Tiers can then be copied from one file to others. The files to paste into must be selected first with the select button of their individual panels.

Tiers: Duplicate

The duplicate button is designed to copy/paste a tier into the same file. The name of the duplicated tier will be the same as the original one with an index number at the end.

Tiers: Move Up/Move Down

These buttons move the checked tiers up or down within their file.

Annotations: Radius

The Radius button sets a radius for the localizations of all the annotations of the checked tiers, i.e., the vagueness around each point in time. Notice that only the XRA file format can save it.

Read the following paper for details:

Brigitte Bigi, Tatsuya Watanabe, Laurent Prévot (2014). Representing Multimodal Linguistics Annotated Data, 9th International conference on Language Resources and Evaluation (LREC), Reykjavik (Iceland), pages 3386-3392. ISBN: 978-2-9517408-8-4.

Annotations: View

Click the View button to see all the annotations of the checked tiers in a table.

Annotations: Statistics

The Statistics button estimates the number of occurrences, the durations, etc. of the annotations of the checked tiers, and allows saving the results in CSV (for Excel, OpenOffice, R, MatLab, ...).

It offers a series of sheets organized in a notebook. The first tab displays descriptive statistics of the set of given tiers. Each of the other tabs displays one of the statistics over the given tiers. The following are estimated:

occurrences: the number of observations
total durations: the sum of the durations
mean durations: the arithmetic mean of the durations
median durations: the median value of the distribution of durations
std dev. durations: the standard deviation value of the distribution of durations
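
The statistics listed above can be sketched with the Python standard library. This is only an illustration, not the SPPAS implementation: the function name and the durations are hypothetical, and the use of the population standard deviation is an assumption (the GUI may use the sample estimate instead).

```python
import statistics


def describe_durations(durations):
    """Descriptive statistics of a list of durations, in seconds."""
    return {
        "occurrences": len(durations),
        "total": sum(durations),
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        # Assumption: population standard deviation; replace pstdev by
        # stdev to get the sample estimate.
        "stdev": statistics.pstdev(durations),
    }


# Hypothetical durations of the annotations sharing the same label.
stats = describe_durations([0.08, 0.12, 0.10, 0.10])
for name, value in stats.items():
    print(name, value)
```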

All of them can be estimated on a single annotation label or on a sequence of labels. The length of this sequence can optionally be changed by setting the N-gram value (available from 1 to 5), just above the sheets.
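
To make the N-gram option concrete, the following sketch counts the occurrences of each sequence of N consecutive labels. It is a plain Python illustration under the assumption that an N-gram is a window of N successive annotation labels of a tier; the function name and the labels are hypothetical.

```python
from collections import Counter


def ngram_counts(labels, n=1):
    """Count each sequence of `n` consecutive labels (n from 1 to 5)."""
    if not 1 <= n <= 5:
        raise ValueError("n must be between 1 and 5")
    # zip() over shifted copies of the list yields the sliding windows.
    windows = zip(*(labels[i:] for i in range(n)))
    return Counter(windows)


labels = ["a", "b", "a", "b", "c"]
print(ngram_counts(labels, n=1))
print(ngram_counts(labels, n=2))
```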

Each displayed sheet can be saved as a CSV file, which is a convenient format to be read by R, Excel, OpenOffice, LibreOffice, and so on. To do so, display the sheet you want to save and click on the button Save sheet, just below the sheets. If you plan to open this CSV file with Excel under Windows, it is recommended to change the encoding to UTF-16. In the other cases, UTF-8 is probably the most relevant.
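
The encoding advice above can be reproduced when writing a CSV file from a script. The sketch below uses only the Python csv module; the function name, the file name and the rows are hypothetical, and it says nothing about how the GUI itself writes the file.

```python
import csv
import os
import tempfile


def save_sheet(rows, filename, for_excel_windows=False):
    """Write `rows` to a CSV file and return the encoding that was used."""
    encoding = "utf-16" if for_excel_windows else "utf-8"
    # newline="" lets the csv module manage the line endings itself.
    with open(filename, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(rows)
    return encoding


rows = [["label", "occurrences"], ["@", 12], ["E", 7]]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sheet.csv")
    used = save_sheet(rows, path, for_excel_windows=True)
    # Read the file back with the same encoding to check the round trip.
    with open(path, newline="", encoding=used) as f:
        reread = list(csv.reader(f))
print(used, reread)
```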

Annotation durations are commonly estimated from the Midpoint values, without taking the radius into account; see (Bigi et al., 2012) for explanations about the Midpoint/Radius. Optionally, the duration can be estimated either by taking the vagueness into account (check the Add the radius value button) or by ignoring the vagueness and estimating only on the central part of the annotation (check the Deduct the radius value button).
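
The three ways of estimating a duration can be sketched as follows. This is an illustration only: the Point class and the duration function are hypothetical, and the sketch assumes that adding or deducting the vagueness uses the radius of both the begin and the end points.

```python
from dataclasses import dataclass


@dataclass
class Point:
    midpoint: float      # time value, in seconds
    radius: float = 0.0  # vagueness around the midpoint, in seconds


def duration(begin, end, radius_mode="ignore"):
    """Duration of the interval between two points.

    radius_mode is "ignore" (midpoints only), "add" or "deduct".
    """
    value = end.midpoint - begin.midpoint
    # Assumption: the vagueness of an interval is the sum of both radii.
    margin = begin.radius + end.radius
    if radius_mode == "add":
        return value + margin
    if radius_mode == "deduct":
        return value - margin
    return value


begin, end = Point(1.0, 0.005), Point(1.5, 0.005)
for mode in ("ignore", "add", "deduct"):
    print(mode, round(duration(begin, end, mode), 3))
```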

When estimating statistics on XRA files, you can either estimate them only on the best label (the label with the highest score) or on all labels, i.e., the best label and all its alternatives (if any).

Annotations: Single filter

Define your filters to create new tiers with only the annotations you are interested in!

Pattern selection is an important part of extracting data from a corpus and is obviously an important part of any filtering system. Thus, if the label of an annotation is a string, the following filters are proposed in DataFilter:

exact match: an annotation is selected if its label strictly corresponds to the given pattern;
contains: an annotation is selected if its label contains the given pattern;
starts with: an annotation is selected if its label starts with the given pattern;
ends with: an annotation is selected if its label ends with the given pattern.

All these matches can be reversed to represent respectively: does not exactly match, does not contain, does not start with or does not end with. Moreover, this pattern matching can be case-sensitive or not.
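
The four string filters, their reversed versions and the case-sensitivity option can be summarized in a few lines of plain Python. This sketch is not the SPPAS implementation; the function name and the mode names are hypothetical.

```python
def match_label(label, pattern, mode="exact", case_sensitive=True,
                reverse=False):
    """Tell whether `label` is selected by `pattern` for the given mode."""
    if not case_sensitive:
        label, pattern = label.lower(), pattern.lower()
    tests = {
        "exact": label == pattern,
        "contains": pattern in label,
        "startswith": label.startswith(pattern),
        "endswith": label.endswith(pattern),
    }
    result = tests[mode]
    # A reversed filter selects what the plain filter rejects.
    return not result if reverse else result


print(match_label("Hello", "hel", "startswith"))
print(match_label("Hello", "hel", "startswith", case_sensitive=False))
print(match_label("bat", "a", "contains", reverse=True))
```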

For complex search, a selection based on regular expressions is available for advanced users.

A multiple pattern selection can be expressed in two ways:

enter multiple patterns at the same time (separated by commas) to tell the system to retrieve either one pattern or another;
enter one pattern at a time and choose the appropriate button: Apply All or Apply Any.

Frame to create a filter on annotation label tags: filter annotations that exactly match either 'a' or '@' or 'E'.

Another important feature for a filtering system is the possibility to retrieve annotated data of a certain duration, and in a certain range of time in the timeline.

Frame to create a filter on annotation durations: filter annotations that last more than 80 ms.

A search can also start and/or end at specific time values in a tier.

Frame to create a filter on annotation time values: filter annotations that are starting after the 5th minute.

In SPPAS 3.7, a new filter was added: it selects annotations depending on their number of labels. For example, the automatic annotation Normalization creates a tier Tokens in which each annotation contains a list of labels - one per token; it is then possible to get the annotations with more than three tokens, with only one token, etc.

All the given filters are summarized in the SingleFilter dialog. To complete the filtering process, click on one of the Apply buttons; the resulting tiers are then added to the annotation file(s).

In the given example:

click on Apply All to get the 'a' or '@' or 'E' vowels that last more than 80 ms and start after the 5th minute;
click on Apply Any to get all the 'a' or '@' or 'E' vowels, plus all annotations that last more than 80 ms, plus all annotations that start after the 5th minute.
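
The difference between the two buttons amounts to combining the filters with a logical AND or a logical OR. The sketch below reproduces the example in plain Python, without the SPPAS API: the annotations, the function names and the reading of "after the 5th minute" as a begin time above 300 seconds are assumptions.

```python
def is_target_vowel(ann):
    return ann["label"] in {"a", "@", "E"}


def lasts_more_than_80ms(ann):
    return ann["end"] - ann["begin"] > 0.080


def starts_after_5th_minute(ann):
    # Assumption: "after the 5th minute" means a begin time above 300 s.
    return ann["begin"] > 300.0


FILTERS = [is_target_vowel, lasts_more_than_80ms, starts_after_5th_minute]


def apply_all(annotations, filters):
    """Keep the annotations that satisfy every filter (logical AND)."""
    return [a for a in annotations if all(f(a) for f in filters)]


def apply_any(annotations, filters):
    """Keep the annotations that satisfy at least one filter (logical OR)."""
    return [a for a in annotations if any(f(a) for f in filters)]


# Hypothetical annotations, with times in seconds.
annotations = [
    {"label": "a", "begin": 310.00, "end": 310.12},  # satisfies all filters
    {"label": "a", "begin": 12.00, "end": 12.05},    # vowel only
    {"label": "t", "begin": 320.00, "end": 320.03},  # late start only
    {"label": "s", "begin": 20.00, "end": 20.04},    # satisfies none
]
print(len(apply_all(annotations, FILTERS)))
print(len(apply_any(annotations, FILTERS)))
```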

Read the following publications for details:

Brigitte Bigi (2019). Filtering multi-levels annotated data. In 9th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 13-14, Poznań, Poland.

Brigitte Bigi, Jorane Saubesty (2015). Searching and retrieving multi-levels annotated data, Proceedings of Gesture and Speech in Interaction, Nantes (France).

Annotations: Relation filter

Regarding the searching problem, linguists are typically interested in locating patterns on specific tiers, with the possibility of relating the annotations of one tier to those of another. The proposed system offers a powerful way to request/extract data, with the help of Allen’s interval algebra.

In 1983 James F. Allen published a paper in which he proposed 13 basic relations between time intervals that are distinct, exhaustive, and qualitative:

distinct because no pair of definite intervals can be related by more than one of the relationships;
exhaustive because any pair of definite intervals is described by one of the relations;
qualitative (rather than quantitative) because no numeric time spans are considered.

These relations and the operations on them form Allen’s interval algebra. These relations were extended to Interval-Tiers as well as Point-Tiers, to be used to find/select/filter annotations in any kind of time-aligned tiers.
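
As an illustration of the 13 relations, the following sketch names the relation of one interval to another. It is plain Python written for this chapter, not the SPPAS implementation: the function name and the spelling of the relation names are assumptions, and each interval is a (begin, end) pair with begin strictly lower than end.

```python
def allen_relation(x, y):
    """Name the Allen relation of interval x to interval y.

    Each interval is a (begin, end) pair with begin < end.
    """
    xb, xe = x
    yb, ye = y
    if xe < yb:
        return "before"
    if ye < xb:
        return "after"
    if xe == yb:
        return "meets"
    if ye == xb:
        return "metby"
    if xb == yb and xe == ye:
        return "equals"
    if xb == yb:
        return "starts" if xe < ye else "startedby"
    if xe == ye:
        return "finishes" if xb > yb else "finishedby"
    if yb < xb and xe < ye:
        return "during"
    if xb < yb and ye < xe:
        return "contains"
    # Only the partial overlaps remain at this point.
    return "overlaps" if xb < yb else "overlappedby"


print(allen_relation((0, 2), (1, 3)))  # overlaps
print(allen_relation((1, 2), (0, 3)))  # during
```

Exactly one branch is reached for any pair of intervals, which mirrors the distinct and exhaustive properties, and only comparisons between bounds are used, which mirrors the qualitative property.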

For the sake of simplicity, only the 13 relations of Allen’s algebra are available in the GUI. However, we actually implemented the 25 relations proposed by Pujari et al. (1999) in the INDU model. This model sets constraints on INtervals (with Allen’s relations) and on DUrations (durations are equal, or one is less/greater than the other). Such relations are available when querying with Python.
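
The number 25 can be recovered by pairing each Allen relation with the duration comparisons that are compatible with it. The sketch below is only a counting argument in plain Python; the relation names are hypothetical spellings and it does not show how the relations are requested in SPPAS.

```python
# Duration comparisons (duration of x versus duration of y) that are
# compatible with each Allen relation of x to y.
INDU_DURATIONS = {
    # Both intervals are free to be shorter, equal or longer.
    "before": "<=>", "after": "<=>",
    "meets": "<=>", "metby": "<=>",
    "overlaps": "<=>", "overlappedby": "<=>",
    # x lies inside y, so x is necessarily shorter.
    "starts": "<", "during": "<", "finishes": "<",
    # y lies inside x, so x is necessarily longer.
    "startedby": ">", "contains": ">", "finishedby": ">",
    # Same bounds, hence same duration.
    "equals": "=",
}

indu_relations = [
    (relation, comparison)
    for relation, comparisons in INDU_DURATIONS.items()
    for comparison in comparisons
]
print(len(indu_relations))  # 25
```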

In the first stage, the user must select the tiers to be filtered and click on RelationFilter. The second stage is to select the tier that will be used for time-relations.

The next step consists in checking the Allen’s relations that will be applied. The last stage is to set the name of the resulting tier. The above screenshots illustrate how to select the first phoneme of each token, except for tokens that are made of only one phoneme (in this latter case, the equal relation should be checked).
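
The first-phoneme example can be sketched in plain Python as a filter that keeps the annotations of one tier having a given relation with at least one annotation of another tier. This is not the SPPAS API: the data, the function names and the representation of an annotation as a (begin, end, label) tuple are hypothetical.

```python
def starts(x, y):
    """x starts y: same begin, and x ends before y ends."""
    return x[0] == y[0] and x[1] < y[1]


def equals(x, y):
    """x equals y: same begin and same end."""
    return x[0] == y[0] and x[1] == y[1]


def relation_filter(tier, other, relations):
    """Keep the annotations of `tier` related to any annotation of `other`."""
    return [
        a for a in tier
        if any(rel(a, b) for b in other for rel in relations)
    ]


# Hypothetical time-aligned tiers: (begin, end, label), times in seconds.
tokens = [(0.0, 0.3, "the"), (0.3, 0.4, "a"), (0.4, 0.9, "cat")]
phonemes = [
    (0.0, 0.1, "D"), (0.1, 0.3, "@"), (0.3, 0.4, "a"),
    (0.4, 0.6, "k"), (0.6, 0.8, "{"), (0.8, 0.9, "t"),
]

first = relation_filter(phonemes, tokens, [starts])
print([label for _, _, label in first])
first_or_only = relation_filter(phonemes, tokens, [starts, equals])
print([label for _, _, label in first_or_only])
```

With starts alone, the single phoneme of the token "a" is not selected; adding equals includes it.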

To complete the filtering process, click on the Apply button; the resulting tiers are then added to the annotation file(s).