CLeLfPC

Corpus de Lecture en Langue française Parlée Complétée

Created in 2021 by Brigitte Bigi and Maryvonne Zimmermann

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. It can be used for any research or teaching purpose.

Download the corpus

We asked the 23 volunteer participants to read aloud and to cue one topic among the 10 we prepared. Two participants agreed to read 2 different topics. Each topic consisted of 4 sessions, recorded separately so that the participant could take a short break:

32 isolated “CV” syllables;
32 isolated words or phrases;
isolated sentences;
a text divided into 4-7 parts.

The corpus comprises 4 hours of high-quality audio-video recordings.

Corpus Grenelle II and its annotations

Created in 2010-2011 by Brigitte Bigi, Cristel Portes, Agnès Steuckardt, Marion Tellier.

Video downloaded from an FTP server (after authorization), as a poor-quality FLV file
Audio extracted from the video
A large amount of time-aligned annotations
Freely available: SLDR 000729

Demo

Grenelle II annotations

Enriched orthographic transcription (manual), time-aligned at the utterance level (automatic)
Time-aligned phonemes, tokens and events (automatic)
Time-aligned syllables (automatic)
Prosodic contours and intonation (manual)
Morpho-syntax time-aligned at the token level (automatic)
Hand gestures
Self-repetitions (semi-automatic)
Interruptions (manual)

Publications

B. Bigi, C. Portes, A. Steuckardt, M. Tellier
Multimodal Annotations and Categorization for Political Debates,
ICMI Workshop on Multimodal Corpora for Machine Learning (ICMI-MMC), Alicante (Spain), 2011

B. Bigi, C. Portes, A. Steuckardt, M. Tellier
A Multimodal Study of Answers to Disruptions,
Journal on Multimodal User Interfaces, Volume 7, Issue 1, Pages 55-66, Springer (Publisher). ISSN 1783-7677. DOI 10.1007/s12193-012-0110-zi, 2012

Corpus MARC-Fr

Created in 2011 by Brigitte Bigi and Pauline Péri

Description

French corpus, manually phonetized and time-aligned at the phoneme level. It lasts 7 minutes (5400 phones) and consists of 3 extracts from the following corpora: CID, AixOx and Grenelle.

Download

the audio files in wav format
the orthographic transcription, time-aligned in IPUs, in TextGrid format
the manually time-aligned phonemes in TextGrid and ctm formats

Freely available for download: SLDR 000786
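
The time-aligned phonemes distributed in ctm format can be loaded with a few lines of code. The sketch below is a minimal illustration, assuming the files follow the usual NIST CTM layout (source, channel, start time, duration, label, optional confidence score, with ";;" comment lines); this layout has not been checked against the actual corpus files. The names Phone, parse_ctm and phone_rate, as well as the sample lines, are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Phone(NamedTuple):
    """One time-aligned phone read from a CTM line (hypothetical helper type)."""
    source: str
    channel: str
    start: float      # start time, in seconds
    duration: float   # duration, in seconds
    label: str


def parse_ctm(lines: Iterable[str]) -> List[Phone]:
    """Parse CTM lines, assuming: source channel start duration label [score]."""
    phones = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and ";;" comment lines.
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"Malformed CTM line: {raw!r}")
        source, channel, start, duration, label = fields[:5]
        phones.append(Phone(source, channel, float(start), float(duration), label))
    return phones


def phone_rate(phones: List[Phone]) -> float:
    """Number of phones per second of aligned speech (0.0 if nothing is aligned)."""
    total = sum(p.duration for p in phones)
    return len(phones) / total if total > 0 else 0.0


# Made-up sample lines, for illustration only (not taken from the corpus).
sample = [
    ";; hypothetical example",
    "extract1 A 0.00 0.08 b",
    "extract1 A 0.08 0.12 o~ 0.99",
]
phones = parse_ctm(sample)
print(len(phones), [p.label for p in phones])
```

For scale, the figures given in the description (5400 phones in 7 minutes) correspond to roughly 13 phones per second over the whole recording, pauses included.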

Publication

B. Bigi, P. Péri, R. Bertrand
Orthographic Transcription: Which Enrichment is required for Phonetization?,
Language Resources and Evaluation Conference, Istanbul (Turkey), pages 1756-1763, ISBN 978-2-9517408-7-7. 2012

Corpus AixOx

Read corpus created between 2010 and 2012 by Sophie Herment, Anastassia Loukina, Anne Tortel, Daniel Hirst, Brigitte Bigi

Description

40 paragraphs of about 1 minute each, in French and English, from the EUROM1 corpus. French texts are read by native French speakers (mainly from Aix-en-Provence) and by native English speakers (from Oxford). English texts are read by native English speakers and by native French speakers.

Download

the audio files in wav format
the orthographic transcription, time-aligned in IPUs, in TextGrid format

Freely available for download: SLDR 000784

Publication

S. Herment, A. Loukina, A. Tortel, D. Hirst, B. Bigi
AixOx, a multi-layered learners corpus: automatic annotation,
Proceedings of the International Conference on Corpus Linguistics, Jaén (Spain), March 2012.

CID - Corpus of Conversational Data

Face-to-face conversations in French
Created by Roxane Bertrand and Béatrice Priego-Valverde
8 semi-guided dialogs (110,000 words)
Recorded in 2003 and 2005
A very large amount of time-aligned annotations
Available at:
- http://sldr.org/sldr000027/
- http://sldr.org/sldr000720/

Demo

Extracts of CID, just for demo

CID - Some of the annotations

Enriched orthographic transcription (manual), time-aligned at the IPU level (automatic)

Time-aligned phonemes and tokens and events like noises, laughter (automatic) and time-aligned syllables (automatic)

Prosodic contours (manual), Momel - modelling of melody (automatic) and INTSINT - INternational Transcription System for INTonation (automatic)

Morpho-syntax and syntax time-aligned at the token level (automatic) and time-aligned lemmas (automatic)

Dysfluencies (manual), Discourse and interaction (manual), Other- and Self- Repetitions (semi-automatic)

Publication

P. Blache, R. Bertrand, B. Bigi, E. Bruno, E. Cela, R. Espesser, G. Ferré, M. Guardiola, D. Hirst, E.-P. Magro, J.-C. Martin, C. Meunier, M.-A. Morel, E. Murisasco, I. Nesterenko, P. Nocera, B. Pallaud, L. Prévot, B. Priego-Valverde, J. Seinturier, N. Tan, M. Tellier, S. Rauzy
Multimodal Annotation of Conversational Data,
The Fourth Linguistic Annotation Workshop, ACL 2010, pages 186-191, Uppsala, Sweden, 2010.