Brigitte Bigi

CLeLfPC - Corpus de Lecture en Langue française Parlée Complétée

Created in 2021 by Brigitte Bigi and Maryvonne Zimmermann

The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. It can be used for any research or teaching purpose. The corpus is hosted on the ORTOLANG repository.

We asked the 23 volunteer participants to read aloud and to cue one topic among the 10 we prepared. Two participants agreed to read 2 different topics. Each topic was made of 4 sessions, which were recorded separately so that the participant could take a short break:

  1. 32 isolated “CV” syllables;
  2. 32 isolated words or phrases;
  3. isolated sentences;
  4. a text divided into 4-7 parts.

The corpus consists of 4 hours of high-quality audio-video recordings.

Corpus Grenelle II and its annotations

Created in 2010-2011 by Brigitte Bigi, Cristel Portes, Agnès Steuckardt, Marion Tellier.


Grenelle II annotations

  1. Enriched orthographic transcription (manual), time-aligned at the utterance level (automatic)
  2. Time-aligned phonemes, tokens and events (automatic)
  3. Time-aligned syllables (automatic)
  4. Prosodic contours and intonation (manual)
  5. Morpho-syntax time-aligned at the token level (automatic)
  6. Hand gestures
  7. Self-repetitions (semi-automatic)
  8. Interruptions (manual)


B. Bigi, C. Portes, A. Steuckardt, M. Tellier
Multimodal Annotations and Categorization for Political Debates,
ICMI Workshop on Multimodal Corpora for Machine Learning (ICMI-MMC), Alicante (Spain), 2011

B. Bigi, C. Portes, A. Steuckardt, M. Tellier
A Multimodal Study of Answers to Disruptions,
Journal on Multimodal User Interfaces, Volume 7, Issue 1, Pages 55-66, Springer (Publisher). ISSN 1783-7677. DOI 10.1007/s12193-012-0110-zi, 2012

Corpus MARC-Fr

Created in 2011 by Brigitte Bigi and Pauline Péri

Description:

French corpus, manually phonetized and time-aligned at the phoneme level. Its duration is 7 minutes (5400 phones), and it is made of 3 extracts from the following corpora: CID, AixOx and Grenelle.


Freely available for download: SLDR 000786


B. Bigi, P. Péri, R. Bertrand
Orthographic Transcription: Which Enrichment is required for Phonetization?,
Language Resources and Evaluation Conference, Istanbul (Turkey), pages 1756-1763, ISBN 978-2-9517408-7-7. 2012

Corpus AixOx

Read corpus created between 2010 and 2012 by Sophie Herment, Anastassia Loukina, Anne Tortel, Daniel Hirst, Brigitte Bigi


40 paragraphs of about 1 minute each in French and English, from the EUROM 1 corpus. French texts are read by native French speakers (mainly from Aix-en-Provence) and by native English speakers (from Oxford). English texts are read by native English speakers and by native French speakers.


Freely available for download: SLDR 000784


S. Herment, A. Loukina, A. Tortel, D. Hirst, B. Bigi
AixOx, a multi-layered learners corpus: automatic annotation,
Proceedings of the International Conference on Corpus Linguistics, Jaén (Spain), March 2012.

CID - Corpus of Conversational Data


Extracts of CID, just for demo

CID - Some of the annotations

Enriched orthographic transcription (manual), time-aligned at the IPU level (automatic)

Time-aligned phonemes, tokens and events such as noises and laughter (automatic), and time-aligned syllables (automatic)

Prosodic contours (manual), Momel - Modelization of melody (automatic) and INternational Transcription System for INTonation (automatic)

Morpho-syntax and syntax time-aligned at the token level (automatic) and time-aligned lemmas (automatic)

Dysfluencies (manual), Discourse and interaction (manual), Other- and Self- Repetitions (semi-automatic)


P. Blache, R. Bertrand, B. Bigi, E. Bruno, E. Cela, R. Espesser, G. Ferré, M. Guardiola, D. Hirst, E.-P. Magro, J.-C. Martin, C. Meunier, M.-A. Morel, E. Murisasco, I. Nesterenko, P. Nocera, B. Pallaud, L. Prévot, B. Priego-Valverde, J. Seinturier, N. Tan, M. Tellier, S. Rauzy
Multimodal Annotation of Conversational Data,
The Fourth Linguistic Annotation Workshop, ACL 2010, pages 186-191, Uppsala, Sweden, 2010.