Ortolang repository

About

The Ortolang repository provides linguistic resources for download. It contains two types of data, organized into separate folders:

  • lang folder: This folder contains language-specific resources required for automatic annotation. Each language may include lexicons, pronunciation dictionaries, acoustic models, and/or syllabification rules. These resources are provided as ZIP files, named using the ISO 639-3 language code (e.g., fra for French, eng for English, cmn for Mandarin Chinese). For a complete list of language codes, visit http://www-01.sil.org/iso639-3/.
  • annot folder: This folder contains additional linguistic resources needed for certain types of automatic annotation, such as statistical models. These resources are also available as ZIP files, with one file per supported annotation type.
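The naming convention above can be sketched in Python. This is only an illustration under stated assumptions: it supposes the archives have already been downloaded into a local directory that mirrors the repository layout, and that a language archive is named with its code plus a .zip extension (for example lang/fra.zip). The function names language_archive and extract_language are hypothetical helpers and are not part of SPPAS.

```python
import zipfile
from pathlib import Path


def language_archive(repo_dir: Path, code: str) -> Path:
    """Return the assumed path of a language archive, e.g. <repo_dir>/lang/fra.zip."""
    code = code.strip().lower()
    # ISO 639-3 codes are exactly three ASCII letters.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not an ISO 639-3 code: {code!r}")
    # Assumption: one ZIP per language, named "<code>.zip", inside "lang".
    return repo_dir / "lang" / f"{code}.zip"


def extract_language(repo_dir: Path, code: str, dest: Path) -> list[str]:
    """Extract a language archive into dest and return the names of its members."""
    archive = language_archive(repo_dir, code)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

For instance, extract_language(Path("ortolang"), "fra", Path("resources")) would unpack the French package into a resources directory, provided that ortolang/lang/fra.zip exists locally.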

The available resources were initially created for use with SPPAS, but since they are open source, they can be freely downloaded, used with other annotation tools, and, in most cases, modified or redistributed.

Repository version history

Version 1 - June 2020

  • Linguistic resources for Text Normalization, Phonetization, Alignment and Syllabification of SPPAS for the following languages: cat, cmn, deu, eng, fra, frq, hun, ita, jpn, kor, nan, pcm, pol, por, spa, vie, yue.
  • Data resources for face detection, face landmark and LfPC automatic annotations of SPPAS.

Version 2 - July 2020

  • Updated data for face detection and face landmark.

Version 3 - Sept 2020

  • Updated linguistic resources for the Polish language.
  • Added this documentation.

Version 4 - Feb 2021

  • A DNN model was added to the Face Detection package.
  • The file fra.txt of the LfPC package was corrected by experts.
  • The lightness of the hand pictures in the LPC package was adjusted.

Version 5 - Sept 2021

  • New acoustic model for Italian: it is no longer a context-dependent model but a monophone model, as for the other languages. The French HMMs of a~ and O~, and the Naija HMM of e~, were added to the hmmdefs.
  • The LPC file fra.txt is renamed cueConfig-fra.txt.
  • The keys of the LPC vowels are now coded with the characters b, s, m, c, t instead of numbers. This change is not compatible with versions 3.x of SPPAS.
  • The keys of the LfPC consonants are coded differently: they now use the same keys as those previously defined for English.

Version 6 - Nov 2021

  • New resources for the Bengali language: vocabulary, pronunciation dictionary and acoustic model.

Version 7 - 2022

  • Updated resources for the Bengali language.

Version 8 - 2023

  • Updated CuedSpeech resources (renamed from LfPC).
  • Added resources for the Persian language.

Version 9 - March 2025

  • Updated CuedSpeech resources.
  • Added resources for Dutch.
  • Added samples of audio files and transcripts in each language resource package.