SPPAS 4.22

https://sppas.org/

Module sppas.src.resources

Class sppasDictPron

Description

Pronunciation dictionary manager.

A pronunciation dictionary contains a list of tokens, each one with a list of possible pronunciations.

sppasDictPron can load the dictionary from an HTK-ASCII file. Each line of such a file looks like the following:

 acted [acted] { k t e d
 acted(2) [acted] { k t i d

The first column contains the token, optionally followed by the variant number in parentheses. The second column (in brackets) is ignored; it should contain the token. The remaining columns are the phones, separated by whitespace. sppasDictPron accepts missing variant numbers, empty brackets, or missing brackets.
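
The line format described above can be illustrated with a minimal standalone sketch. The helper name `parse_htk_line` is hypothetical and not part of SPPAS; it only mirrors the tolerance for missing variant numbers, empty brackets and missing brackets, and assumes each line holds a token followed by at least one phone.

```python
def parse_htk_line(line):
    """Split one HTK-ASCII dictionary line into (token, phones).

    Hypothetical helper for illustration only; not part of SPPAS.
    """
    line = " ".join(line.split())  # collapse tabs and repeated spaces
    # The token ends at the first '[' if there is one, else at the first space.
    i = line.find("[")
    if i == -1:
        i = line.find(" ")
    token, rest = line[:i].strip(), line[i:]
    # The phones start after ']' if there is one, else after the first space.
    j = rest.find("]")
    if j == -1:
        j = rest.find(" ")
    phones = rest[j + 1:].split()
    # Drop a trailing variant number such as "(2)".
    k = token.find("(")
    if k > -1 and ")" in token[k:]:
        token = token[:k]
    return token.lower(), phones


print(parse_htk_line("acted(2) [acted] { k t i d"))
# ('acted', ['{', 'k', 't', 'i', 'd'])
```

The same result is obtained for `acted [] { k t i d` and for `acted { k t i d`.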

Example
 >>> d = sppasDictPron('eng.dict')
 >>> d.add_pron('acted', '{ k t e')
 >>> d.add_pron('acted', '{ k t i')

Then, the phonetization of a token can be accessed with the get_pron() method:

Example
 >>> print(d.get_pron('acted'))
 {-k-t-e-d|{-k-t-i-d|{-k-t-e|{-k-t-i

The following convention is adopted to represent the pronunciation variants:

  • '-' separates the phones (X-SAMPA standard)
  • '|' separates the variants
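
As a small illustration of this convention, a pronunciation string can be split back into variants and phones with plain string operations. The function name `split_variants` is hypothetical; only the two separator characters come from the convention above.

```python
def split_variants(pron, variant_sep="|", phone_sep="-"):
    """Return one list of phones per pronunciation variant."""
    return [variant.split(phone_sep) for variant in pron.split(variant_sep)]


variants = split_variants("{-k-t-e-d|{-k-t-i-d")
print(variants)
# [['{', 'k', 't', 'e', 'd'], ['{', 'k', 't', 'i', 'd']]
```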

Note that tokens in the dict are case-insensitive: format_token() lowercases them before any lookup or insertion.

Constructor

Create a sppasDictPron instance.

A dump file is a binary version of the dictionary. It is larger than the original ASCII dictionary, but loading it is two to three times faster.

Parameters
  • dict_filename: (str) Name of the file of the pronunciation dict
  • nodump: (bool) If True, neither load nor create a dump file.
View Source
def __init__(self, dict_filename=None, nodump=False):
    """Create a sppasDictPron instance.

    A dump file is a binary version of the dictionary. Its size is greater
    than the original ASCII dictionary but the time to load is divided
    by two or three.

    :param dict_filename: (str) Name of the file of the pronunciation dict
    :param nodump: (bool) Create or not a dump file.

    """
    self._filename = ''
    self._dict = dict()
    if dict_filename is not None:
        self._filename = dict_filename
        dp = sppasDumpFile(dict_filename)
        data = None
        if nodump is False:
            data = dp.load_from_dump()
        if data is None:
            self.load(dict_filename)
            if nodump is False:
                dp.save_as_dump(self._dict)
        else:
            self._dict = data
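
The load-or-parse flow of the constructor can be pictured with the following minimal sketch. It assumes pickle as the binary format and a hypothetical ".dump" suffix next to the source file; `load_with_dump` and the toy `parse` function are illustrative names, and sppasDumpFile may well work differently.

```python
import os
import pickle
import tempfile


def load_with_dump(path, parse, nodump=False):
    """Return parse(path), reusing a pickled copy of the result if present."""
    dump_path = path + ".dump"  # hypothetical naming scheme
    if not nodump and os.path.isfile(dump_path):
        with open(dump_path, "rb") as fd:
            return pickle.load(fd)
    data = parse(path)
    if not nodump:
        with open(dump_path, "wb") as fd:
            pickle.dump(data, fd)
    return data


calls = []


def parse(path):
    """Toy parser: records each call and reads 'token phones' lines."""
    calls.append(path)
    with open(path, encoding="utf-8") as fd:
        return dict(line.split(None, 1) for line in fd.read().splitlines() if line)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "toy.dict")
    with open(src, "w", encoding="utf-8") as fd:
        fd.write("acted { k t e d\n")
    first = load_with_dump(src, parse)   # parses the text file, writes the dump
    second = load_with_dump(src, parse)  # read back from the dump, no parsing
```

After both calls, `parse` has run only once, which is the point of keeping a dump.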

Public functions

get_filename

Return the name of the file from which the dict was loaded.

View Source
def get_filename(self):
    """Return the name of the file from which the dict comes from."""
    return self._filename

get_unkstamp

Return the unknown words stamp.

View Source
def get_unkstamp(self):
    """Return the unknown words stamp."""
    return symbols.unk

get

Return the pronunciations of an entry in the dictionary.

Parameters
  • entry: (str) A token to find in the dictionary
  • substitution: (str) String to return if the token is missing from the dict
Returns
  • Unicode string of the pronunciations, or the substitution.
View Source
def get(self, entry, substitution=symbols.unk):
    """Return the pronunciations of an entry in the dictionary.

        :param entry: (str) A token to find in the dictionary
        :param substitution: (str) String to return if the token is missing from the dict
        :returns: unicode of the pronunciations or the substitution.

        """
    s = sppasDictPron.format_token(entry)
    return self._dict.get(s, substitution)

get_pron

Return the pronunciations of an entry in the dictionary.

Parameters
  • entry: (str) A token to find in the dictionary
Returns
  • Unicode string of the pronunciations, or the unknown stamp.
View Source
def get_pron(self, entry):
    """Return the pronunciations of an entry in the dictionary.

        :param entry: (str) A token to find in the dictionary
        :returns: unicode of the pronunciations or the unknown stamp.

        """
    s = sppasDictPron.format_token(entry)
    p = self._dict.get(s, symbols.unk)
    if p is None:
        return symbols.unk
    return p

is_unk

Return True if an entry is unknown (not in the dictionary).

Parameters
  • entry: (str) A token to find in the dictionary
Returns
  • bool
View Source
def is_unk(self, entry):
    """Return True if an entry is unknown (not in the dictionary).

        :param entry: (str) A token to find in the dictionary
        :returns: bool

        """
    return sppasDictPron.format_token(entry) not in self._dict

is_pron_of

Return True if pron is a pronunciation of entry.

Phonemes of pron are separated by "-".

Parameters
  • entry: (str) A unicode token to find in the dictionary
  • pron: (str) A unicode pronunciation
Returns
  • bool
View Source
def is_pron_of(self, entry, pron):
    """Return True if pron is a pronunciation of entry.

        Phonemes of pron are separated by "-".

        :param entry: (str) A unicode token to find in the dictionary
        :param pron: (str) A unicode pronunciation
        :returns: bool

        """
    s = sppasDictPron.format_token(entry)
    if s in self._dict:
        p = sppasUnicode(pron).to_strip()
        return p in self._dict[s].split(separators.variants)
    return False

format_token

Remove CR/LF, tabs, multiple spaces and other such characters, then convert the token to lowercase.

Parameters
  • entry: (str) a token
Returns
  • formatted token
View Source
@staticmethod
def format_token(entry):
    """Remove the CR/LF, tabs, multiple spaces and others... and lowerise.

        :param entry: (str) a token
        :returns: formatted token

        """
    t = sppasUnicode(entry).to_strip()
    return sppasUnicode(t).to_lower()

add_pron

Add a token/pron to the dict.

Parameters
  • token: (str) Unicode string of the token to add
  • pron: (str) A pronunciation in which the phonemes are separated by whitespace
View Source
def add_pron(self, token, pron):
    """Add a token/pron to the dict.

        :param token: (str) Unicode string of the token to add
        :param pron: (str) A pronunciation in which the phonemes are separated by whitespace

        """
    entry = sppasDictPron.format_token(token)
    new_pron = sppasUnicode(pron).to_strip()
    new_pron = new_pron.replace(' ', separators.phonemes)
    cur_pron = ''
    if entry in self._dict:
        if self.is_pron_of(entry, new_pron) is False:
            cur_pron = self.get_pron(entry) + separators.variants
        else:
            cur_pron = self.get_pron(entry)
            new_pron = ''
    new_pron = cur_pron + new_pron
    self._dict[entry] = new_pron
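
The merging behaviour of add_pron can be sketched on a plain dict, without any SPPAS dependency. The function `add_variant` is a hypothetical stand-in that assumes '|' and '-' as separators and approximates format_token() with whitespace collapsing and lowercasing.

```python
def add_variant(dictionary, token, pron, variant_sep="|", phone_sep="-"):
    """Add a whitespace-separated pronunciation unless it is already stored."""
    entry = " ".join(token.split()).lower()
    new_pron = phone_sep.join(pron.split())
    current = dictionary.get(entry)
    if current is None:
        dictionary[entry] = new_pron
    elif new_pron not in current.split(variant_sep):
        dictionary[entry] = current + variant_sep + new_pron


d = {}
add_variant(d, "Acted", "{ k t e d")
add_variant(d, "acted", "{ k t i d")
add_variant(d, "ACTED", "{ k t e d")  # already known: nothing is added
print(d)
# {'acted': '{-k-t-e-d|{-k-t-i-d'}
```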

map_phones

Create a new dictionary by changing the phoneme strings.

Perform changes depending on a mapping table.

Parameters
  • map_table: (Mapping) A mapping table
Returns
  • a sppasDictPron instance with mapped phones
View Source
def map_phones(self, map_table):
    """Create a new dictionary by changing the phoneme strings.

        Perform changes depending on a mapping table.

        :param map_table: (Mapping) A mapping table
        :returns: a sppasDictPron instance with mapped phones

        """
    map_table.set_reverse(True)
    delimiters = [separators.variants, separators.phonemes]
    new_dict = sppasDictPron()
    for key, value in self._dict.items():
        new_dict._dict[key] = map_table.map(value, delimiters)
    return new_dict

load

Load a pronunciation dictionary.

Parameters
  • filename: (str) Pronunciation dictionary file name
View Source
def load(self, filename):
    """Load a pronunciation dictionary.

        :param filename: (str) Pronunciation dictionary file name

        """
    try:
        with codecs.open(filename, 'r', sg.__encoding__) as fd:
            self._filename = filename
            first_line = fd.readline()
    except IOError:
        raise FileIOError(filename)
    except UnicodeDecodeError:
        raise FileUnicodeError(filename)
    if first_line.startswith('<?xml'):
        self.load_from_pls(filename)
    else:
        self.load_from_ascii(filename)

load_from_ascii

Load a pronunciation dictionary from an HTK-ASCII file.

Parameters
  • filename: (str) Pronunciation dictionary file name
View Source
def load_from_ascii(self, filename):
    """Load a pronunciation dictionary from an HTK-ASCII file.

        :param filename: (str) Pronunciation dictionary file name

        """
    try:
        with codecs.open(filename, 'r', sg.__encoding__) as fd:
            lines = fd.readlines()
    except Exception:
        raise FileIOError(filename)
    for l, line in enumerate(lines):
        uline = sppasUnicode(line).to_strip()
        if len(uline) == 0:
            continue
        if len(uline) == 1:
            raise FileFormatError(l, uline)
        i = uline.find('[')
        if i == -1:
            i = uline.find(' ')
        entry = uline[:i]
        endline = uline[i:]
        j = endline.find(']')
        if j == -1:
            j = endline.find(' ')
        new_pron = endline[j + 1:]
        i = entry.find('(')
        if i > -1:
            if ')' in entry[i:]:
                entry = entry[:i]
        self.add_pron(entry, new_pron)

save_as_ascii

Save the pronunciation dictionary in HTK-ASCII format.

Parameters
  • filename: (str) Dictionary file name
  • with_variant_nb: (bool) Write the variant number or not
  • with_filled_brackets: (bool) Fill the bracket with the token
View Source
def save_as_ascii(self, filename, with_variant_nb=True, with_filled_brackets=True):
    """Save the pronunciation dictionary in HTK-ASCII format.

        :param filename: (str) Dictionary file name
        :param with_variant_nb: (bool) Write the variant number or not
        :param with_filled_brackets: (bool) Fill the bracket with the token

        """
    try:
        with codecs.open(filename, 'w', encoding=sg.__encoding__) as output:
            for entry, value in sorted(self._dict.items(), key=lambda x: x[0]):
                variants = value.split(separators.variants)
                for i, variant in enumerate(variants, 1):
                    variant = variant.replace(separators.phonemes, ' ')
                    brackets = entry
                    if with_filled_brackets is False:
                        brackets = ''
                    if i > 1 and with_variant_nb is True:
                        line = '{:s}({:d}) [{:s}] {:s}\n'.format(entry, i, brackets, variant)
                    else:
                        line = '{:s} [{:s}] {:s}\n'.format(entry, brackets, variant)
                    output.write(line)
    except Exception as e:
        logging.info('Saving the dictionary in ASCII failed: {:s}'.format(str(e)))
        return False
    return True
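
To make the effect of the two options concrete, the line formatting can be reproduced in isolation. `htk_lines` is a hypothetical helper that assumes '|' and '-' as separators; it builds the lines for one entry instead of writing them to a file.

```python
def htk_lines(entry, pron, with_variant_nb=True, with_filled_brackets=True):
    """Return the HTK-ASCII lines of one entry, one line per variant."""
    brackets = entry if with_filled_brackets else ""
    lines = []
    for i, variant in enumerate(pron.split("|"), 1):
        phones = variant.replace("-", " ")
        if i > 1 and with_variant_nb:
            # Only the second and later variants carry a number.
            lines.append("{}({}) [{}] {}".format(entry, i, brackets, phones))
        else:
            lines.append("{} [{}] {}".format(entry, brackets, phones))
    return lines


for line in htk_lines("acted", "{-k-t-e-d|{-k-t-i-d"):
    print(line)
# acted [acted] { k t e d
# acted(2) [acted] { k t i d
```

With both options set to False, the two lines become `acted [] { k t e d` and `acted [] { k t i d`.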

load_from_pls

Load a pronunciation dictionary from a PLS file (XML).

xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"

Parameters
  • filename: (str) Pronunciation dictionary file name
View Source
def load_from_pls(self, filename):
    """Load a pronunciation dictionary from a pls file (xml).

        xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"

        :param filename: (str) Pronunciation dictionary file name

        """
    try:
        tree = ET.parse(filename)
        root = tree.getroot()
        try:
            uri = root.tag[:root.tag.index('}') + 1]
        except ValueError:
            uri = ''
    except Exception as e:
        logging.info('{:s}: {:s}'.format(str(FileIOError(filename)), str(e)))
        raise FileIOError(filename)
    conversion = dict()
    alphabet = root.attrib['alphabet']
    if alphabet == 'ipa':
        conversion = sppasDictPron.load_sampa_ipa()
    for lexeme_root in tree.iter(tag=uri + 'lexeme'):
        grapheme_root = lexeme_root.find(uri + 'grapheme')
        if grapheme_root.text is None:
            continue
        grapheme = grapheme_root.text
        for phoneme_root in lexeme_root.findall(uri + 'phoneme'):
            if phoneme_root.text is None:
                continue
            phoneme = sppasUnicode(phoneme_root.text).to_strip()
            if len(phoneme) == 0:
                continue
            if alphabet == 'ipa':
                phoneme = sppasDictPron.ipa_to_sampa(conversion, phoneme)
            self.add_pron(grapheme, phoneme)
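
The namespace handling used here can be tried on a small in-memory document. The sketch below relies only on the standard xml.etree.ElementTree module; the sample lexicon and the `read_pls` function are hypothetical, the alphabet is assumed not to be IPA so no conversion is involved, and lexemes without a grapheme element are skipped.

```python
import xml.etree.ElementTree as ET

# Hypothetical PLS document with one lexeme and two pronunciations.
PLS_SAMPLE = """<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="x-sampa">
  <lexeme>
    <grapheme>acted</grapheme>
    <phoneme>{ k t e d</phoneme>
    <phoneme>{ k t i d</phoneme>
  </lexeme>
</lexicon>
"""


def read_pls(xml_text):
    """Return {grapheme: [pronunciation, ...]} from a PLS document."""
    root = ET.fromstring(xml_text)
    # ElementTree stores a namespaced tag as "{uri}name"; keep the "{uri}" part.
    uri = root.tag[:root.tag.index("}") + 1] if "}" in root.tag else ""
    entries = {}
    for lexeme in root.iter(uri + "lexeme"):
        grapheme = lexeme.find(uri + "grapheme")
        if grapheme is None or grapheme.text is None:
            continue
        for phoneme in lexeme.findall(uri + "phoneme"):
            text = " ".join((phoneme.text or "").split())
            if text:
                entries.setdefault(grapheme.text, []).append(text)
    return entries


print(read_pls(PLS_SAMPLE))
# {'acted': ['{ k t e d', '{ k t i d']}
```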

load_sampa_ipa

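Load the sampa-ipa conversion file.

Return it as a dict() that maps the second column of each line to the first one, so that ipa_to_sampa() can look up IPA symbols in it.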

View Source
@staticmethod
def load_sampa_ipa():
    """Load the sampa-ipa conversion file.

        Return it as a dict().

        """
    conversion = dict()
    ipa_sampa_mapfile = os.path.join(paths.resources, 'dict', 'sampa-ipa.txt')
    with codecs.open(ipa_sampa_mapfile, 'r', 'utf-8') as f:
        for line in f.readlines():
            tab_line = line.split()
            if len(tab_line) > 1:
                conversion[tab_line[1].strip()] = tab_line[0].strip()
    return conversion

ipa_to_sampa

Convert a string in IPA to SAMPA.

Parameters
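  • conversion: (dict) Conversion table, as returned by load_sampa_ipa()
  • ipa_entry: (str) A string of IPA symbols
Returns
  • (str) The SAMPA phonemes, joined by the phoneme separator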
View Source
@staticmethod
def ipa_to_sampa(conversion, ipa_entry):
    """Convert a string in IPA to SAMPA.

        :param conversion: (dict)
        :param ipa_entry: (str)

        """
    sampa = list()
    for p in ipa_entry:
        sampa_p = conversion.get(p, '_')
        if sampa_p != '_':
            # Attach length marks and diacritics to the previous phoneme, if any.
            if len(sampa) > 0 and (sampa_p == ':' or sampa_p.startswith('_')):
                sampa[-1] = sampa[-1] + sampa_p
            else:
                sampa.append(sampa_p)
    return separators.phonemes.join(sampa)
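
A standalone sketch of this conversion, with a tiny hand-made table, shows how a length mark is attached to the preceding phoneme. The table `TABLE` is hypothetical and far smaller than the real sampa-ipa file, '-' is assumed as the phoneme separator, and `ipa_to_sampa_sketch` is an illustrative name.

```python
def ipa_to_sampa_sketch(conversion, ipa_entry, phone_sep="-"):
    """Convert a string of IPA symbols to SAMPA, one symbol at a time."""
    sampa = []
    for symbol in ipa_entry:
        mapped = conversion.get(symbol, "_")
        if mapped == "_":
            continue  # unknown symbols are skipped
        if sampa and (mapped == ":" or mapped.startswith("_")):
            sampa[-1] += mapped  # length mark or diacritic: extend the last phoneme
        else:
            sampa.append(mapped)
    return phone_sep.join(sampa)


# Hypothetical IPA -> SAMPA table; "\u02d0" is the IPA length mark.
TABLE = {"k": "k", "i": "i", "p": "p", "\u02d0": ":"}

print(ipa_to_sampa_sketch(TABLE, "ki\u02d0p"))
# k-i:-p
```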

Overloads

__len__

View Source
def __len__(self):
    return len(self._dict)

__contains__

View Source
def __contains__(self, item):
    s = sppasDictPron.format_token(item)
    return s in self._dict

__iter__

View Source
def __iter__(self):
    for a in self._dict:
        yield a