Module sppas.src.resources

Class sppasDictPron

Description

Pronunciation dictionary manager.

A pronunciation dictionary contains a list of tokens, each one with a list of possible pronunciations.

sppasDictPron can load the dictionary from an HTK-ASCII file. Each line of such a file looks like the following:

 acted [acted] { k t e d
 acted(2) [acted] { k t i d

The first column indicates the token, optionally followed by the variant number in parentheses. The second column (in brackets) is ignored; it should contain the token. The remaining columns are the phones, separated by whitespace. sppasDictPron accepts missing variant numbers, empty brackets, or missing brackets.
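
The line format described above can be illustrated with a small standalone parser. This is only a sketch that mirrors the rules stated here; `parse_htk_line` is a hypothetical helper, not part of sppasDictPron, and it assumes the line contains at least a token and one phone.

```python
def parse_htk_line(line):
    """Split one HTK-ASCII dictionary line into (token, phones)."""
    # Collapse tabs and repeated spaces so columns are single-space separated.
    line = " ".join(line.split())
    # The token ends at the opening bracket, or at the first space when
    # the bracket column is missing.
    i = line.find("[")
    if i == -1:
        i = line.find(" ")
    token, rest = line[:i].strip(), line[i:]
    # The phones start after the closing bracket, or after the first space.
    j = rest.find("]")
    if j == -1:
        j = rest.find(" ")
    phones = rest[j + 1:].split()
    # Drop a variant number such as "(2)" from the token.
    k = token.find("(")
    if k > -1 and ")" in token[k:]:
        token = token[:k]
    return token, phones


print(parse_htk_line("acted(2) [acted] { k t i d"))
print(parse_htk_line("acted { k t e d"))
```

Both calls return the token `acted` with a five-phone list, whether or not the variant number and the bracket column are present.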

Example

 >>> d = sppasDictPron('eng.dict')
 >>> d.add_pron('acted', '{ k t e')
 >>> d.add_pron('acted', '{ k t i')

Then, the phonetization of a token can be accessed with the get_pron() method:

Example

 >>> print(d.get_pron('acted'))
 {-k-t-e-d|{-k-t-i-d|{-k-t-e|{-k-t-i

The following convention is adopted to represent the pronunciation variants:

'-' separates the phones (X-SAMPA standard)
'|' separates the variants

Note that tokens in the dict are case-insensitive: they are lowercased before being stored or looked up.
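
A string that follows this convention can be decomposed with plain string operations. The sketch below assumes the two separators are exactly the characters listed above; `split_pron` is a hypothetical helper and not a method of the class.

```python
VARIANT_SEP = "|"  # separates the variants, as stated above
PHONE_SEP = "-"    # separates the phones, as stated above


def split_pron(pron):
    """Return a list of variants, each variant being a list of phones."""
    return [variant.split(PHONE_SEP) for variant in pron.split(VARIANT_SEP)]


for phones in split_pron("{-k-t-e-d|{-k-t-i-d"):
    print(phones)
```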

Constructor

Create a sppasDictPron instance.

A dump file is a binary version of the dictionary. It is larger than the original ASCII dictionary, but it loads two to three times faster.

Parameters

dict_filename: (str) Name of the file of the pronunciation dict
nodump: (bool) If True, neither load from nor create a dump file.

View Source

def __init__(self, dict_filename=None, nodump=False):
    """Create a sppasDictPron instance.

    A dump file is a binary version of the dictionary. Its size is greater
    than the original ASCII dictionary but the time to load is divided
    by two or three.

    :param dict_filename: (str) Name of the file of the pronunciation dict
    :param nodump: (bool) Create or not a dump file.

    """
    self._filename = ''
    self._dict = dict()
    if dict_filename is not None:
        self._filename = dict_filename
        dp = sppasDumpFile(dict_filename)
        data = None
        if nodump is False:
            data = dp.load_from_dump()
        if data is None:
            self.load(dict_filename)
            if nodump is False:
                dp.save_as_dump(self._dict)
        else:
            self._dict = data
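
The load-or-rebuild logic of the constructor can be sketched independently of SPPAS. The example below is an assumption-laden illustration: it uses `pickle` and a hypothetical `.dump` suffix, whereas the real sppasDumpFile may store its data differently and in another location.

```python
import os
import pickle


def load_with_dump(path, parse, nodump=False):
    """Return parse(path), reusing a binary dump of the result when allowed.

    `parse` is any callable that builds the dictionary from the text file.
    """
    dump_path = path + ".dump"  # hypothetical naming scheme
    if not nodump and os.path.exists(dump_path):
        try:
            with open(dump_path, "rb") as fd:
                return pickle.load(fd)
        except Exception:
            pass  # unreadable dump: fall back to the text file
    data = parse(path)
    if not nodump:
        with open(dump_path, "wb") as fd:
            pickle.dump(data, fd)
    return data
```

With `nodump=True` the text file is parsed on every call and no dump is written, which matches the two `nodump is False` tests in the source above.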

Public functions

get_filename

Return the name of the file from which the dict was loaded.

View Source

def get_filename(self):
    """Return the name of the file from which the dict comes from."""
    return self._filename

get_unkstamp

Return the unknown words stamp.

View Source

def get_unkstamp(self):
    """Return the unknown words stamp."""
    return symbols.unk

get

Return the pronunciations of an entry in the dictionary.

Parameters

entry: (str) A token to find in the dictionary
substitution: (str) String to return if the token is missing from the dict

Returns

Unicode string of the pronunciations, or the substitution.

View Source

def get(self, entry, substitution=symbols.unk):
    """Return the pronunciations of an entry in the dictionary.

        :param entry: (str) A token to find in the dictionary
        :param substitution: (str) String to return if token is missing of dict
        :returns: unicode of the pronunciations or the substitution.

        """
    s = sppasDictPron.format_token(entry)
    return self._dict.get(s, substitution)

get_pron

Return the pronunciations of an entry in the dictionary.

Parameters

entry: (str) A token to find in the dictionary

Returns

Unicode string of the pronunciations, or the unknown stamp.

View Source

def get_pron(self, entry):
    """Return the pronunciations of an entry in the dictionary.

        :param entry: (str) A token to find in the dictionary
        :returns: unicode of the pronunciations or the unknown stamp.

        """
    s = sppasDictPron.format_token(entry)
    p = self._dict.get(s, symbols.unk)
    if p is None:
        return symbols.unk
    return p

is_unk

Return True if an entry is unknown (not in the dictionary).

Parameters

entry: (str) A token to find in the dictionary

Returns

bool

View Source

def is_unk(self, entry):
    """Return True if an entry is unknown (not in the dictionary).

        :param entry: (str) A token to find in the dictionary
        :returns: bool

        """
    return sppasDictPron.format_token(entry) not in self._dict

is_pron_of

Return True if pron is a pronunciation of entry.

Phonemes of pron are separated by "-".

Parameters

entry: (str) A unicode token to find in the dictionary
pron: (str) A unicode pronunciation

Returns

bool

View Source

def is_pron_of(self, entry, pron):
    """Return True if pron is a pronunciation of entry.

        Phonemes of pron are separated by "-".

        :param entry: (str) A unicode token to find in the dictionary
        :param pron: (str) A unicode pronunciation
        :returns: bool

        """
    s = sppasDictPron.format_token(entry)
    if s in self._dict:
        p = sppasUnicode(pron).to_strip()
        return p in self._dict[s].split(separators.variants)
    return False

format_token

Remove CR/LF, tabs, multiple spaces and the like, then convert to lowercase.

Parameters

entry: (str) a token

Returns

formatted token

View Source

@staticmethod
def format_token(entry):
    """Remove the CR/LF, tabs, multiple spaces and others... and lowerise.

        :param entry: (str) a token
        :returns: formatted token

        """
    t = sppasUnicode(entry).to_strip()
    return sppasUnicode(t).to_lower()
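
The effect of this normalization can be approximated with the standard library alone. This sketch assumes that `to_strip()` collapses whitespace runs and trims both ends; sppasUnicode may handle additional characters that are not covered here.

```python
def format_token(entry):
    """Approximate the normalization: collapse whitespace, then lowercase."""
    # str.split() with no argument splits on any run of spaces, tabs, CR or LF.
    return " ".join(entry.split()).lower()


print(format_token("  Acted\t\n"))
print(format_token("New   York"))
```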

add_pron

Add a token/pron to the dict.

Parameters

token: (str) Unicode string of the token to add
pron: (str) A pronunciation in which the phonemes are separated by whitespace

View Source

def add_pron(self, token, pron):
    """Add a token/pron to the dict.

        :param token: (str) Unicode string of the token to add
        :param pron: (str) A pronunciation in which the phonemes are separated by whitespace

        """
    entry = sppasDictPron.format_token(token)
    new_pron = sppasUnicode(pron).to_strip()
    new_pron = new_pron.replace(' ', separators.phonemes)
    cur_pron = ''
    if entry in self._dict:
        if self.is_pron_of(entry, new_pron) is False:
            cur_pron = self.get_pron(entry) + separators.variants
        else:
            cur_pron = self.get_pron(entry)
            new_pron = ''
    new_pron = cur_pron + new_pron
    self._dict[entry] = new_pron
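
The merging behaviour (a new variant is appended, a known variant is ignored) can be reproduced on a plain dict. This is a sketch under the assumption that the separators are `-` and `|` as in the class description; the function below is hypothetical and only mimics the method.

```python
def add_pron(store, token, pron):
    """Add a whitespace-separated pronunciation to a plain dict of variants."""
    entry = " ".join(token.split()).lower()
    new_pron = "-".join(pron.split())
    variants = store[entry].split("|") if entry in store else []
    # Append only if this exact variant is not already known.
    if new_pron not in variants:
        variants.append(new_pron)
    store[entry] = "|".join(variants)


d = {}
add_pron(d, "acted", "{ k t e d")
add_pron(d, "Acted", "{ k t i d")   # same entry: tokens are lowercased
add_pron(d, "acted", "{ k t e d")   # duplicate variant: ignored
print(d)
```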

map_phones

Create a new dictionary by changing the phoneme strings.

Perform changes depending on a mapping table.

Parameters

map_table: (Mapping) A mapping table

Returns

a sppasDictPron instance with mapped phones

View Source

def map_phones(self, map_table):
    """Create a new dictionary by changing the phoneme strings.

        Perform changes depending on a mapping table.

        :param map_table: (Mapping) A mapping table
        :returns: a sppasDictPron instance with mapped phones

        """
    map_table.set_reverse(True)
    delimiters = [separators.variants, separators.phonemes]
    new_dict = sppasDictPron()
    for key, value in self._dict.items():
        new_dict._dict[key] = map_table.map(value, delimiters)
    return new_dict

load

Load a pronunciation dictionary.

Parameters

filename: (str) Pronunciation dictionary file name

View Source

def load(self, filename):
    """Load a pronunciation dictionary.

        :param filename: (str) Pronunciation dictionary file name

        """
    try:
        with codecs.open(filename, 'r', sg.__encoding__) as fd:
            self._filename = filename
            first_line = fd.readline()
    except IOError:
        raise FileIOError(filename)
    except UnicodeDecodeError:
        raise FileUnicodeError(filename)
    if first_line.startswith('<?xml'):
        self.load_from_pls(filename)
    else:
        self.load_from_ascii(filename)
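
The format detection relies only on the first line of the file. A minimal standalone sketch, assuming a UTF-8 encoding (the value of `sg.__encoding__` is not shown on this page) and using a hypothetical `detect_format` helper:

```python
import io


def detect_format(path, encoding="utf-8"):
    """Return "pls" if the file starts with an XML declaration, else "ascii"."""
    with io.open(path, "r", encoding=encoding) as fd:
        first_line = fd.readline()
    return "pls" if first_line.startswith("<?xml") else "ascii"
```

A PLS file without an XML declaration would be treated as HTK-ASCII by this test, as it would by the method above.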

load_from_ascii

Load a pronunciation dictionary from an HTK-ASCII file.

Parameters

filename: (str) Pronunciation dictionary file name

View Source

def load_from_ascii(self, filename):
    """Load a pronunciation dictionary from an HTK-ASCII file.

        :param filename: (str) Pronunciation dictionary file name

        """
    try:
        with codecs.open(filename, 'r', sg.__encoding__) as fd:
            lines = fd.readlines()
    except Exception:
        raise FileIOError(filename)
    for l, line in enumerate(lines):
        uline = sppasUnicode(line).to_strip()
        if len(uline) == 0:
            continue
        if len(uline) == 1:
            raise FileFormatError(l, uline)
        i = uline.find('[')
        if i == -1:
            i = uline.find(' ')
        entry = uline[:i]
        endline = uline[i:]
        j = endline.find(']')
        if j == -1:
            j = endline.find(' ')
        new_pron = endline[j + 1:]
        i = entry.find('(')
        if i > -1:
            if ')' in entry[i:]:
                entry = entry[:i]
        self.add_pron(entry, new_pron)

save_as_ascii

Save the pronunciation dictionary in HTK-ASCII format.

Parameters

filename: (str) Dictionary file name
with_variant_nb: (bool) Write the variant number or not
with_filled_brackets: (bool) Fill the bracket with the token

View Source

def save_as_ascii(self, filename, with_variant_nb=True, with_filled_brackets=True):
    """Save the pronunciation dictionary in HTK-ASCII format.

        :param filename: (str) Dictionary file name
        :param with_variant_nb: (bool) Write the variant number or not
        :param with_filled_brackets: (bool) Fill the bracket with the token

        """
    try:
        with codecs.open(filename, 'w', encoding=sg.__encoding__) as output:
            for entry, value in sorted(self._dict.items(), key=lambda x: x[0]):
                variants = value.split(separators.variants)
                for i, variant in enumerate(variants, 1):
                    variant = variant.replace(separators.phonemes, ' ')
                    brackets = entry
                    if with_filled_brackets is False:
                        brackets = ''
                    if i > 1 and with_variant_nb is True:
                        line = '{:s}({:d}) [{:s}] {:s}\n'.format(entry, i, brackets, variant)
                    else:
                        line = '{:s} [{:s}] {:s}\n'.format(entry, brackets, variant)
                    output.write(line)
    except Exception as e:
        logging.info('Saving the dictionary in ASCII failed: {:s}'.format(str(e)))
        return False
    return True
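
The effect of the two options on the written lines can be seen with a small formatter. This sketch only reproduces the line layout for one entry; `htk_lines` is a hypothetical helper, and the `-` and `|` separators are assumed from the class description.

```python
def htk_lines(entry, value, with_variant_nb=True, with_filled_brackets=True):
    """Return the HTK-ASCII lines for one entry and its variants string."""
    brackets = entry if with_filled_brackets else ""
    lines = []
    for i, variant in enumerate(value.split("|"), 1):
        phones = variant.replace("-", " ")
        if i > 1 and with_variant_nb:
            # Variant numbers start at 2; the first variant has none.
            lines.append("{}({}) [{}] {}".format(entry, i, brackets, phones))
        else:
            lines.append("{} [{}] {}".format(entry, brackets, phones))
    return lines


for line in htk_lines("acted", "{-k-t-e-d|{-k-t-i-d"):
    print(line)
```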

load_from_pls

Load a pronunciation dictionary from a pls file (xml).

xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"

Parameters

filename: (str) Pronunciation dictionary file name

View Source

def load_from_pls(self, filename):
    """Load a pronunciation dictionary from a pls file (xml).

        xmlns="http://www.w3.org/2005/01/pronunciation-lexicon

        :param filename: (str) Pronunciation dictionary file name

        """
    try:
        tree = ET.parse(filename)
        root = tree.getroot()
        try:
            uri = root.tag[:root.tag.index('}') + 1]
        except ValueError:
            uri = ''
    except Exception as e:
        logging.info('{:s}: {:s}'.format(str(FileIOError(filename)), str(e)))
        raise FileIOError(filename)
    conversion = dict()
    alphabet = root.attrib['alphabet']
    if alphabet == 'ipa':
        conversion = sppasDictPron.load_sampa_ipa()
    for lexeme_root in tree.iter(tag=uri + 'lexeme'):
        grapheme_root = lexeme_root.find(uri + 'grapheme')
        if grapheme_root.text is None:
            continue
        grapheme = grapheme_root.text
        for phoneme_root in lexeme_root.findall(uri + 'phoneme'):
            if phoneme_root.text is None:
                continue
            phoneme = sppasUnicode(phoneme_root.text).to_strip()
            if len(phoneme) == 0:
                continue
            if alphabet == 'ipa':
                phoneme = sppasDictPron.ipa_to_sampa(conversion, phoneme)
            self.add_pron(grapheme, phoneme)
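
The namespace handling and the lexeme/grapheme/phoneme traversal can be tried on an in-memory document. The sketch below is a simplified illustration: the sample lexicon is invented for the example, `read_pls` is hypothetical, and no IPA conversion is performed.

```python
import xml.etree.ElementTree as ET

# Invented sample lexicon, for illustration only.
PLS = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="x-sampa" xml:lang="en">
  <lexeme>
    <grapheme>acted</grapheme>
    <phoneme>{ k t e d</phoneme>
    <phoneme>{ k t i d</phoneme>
  </lexeme>
</lexicon>
"""


def read_pls(xml_text):
    """Return {grapheme: [pronunciation, ...]} from a PLS document."""
    root = ET.fromstring(xml_text)
    # ElementTree prefixes tags with "{namespace}"; reuse that prefix.
    uri = root.tag[:root.tag.index("}") + 1] if "}" in root.tag else ""
    entries = {}
    for lexeme in root.iter(uri + "lexeme"):
        grapheme = lexeme.find(uri + "grapheme")
        if grapheme is None or grapheme.text is None:
            continue
        prons = [p.text.strip() for p in lexeme.findall(uri + "phoneme")
                 if p.text and p.text.strip()]
        entries[grapheme.text] = prons
    return entries


print(read_pls(PLS))
```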

load_sampa_ipa

Load the sampa-ipa conversion file.

Return it as a dict().

View Source

@staticmethod
def load_sampa_ipa():
    """Load the sampa-ipa conversion file.

        Return it as a dict().

        """
    conversion = dict()
    ipa_sampa_mapfile = os.path.join(paths.resources, 'dict', 'sampa-ipa.txt')
    with codecs.open(ipa_sampa_mapfile, 'r', 'utf-8') as f:
        for line in f.readlines():
            tab_line = line.split()
            if len(tab_line) > 1:
                conversion[tab_line[1].strip()] = tab_line[0].strip()
    return conversion

ipa_to_sampa

Convert a string in IPA to SAMPA.

Parameters

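conversion: (dict) Mapping from IPA symbols to SAMPA symbols, as returned by load_sampa_ipa
ipa_entry: (str) A pronunciation written with IPA symbols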

View Source

@staticmethod
def ipa_to_sampa(conversion, ipa_entry):
    """Convert a string in IPA to SAMPA.

        :param conversion: (dict)
        :param ipa_entry: (str)

        """
    sampa = list()
    for p in ipa_entry:
        sampa_p = conversion.get(p, '_')
        if sampa_p != '_':
            if len(sampa) > 0 and (sampa_p == ':' or sampa_p.startswith('_')):
                sampa[-1] = sampa[-1] + sampa_p
            else:
                sampa.append(sampa_p)
    return separators.phonemes.join(sampa)
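
The conversion attaches length marks and `_`-prefixed diacritics to the preceding phone instead of emitting them as separate phones. The standalone sketch below shows this with a three-symbol table that is an illustrative assumption, not the content of sampa-ipa.txt. It parenthesizes the modifier test so that a modifier at the start of the string is appended rather than attached to a missing predecessor.

```python
def ipa_to_sampa(conversion, ipa_entry, sep="-"):
    """Convert an IPA string to SAMPA phones joined by `sep`."""
    sampa = []
    for char in ipa_entry:
        sampa_p = conversion.get(char, "_")
        if sampa_p == "_":
            continue  # unknown characters are skipped
        is_modifier = sampa_p == ":" or sampa_p.startswith("_")
        if sampa and is_modifier:
            sampa[-1] += sampa_p  # attach to the previous phone
        else:
            sampa.append(sampa_p)
    return sep.join(sampa)


# Tiny illustrative table: U+0251 (open back vowel), U+02D0 (length mark).
table = {"t": "t", "\u0251": "A", "\u02d0": ":"}
print(ipa_to_sampa(table, "t\u0251\u02d0"))
```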

Overloads

__len__

View Source

def __len__(self):
    return len(self._dict)

__contains__

View Source

def __contains__(self, item):
    s = sppasDictPron.format_token(item)
    return s in self._dict

__iter__

View Source

def __iter__(self):
    for a in self._dict:
        yield a
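
These three overloads make the dictionary usable with `len()`, the `in` operator and `for` loops. The sketch below uses a hypothetical stand-in class with the same overloads, so that it runs without SPPAS installed; its `add` method is a simplification of add_pron.

```python
class MiniPronDict:
    """Hypothetical stand-in exposing the same three overloads."""

    def __init__(self):
        self._dict = {}

    def add(self, token, pron):
        self._dict[token.strip().lower()] = pron

    def __len__(self):
        return len(self._dict)

    def __contains__(self, item):
        # Lookup is case-insensitive, as in sppasDictPron.
        return item.strip().lower() in self._dict

    def __iter__(self):
        return iter(self._dict)


d = MiniPronDict()
d.add("acted", "{-k-t-e-d")
d.add("the", "D-@")
print(len(d), "ACTED" in d, sorted(d))
```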