calculus package

Subpackages

Submodules

calculus.calculusexc module

filename

sppas.src.calculus.calculusexc.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Exceptions for the package calculus.

exception calculus.calculusexc.EmptyError[source]

Bases: Exception

:ERROR 3030:.

The given data must be defined or must not be empty.

__init__()[source]
exception calculus.calculusexc.EuclidianDistanceError[source]

Bases: ValueError

:ERROR 3025:.

Error while estimating Euclidian distances of rows and columns.

__init__()[source]
exception calculus.calculusexc.InsideIntervalError(value, min_value, max_value)[source]

Bases: ValueError

:ERROR 3040:.

Value {value} is out of range: expected value in range [{min_value},{max_value}].

__init__(value, min_value, max_value)[source]
exception calculus.calculusexc.ProbabilityError(value=None)[source]

Bases: Exception

:ERROR 3015:.

Value must range between 0 and 1. Got {:f}.

__init__(value=None)[source]
exception calculus.calculusexc.SumProbabilityError(value=None)[source]

Bases: Exception

:ERROR 3016:.

Probabilities must sum to 1. Got {:f}.

__init__(value=None)[source]
exception calculus.calculusexc.VectorsError[source]

Bases: Exception

:ERROR 3010:.

Both vectors p and q must have the same length and must contain probabilities.

__init__()[source]

Module contents

filename

sppas.src.calculus.__init__.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Package for the calculus of SPPAS.

calculus: proposes some math on data.

This package includes mathematical functions for descriptive statistics, for scoring, and for information theory.

No other package is required. This package is compatible with all versions of Python from 2.7 to 3.9.

calculus.chi_squared(x, y)[source]

Estimate the Chi-squared distance between two tuples.

Parameters
  • x – a tuple of float values

  • y – a tuple of float values

Returns

(float)

x and y must have the same length.

>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> round(chi_squared(x, y), 3)
1.414
calculus.compute_error_for_line_given_points(b, m, points)[source]

Error function (also called a cost function).

It measures how “good” a given line is.

This function will take in a (m,b) pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each (x,y) point in our data set and sum the square distances between each point’s y value and the candidate line’s y value (computed at mx + b).

Lines that fit our data better (where better is defined by our error function) will result in lower error values.
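The description above can be sketched as follows. This is only an illustration: the name line_error is hypothetical, and the sketch returns the plain sum of squared differences as described, whereas the library function may additionally normalize by the number of points.

```python
def line_error(b, m, points):
    """Sum of squared vertical distances between the points and y = m*x + b."""
    total = 0.0
    for x, y in points:
        # Difference between the observed y and the line's y at the same x.
        total += (y - (m * x + b)) ** 2
    return total


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points lying on y = 2x + 1
exact_fit = line_error(1.0, 2.0, points)       # the true line: no error
shifted_fit = line_error(0.0, 2.0, points)     # wrong intercept: larger error
print(exact_fit, shifted_fit)
```

A line with the right slope but the wrong intercept misses every point by the same amount, so its error grows with the number of points.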

calculus.euclidian(x, y)[source]

Estimate the Euclidean distance between two tuples.

Parameters
  • x – a tuple of float values

  • y – a tuple of float values

Returns

(float)

x and y must have the same length.

>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> round(euclidian(x, y), 3)
1.414
calculus.fgeometricmean(items)[source]

Calculate the geometric mean of the data values.

n-th root of (x1 * x2 * … * xn).

Parameters

items – (list) list of data values

Returns

(float)
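As an illustration of the formula above, here is a minimal sketch. The name geometric_mean is hypothetical, and the use of logarithms (which requires strictly positive values) is a choice made for this sketch, not necessarily the way fgeometricmean is implemented.

```python
import math


def geometric_mean(items):
    """n-th root of the product, computed through logarithms to avoid overflow."""
    if not items:
        raise ValueError("items must not be empty")
    # exp(mean(log(x))) equals (x1 * x2 * ... * xn) ** (1/n) for positive x.
    return math.exp(sum(math.log(x) for x in items) / len(items))


print(geometric_mean([2.0, 8.0]))
```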

calculus.fharmonicmean(items)[source]

Calculate the harmonic mean of the data values.

n / (1/x1 + 1/x2 + … + 1/xn)

Parameters

items – (list) list of data values

Returns

(float)
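The harmonic mean formula can be written directly in a few lines. This is a sketch with a hypothetical name (harmonic_mean); it assumes a non-empty list without zero values.

```python
def harmonic_mean(items):
    """n divided by the sum of the reciprocals of the values."""
    return len(items) / sum(1.0 / x for x in items)


print(harmonic_mean([1.0, 2.0, 4.0]))
```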

calculus.find_ngrams(symbols, ngram)[source]

Return a list of n-grams from a list of symbols.

Parameters
  • symbols – (list)

  • ngram – (int) n value for the ngrams

Returns

list of tuples

Example:

>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(find_ngrams(symbols, 2))
[(0, 1), (1, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
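One common way to build such a list is to zip the sequence with shifted copies of itself. The sketch below uses a hypothetical name (ngrams) and is not necessarily how find_ngrams is written, but it reproduces the documented result.

```python
def ngrams(symbols, n):
    """Return the list of n-grams (as tuples) of a list of symbols."""
    # symbols[i:] is the sequence shifted by i; zip stops at the shortest one.
    return list(zip(*(symbols[i:] for i in range(n))))


print(ngrams([0, 1, 0, 1, 1, 1, 0], 2))
```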

calculus.fmax(items)[source]

Return the maximum of the data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.fmean(items)[source]

Calculate the arithmetic mean of the data values.

sum(items)/len(items)

Parameters

items – (list) list of data values

Returns

(float)

calculus.fmedian(items)[source]

Calculate the ‘middle’ score of the data values.

If there is an even number of scores, the mean of the 2 middle scores is returned.

Parameters

items – (list) list of data values

Returns

(float)
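The even/odd rule described above can be sketched as follows. The function name median is hypothetical and the sketch assumes a non-empty list.

```python
def median(items):
    """Middle value of the sorted data; mean of the 2 middle values if even."""
    ordered = sorted(items)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    # Even number of scores: average the two central ones.
    return (ordered[mid - 1] + ordered[mid]) / 2.0


print(median([3, 1, 2]), median([4, 1, 3, 2]))
```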

calculus.fmin(items)[source]

Return the minimum of the data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.fmult(items)[source]

Estimate the product of a list of data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.freq(mylist, item)[source]

Return the relative frequency of an item of a list.

Parameters
  • mylist – (list) list of elements

  • item – (any) an element of the list (or not!)

Returns

frequency (float) of item in mylist

calculus.fsum(items)[source]

Estimate the sum of a list of data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.gradient_descent(points, starting_b, starting_m, learning_rate, num_iterations)[source]

Gradient descent is an algorithm that minimizes functions.

Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.

Parameters
  • points – a list of tuples (x,y) of float values.

  • starting_b – (float)

  • starting_m – (float)

  • learning_rate – (float)

  • num_iterations – (int)

Returns

intercept, slope
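To make the iteration concrete, here is a minimal sketch of gradient descent applied to a line fit. The names step_gradient and descend are hypothetical, and the cost is assumed to be the mean squared error; the library function may differ in these details.

```python
def step_gradient(b, m, points, learning_rate):
    """One update of (b, m) against the gradient of the mean squared error."""
    n = float(len(points))
    grad_b = 0.0
    grad_m = 0.0
    for x, y in points:
        residual = y - (m * x + b)
        grad_b += -(2.0 / n) * residual       # partial derivative w.r.t. b
        grad_m += -(2.0 / n) * x * residual   # partial derivative w.r.t. m
    return b - learning_rate * grad_b, m - learning_rate * grad_m


def descend(points, starting_b, starting_m, learning_rate, num_iterations):
    """Repeat the update and return (intercept, slope)."""
    b, m = starting_b, starting_m
    for _ in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return b, m


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # y = 2x + 1
intercept, slope = descend(points, 0.0, 0.0, 0.05, 5000)
print(intercept, slope)
```

The learning rate must be small enough for the updates to converge; too large a value makes the parameters oscillate or diverge.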

calculus.gradient_descent_linear_regression(points, num_iterations=50000)[source]

Gradient descent method for linear regression.

adapted from: http://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/

Parameters
  • points – a list of tuples (x,y) of float values.

  • num_iterations – (int)

Returns

intercept, slope

calculus.intercept(p1, p2)[source]

Estimate the intercept between 2 points.

Parameters
  • p1 – (tuple) first point as (x1, y1)

  • p2 – (tuple) second point as (x2, y2)

Returns

float value

calculus.linear_fct(x, a, b)[source]

Return f(x) of the linear function f(x) = ax + b.

Parameters
  • x – (float) X-coord

  • a – (float) slope

  • b – (float) intercept

calculus.linear_values(delta, p1, p2, rounded=6)[source]

Estimate the values between 2 points, step-by-step.

Two different points p1=(x1,y1) and p2=(x2,y2) determine a line. It is enough to substitute two different values for ‘x’ in the linear function and determine ‘y’ for each of these values.

a = (y2 − y1) / (x2 − x1) <= slope

b = y1 − a * x1 <= intercept

Values for p1 and p2 are added into the result.

Parameters
  • delta – (float) Step range between values.

  • p1 – (tuple) first point as (x1, y1)

  • p2 – (tuple) second point as (x2, y2)

  • rounded – (int) round floats

Returns

list of float values, i.e. all the y, including the ones of p1 and p2

Raises

MemoryError could be raised if too many values have to be returned.
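The computation can be sketched as below. This is an assumption-laden illustration: the name linear_values_sketch is hypothetical, delta is interpreted as a step along the x axis, and the interval between x1 and x2 is assumed to be a multiple of delta. The library function may handle these points differently.

```python
def linear_values_sketch(delta, p1, p2, rounded=6):
    """Return the y values of the line through p1 and p2, every delta in x."""
    x1, y1 = p1
    x2, y2 = p2
    a = float(y2 - y1) / (x2 - x1)   # slope
    b = y1 - a * x1                  # intercept
    steps = int(round((x2 - x1) / delta))
    # steps + 1 values so that both end points are included.
    return [round(a * (x1 + i * delta) + b, rounded) for i in range(steps + 1)]


print(linear_values_sketch(0.5, (0.0, 0.0), (2.0, 4.0)))
```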

calculus.lkurtosis(items)[source]

Return the kurtosis of a distribution.

The kurtosis represents a measure of the “peakedness”: a high kurtosis distribution has a sharper peak and fatter tails, while a low kurtosis distribution has a more rounded peak and thinner tails.

Parameters

items – (list) list of data values

Returns

(float)

calculus.lmoment(items, moment=1)[source]

Calculate the r-th moment about the mean for a sample.

1/n * SUM((items(i)-mean)**r)

Parameters
  • items – (list) list of data values

  • moment

Returns

(float)
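The formula above translates almost literally into code. The sketch below uses a hypothetical name (central_moment) and assumes a non-empty list.

```python
def central_moment(items, r=1):
    """1/n * SUM((x - mean) ** r) over all data values."""
    n = float(len(items))
    mean = sum(items) / n
    return sum((x - mean) ** r for x in items) / n


# The first central moment is always 0; the second is the population variance.
print(central_moment([1, 2, 3, 4], 1), central_moment([1, 2, 3, 4], 2))
```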

calculus.lskew(items)[source]

Calculate the skewness of a distribution.

The skewness represents a measure of the asymmetry: an understanding of the skewness of the dataset indicates whether deviations from the mean are going to be positive or negative.

Parameters

items – (list) list of data values

Returns

(float)

calculus.lstdev(items)[source]

Calculate the standard deviation of the data values, for a population.

The standard deviation is the positive square root of the variance.

Parameters

items – (list) list of data values

Returns

(float)

calculus.lvariance(items)[source]

Calculate the variance of the data values, for a population.

It means that the estimation is using N for the denominator. The variance is a measure of dispersion near the mean.

Parameters

items – (list) list of data values

Returns

(float)

calculus.lvariation(items)[source]

Calculate the coefficient of variation of data values.

It shows the extent of variability in relation to the mean. It’s a standardized measure of dispersion: stdev / mean and returned as a percentage.

Parameters

items – (list) list of data values

Returns

(float)
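The relation between the population variance, the standard deviation and the coefficient of variation can be sketched as follows. The function names are hypothetical, and the sketch assumes a non-empty list with a non-zero mean.

```python
import math


def population_variance(items):
    """Variance with N (not N - 1) as the denominator."""
    n = float(len(items))
    mean = sum(items) / n
    return sum((x - mean) ** 2 for x in items) / n


def population_stdev(items):
    """Positive square root of the population variance."""
    return math.sqrt(population_variance(items))


def coefficient_of_variation(items):
    """stdev / mean, expressed as a percentage."""
    mean = sum(items) / float(len(items))
    return 100.0 * population_stdev(items) / mean


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data), population_stdev(data), coefficient_of_variation(data))
```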

calculus.lz(items, score)[source]

Calculate the z-score for a given input score, given that score and the data values from which that score came.

The z-score determines the relative location of a data value.

Parameters
  • items – (list) list of data values

  • score – (float) a score of any items

Returns

(float)
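A z-score is the distance of a score from the mean, measured in standard deviations. The sketch below assumes the population standard deviation is used, which may or may not match the library; it relies on the standard statistics module (Python 3.4 or later), and the name z_score is hypothetical.

```python
import statistics


def z_score(items, score):
    """(score - mean) / population standard deviation of the data values."""
    return (score - statistics.mean(items)) / statistics.pstdev(items)


print(z_score([2, 4, 4, 4, 5, 5, 7, 9], 9))
```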

calculus.manathan(x, y)[source]

Estimate the Manhattan distance between two tuples.

Parameters
  • x – a tuple of float values

  • y – a tuple of float values

Returns

(float)

x and y must have the same length.

>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> manathan(x, y)
2.0
calculus.minkowski(x, y, p=2)[source]

Estimate the Minkowski distance between two tuples.

Parameters
  • x – a tuple of float values

  • y – a tuple of float values

  • p – power value (p=2 corresponds to the Euclidean distance)

Returns

(float)

x and y must have the same length.

>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> round(minkowski(x, y), 3)
1.414
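The Minkowski distance generalizes several of the distances of this package. The following sketch, with a hypothetical function name, shows that p=1 gives the Manhattan distance and p=2 the Euclidean distance; it is an illustration of the standard definition, not the library code.

```python
def minkowski_distance(x, y, p=2):
    """(SUM(|xi - yi| ** p)) ** (1/p) for two tuples of the same length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x = (1.0, 0.0)
y = (0.0, 1.0)
print(minkowski_distance(x, y, p=1))            # Manhattan distance
print(round(minkowski_distance(x, y, p=2), 3))  # Euclidean distance
```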
calculus.nPVI(items)[source]

Calculate the Normalized Pairwise Variability Index.

Parameters

items – (list) list of data values

Returns

(float)

calculus.percent(mylist, item)[source]

Return the percentage of an item of a list.

Parameters
  • mylist – (list) list of elements

  • item – (any) an element of the list (or not!)

Returns

percentage (float) of item in mylist

calculus.percentile(mylist, p=(25, 50, 75), sort=True)[source]

Return the pth percentile of an unsorted or sorted numeric list.

This is equivalent to calling quantile(mylist, p/100.0).

>>> round(percentile([15, 20, 40, 35, 50], 40), 2)
26.0
>>> for perc in percentile([15, 20, 40, 35, 50], (0, 25, 50, 75, 100)):
...     print("{:.2f}".format(perc))
...
15.00
17.50
35.00
45.00
50.00
Parameters
  • mylist – (list) list of elements.

  • p – (tuple) the percentile we are looking for.

  • sort – whether to sort the vector.

Returns

percentile as a float

calculus.quantile(mylist, q=(0.25, 0.5, 0.75), sort=True)[source]

Return the qth quantile of an unsorted or sorted numeric list.

Calculates a rank n as q(N+1), where N is the number of items in mylist, then splits n into its integer component k and decimal component d. If k < 1, returns the first element; if k >= N, returns the last element; otherwise returns the linear interpolation between mylist[k-1] and mylist[k] using a factor d.

>>> round(quantile([15, 20, 40, 35, 50], 0.4), 2)
26.0
Parameters
  • mylist – (list) list of elements.

  • q – (tuple) the quantile we are looking for.

  • sort – whether to sort the vector.

Returns

quantile as a float
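The rank-and-interpolate procedure can be sketched for a single quantile as follows. The name quantile_sketch is hypothetical and a tuple of quantiles is not handled. The first element is returned only when k < 1, which is what the documented 25th-percentile result of 17.50 requires.

```python
def quantile_sketch(values, q, sort=True):
    """Return the q-th quantile (0 <= q <= 1) by linear interpolation."""
    xs = sorted(values) if sort else list(values)
    n = q * (len(xs) + 1)   # rank
    k = int(n)              # integer component
    d = n - k               # decimal component
    if k < 1:
        return float(xs[0])
    if k >= len(xs):
        return float(xs[-1])
    # Interpolate between the two neighbouring elements with factor d.
    return (1 - d) * xs[k - 1] + d * xs[k]


data = [15, 20, 40, 35, 50]
print(round(quantile_sketch(data, 0.4), 2))
```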

calculus.rPVI(items)[source]

Calculate the Raw Pairwise Variability Index.

Parameters

items – (list) list of data values

Returns

(float)
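For reference, the usual definitions of the two pairwise variability indices are sketched below: the raw index averages the absolute differences between successive values, and the normalized index divides each difference by the mean of the pair and scales by 100. The function names are hypothetical, and the scaling used by rPVI and nPVI in this package is not confirmed here.

```python
def raw_pvi(items):
    """Mean absolute difference between successive values."""
    pairs = list(zip(items[:-1], items[1:]))
    return sum(abs(a - b) for a, b in pairs) / float(len(pairs))


def normalized_pvi(items):
    """Same as raw_pvi, each difference normalized by the mean of its pair."""
    pairs = list(zip(items[:-1], items[1:]))
    total = sum(abs(a - b) / ((a + b) / 2.0) for a, b in pairs)
    return 100.0 * total / len(pairs)


print(raw_pvi([1.0, 3.0, 2.0]), normalized_pvi([1.0, 3.0]))
```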

calculus.slope(p1, p2)[source]

Estimate the slope between 2 points.

Parameters
  • p1 – (tuple) first point as (x1, y1)

  • p2 – (tuple) second point as (x2, y2)

Returns

float value

calculus.slope_intercept(p1, p2)[source]

Return the slope and the intercept.

Parameters
  • p1 – (tuple) first point as (x1, y1)

  • p2 – (tuple) second point as (x2, y2)

Returns

tuple(slope,intercept)

class calculus.sppasDescriptiveStatistics(dict_items)[source]

Bases: object

Descriptive statistics estimator class.

This class estimates descriptive statistics on a set of data values, stored in a dictionary:

  • the key is the name of the data set;

  • the value is the list of data values for this data set.

>>> d = {'apples':[1, 2, 3, 4], 'peers':[2, 3, 3, 5]}
>>> s = sppasDescriptiveStatistics(d)
>>> total = s.total()
>>> print(total)
(('peers', 13.0), ('apples', 10.0))
__init__(dict_items)[source]

Descriptive statistics.

Parameters

dict_items – a dict of tuples (key, [values])

coefvariation()[source]

Estimate the coefficient of variation of data values.

Returns

(dict) a dictionary of (key, coefvariation) of float values (given as a percentage).

len()[source]

Estimate the number of occurrences of data values.

Returns

(dict) a dictionary of tuples (key, len)

max()[source]

Estimate the maximum of data values.

Returns

(dict) a dictionary of (key, max) of float values

mean()[source]

Estimate the arithmetic mean of data values.

Returns

(dict) a dictionary of (key, mean) of float values

median()[source]

Estimate the ‘middle’ score of the data values.

Returns

(dict) a dictionary of (key, median) of float values

min()[source]

Estimate the minimum of data values.

Returns

(dict) a dictionary of (key, min) of float values

stdev()[source]

Estimate the standard deviation of data values.

Returns

(dict) a dictionary of (key, stddev) of float values

total()[source]

Estimate the sum of data values.

Returns

(dict) a dictionary of tuples (key, total) of float values

variance()[source]

Estimate the unbiased sample variance of data values.

Returns

(dict) a dictionary of (key, variance) of float values

zscore()[source]

Estimate the z-scores of data values.

The z-score determines the relative location of a data value.

Returns

(dict) a dictionary of (key, [z-scores]) of float values

class calculus.sppasEntropy(symbols, n=1)[source]

Bases: object

Entropy estimation.

Entropy is a measure of unpredictability of information content. Entropy is one of several ways to measure diversity.

If we want to look at the entropy on a large series, we could also compute the entropy for windows to measure the evenness or uncertainties. By looking at the definition, one could predict the areas that have a lot of variance would result in a higher entropy and the areas that have lower variance would result in lower entropy.

__init__(symbols, n=1)[source]

Create a sppasEntropy instance with a list of symbols.

Parameters
  • symbols – (list) a vector of symbols of any type.

  • n – (int) n value for n-gram estimation. n ranges 1..MAX_NGRAM

eval()[source]

Estimate the Shannon entropy of a vector of symbols.

Shannon’s entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable).

Returns

(float) entropy value
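Shannon entropy over n-grams can be sketched as follows. The name shannon_entropy is hypothetical, and the base-2 logarithm (entropy in bits) is an assumption of this sketch; the class may use a different base.

```python
import math
from collections import Counter


def shannon_entropy(symbols, n=1):
    """-SUM(p * log2(p)) over the relative frequencies of the n-grams."""
    grams = list(zip(*(symbols[i:] for i in range(n))))
    counts = Counter(grams)
    total = float(len(grams))
    return -sum((c / total) * math.log(c / total, 2) for c in counts.values())


print(shannon_entropy([0, 1, 0, 1]))  # two equiprobable symbols
print(shannon_entropy([1, 1, 1, 1]))  # a fully predictable sequence
```

A sequence with a single repeated symbol carries no uncertainty, while two equally likely symbols need one bit each.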

set_ngram(n)[source]

Set the n value of n-grams.

Parameters

n – (int) n value for n-gram estimation. n ranges 1..8

set_symbols(symbols)[source]

Set the list of symbols.

Parameters

symbols – (list) a vector of symbols of any type.

class calculus.sppasKappa(p=[], q=[])[source]

Bases: object

Inter-observer variation estimation.

The calculation is based on the difference between how much agreement is actually present (“observed” agreement) compared to how much agreement would be expected to be present by chance alone (“expected” agreement).

Imagine a situation in which annotators have to answer Yes or No to 5 questions.

  • Person “P” answered: Yes, No, No, Yes, Yes

  • Person “Q” answered: Yes, No, Yes, Yes, Yes

This results in the following vectors of probabilities:

>>> p = [(1., 0.), (0., 1.), (0., 1.), (1., 0.), (1., 0.)]
>>> q = [(1., 0.), (0., 1.), (1., 0.), (1., 0.), (1., 0.)]

Cohen’s Kappa is then evaluated as follows:

>>> sppasKappa.check_vector(p)
True
>>> sppasKappa.check_vector(q)
True
>>> kappa = sppasKappa(p, q)
>>> kappa.evaluate()
0.54545
__init__(p=[], q=[])[source]

Create a sppasKappa instance with two lists of tuples p and q.

>>> p=[(1., 0.), (1., 0.), (0.8, 0.2)]
Parameters
  • p – a vector of tuples of float values

  • q – a vector of tuples of float values

check()[source]

Check if the given p and q vectors are correct to be used.

Returns

bool

static check_vector(v)[source]

Check if the vector is correct to be used.

Parameters

v – a vector of tuples of probabilities.

evaluate()[source]

Estimate the Cohen’s Kappa between two lists of tuples p and q.

The tuple size corresponds to the number of categories, each value is the score assigned to each category for a given sample.

Returns

float value
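The observed/expected computation can be sketched for vectors of probabilities as below. The name cohen_kappa is hypothetical and this is one straightforward formulation, not necessarily the one used by evaluate(); it does reproduce the value of the Yes/No example given for this class.

```python
def cohen_kappa(p, q):
    """(observed - expected) / (1 - expected) for two lists of tuples."""
    n = float(len(p))
    categories = range(len(p[0]))
    # Observed agreement: average, over samples, of the joint score per category.
    observed = sum(sum(pi[c] * qi[c] for c in categories)
                   for pi, qi in zip(p, q)) / n
    # Expected agreement: product of the marginal proportions of each category.
    expected = sum((sum(pi[c] for pi in p) / n) * (sum(qi[c] for qi in q) / n)
                   for c in categories)
    return (observed - expected) / (1.0 - expected)


p = [(1., 0.), (0., 1.), (0., 1.), (1., 0.), (1., 0.)]
q = [(1., 0.), (0., 1.), (1., 0.), (1., 0.), (1., 0.)]
print(round(cohen_kappa(p, q), 5))
```

In this example the observed agreement is 4/5 and the expected agreement is 0.6 * 0.8 + 0.4 * 0.2 = 0.56, hence (0.8 - 0.56) / (1 - 0.56).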

set_vectors(p, q)[source]

Set the vectors of probabilities to estimate the sppasKappa value.

Parameters
  • p – a vector of tuples of float values

  • q – a vector of tuples of float values

sqm()[source]

Estimate the Euclidean distance between two vectors.

Returns

row, col

sqv()[source]

Estimate the Euclidean distance between two vectors.

Returns

v

class calculus.sppasKullbackLeibler(model=None, observations=None)[source]

Bases: object

Kullback-Leibler distance estimator.

In probability theory and information theory, the Kullback–Leibler divergence (also called relative entropy) is a measure of the difference between two probability distributions P and Q. It is not symmetric in P and Q.

Specifically, the Kullback–Leibler divergence of Q from P, denoted DKL(P‖Q), is a measure of the information gained when one revises one’s beliefs from the prior probability distribution Q to the posterior probability distribution P.

However, the sppasKullbackLeibler class estimates the KL distance, i.e. the symmetric Kullback-Leibler divergence.

This sppasKullbackLeibler class implements the distance estimation between a model and the content of a moving window on data, as described in:

Brigitte Bigi, Renato De Mori, Marc El-Bèze, Thierry Spriet (1997). Combined models for topic spotting and topic-dependent language modeling. IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (ASRU), Edited by S. Furui, B. H. Huang and Wu Chu, IEEE Signal Processing Society Publ, NY, pages 535-542.

This KL distance can also be used to estimate the distance between documents for text categorization, as proposed in:

Brigitte Bigi (2003). Using Kullback-Leibler Distance for Text Categorization. Lecture Notes in Computer Science, Advances in Information Retrieval, ISSN 0302-9743, Fabrizio Sebastiani (Editor), Springer-Verlag (Publisher), pages 305–319, Pisa (Italy).

In this class…

A model is a dictionary with:

  • key is an n-gram,

  • value is a probability.

The window of observed symbols is represented as a list of n-grams.

DEFAULT_EPSILON = 1e-06
__init__(model=None, observations=None)[source]

Create a sppasKullbackLeibler instance from a list of symbols.

Parameters
  • model – (dict) a dictionary with key=item, value=probability

  • observations – list of observed items

eval_kld()[source]

Estimate the KL distance between a model and observations.

Returns

float value
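A symmetric Kullback-Leibler distance between a model and a window of observations can be sketched as below. Several points are assumptions of this sketch: the name kl_distance is hypothetical, the probability of an unseen item is simply replaced by epsilon, and no renormalization is applied after this back-off, whereas the class may normalize the distributions.

```python
import math
from collections import Counter


def kl_distance(model, observations, epsilon=1e-06):
    """SUM((p - q) * log(p / q)) over items of the model and the observations."""
    counts = Counter(observations)
    total = float(len(observations))
    distance = 0.0
    for item in set(model) | set(counts):
        p = model.get(item, epsilon)                           # model probability
        q = counts[item] / total if item in counts else epsilon  # observed frequency
        distance += (p - q) * math.log(p / q)
    return distance


model = {"a": 0.5, "b": 0.5}
print(kl_distance(model, ["a", "b"]))            # same distribution
print(kl_distance(model, ["a", "a", "a", "c"]))  # different distribution
```

Each term (p - q) * log(p / q) is non-negative, so the distance is zero only when the two distributions coincide.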

get_epsilon()[source]

Return the epsilon value.

get_model()[source]

Return the model.

set_epsilon(eps)[source]

Fix the linear back-off value for unknown observations.

The optimal value for this coefficient is the product of the size of both model and observations to estimate the KL. This value must be significantly lower than the minimum value in the model.

Parameters

eps – (float) Epsilon value.

If eps is set to 0, a default value will be assigned.

set_model(model)[source]

Set the model.

Parameters

model – (dict) Probability distribution of the model.

set_model_from_data(data)[source]

Set the model from a given set of observations.

Parameters

data – (list) List of observed items.

set_observations(observations)[source]

Fix the set of observed items.

Parameters

observations – (list) The list of observed items.

calculus.squared_euclidian(x, y)[source]

Estimate the Squared Euclidean distance between two tuples.

Parameters
  • x – a tuple of float values

  • y – a tuple of float values

Returns

(float)

x and y must have the same length.

>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> squared_euclidian(x, y)
2.0
calculus.symbols_to_items(symbols, ngram)[source]

Convert a list of symbols into a dictionary of items.

Example:

>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(symbols_to_items(symbols, 2))
{(0, 1): 2, (1, 0): 2, (1, 1): 2}

Returns

dictionary with key=tuple of symbols, value=number of occurrences
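Counting n-grams can be sketched with the standard collections.Counter class. The name count_ngrams is hypothetical and the sketch is not necessarily how symbols_to_items is written, but it yields the documented dictionary.

```python
from collections import Counter


def count_ngrams(symbols, n):
    """Return a dict with key=n-gram (tuple), value=number of occurrences."""
    grams = zip(*(symbols[i:] for i in range(n)))
    return dict(Counter(grams))


print(count_ngrams([0, 1, 0, 1, 1, 1, 0], 2))
```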

calculus.tansey_linear_regression(points)[source]

Linear regression, as proposed in AnnotationPro.

http://annotationpro.org/

Translated from C# code from here: https://gist.github.com/tansey/1375526

Parameters

points – a list of tuples (x,y) of float values.

Returns

intercept, slope
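For comparison with the iterative method, the closed-form ordinary least squares solution is sketched below. The name least_squares is hypothetical, and this is the textbook formula rather than a transcription of the code referenced above; it returns the values in the same (intercept, slope) order.

```python
def least_squares(points):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = float(len(points))
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # y = 2x + 1
print(least_squares(points))
```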

calculus.tga_linear_regression(points)[source]

Linear regression as proposed in TGA, by Dafydd Gibbon.

http://wwwhomes.uni-bielefeld.de/gibbon/TGA/

Parameters

points – a list of tuples (x,y) of float values.

Returns

intercept, slope

calculus.ubpa(vector, text, fp=sys.stdout, delta_max=0.04, step=0.01)[source]

Estimate the Unit Boundary Positioning Accuracy.

Parameters
  • vector – contains the list of the delta values.

  • text – one of “Duration”, “Position Start”, …

  • fp – a file pointer

  • delta_max – Maximum delta duration to print result (default: 40ms)

  • step – Delta time (default: 10ms)

Returns

(tab_neg, tab_pos) with number of occurrences of each position

calculus.ylinear_fct(y, a, b)[source]

Return x of the linear function y = ax + b.

Parameters
  • y – (float) Y-coord

  • a – (float) slope

  • b – (float) intercept
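Solving y = ax + b for x gives x = (y - b) / a. The short sketch below, with hypothetical function names, checks that this inverts the linear function; it assumes a non-zero slope.

```python
def linear(x, a, b):
    """f(x) = ax + b."""
    return a * x + b


def inverse_linear(y, a, b):
    """x such that y = ax + b (a must not be 0)."""
    return (y - b) / float(a)


y = linear(3.0, 2.0, 1.0)
print(y, inverse_linear(y, 2.0, 1.0))
```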