calculus package¶
Subpackages¶
Submodules¶
calculus.calculusexc module¶
- filename
sppas.src.calculus.calculusexc.py
- author
Brigitte Bigi
- contact
- summary
Exceptions for the package calculus.
- exception calculus.calculusexc.EmptyError[source]¶
Bases:
Exception
:ERROR 3030:.
The given data must be defined or must not be empty.
- exception calculus.calculusexc.EuclidianDistanceError[source]¶
Bases:
ValueError
:ERROR 3025:.
Error while estimating Euclidian distances of rows and columns.
- exception calculus.calculusexc.InsideIntervalError(value, min_value, max_value)[source]¶
Bases:
ValueError
:ERROR 3040:.
Value {value} is out of range: expected value in range [{min_value},{max_value}].
- exception calculus.calculusexc.ProbabilityError(value=None)[source]¶
Bases:
Exception
:ERROR 3015:.
Value must range between 0 and 1. Got {:f}.
Module contents¶
- filename
sppas.src.calculus.__init__.py
- author
Brigitte Bigi
- contact
- summary
Package for the calculus of SPPAS.
calculus: proposes some math on data.¶
This package includes mathematical functions to estimate descriptive statistics, to compute scores, and to work in the domain of information theory.
No other package is required. This package is compatible with all versions of Python from 2.7 to 3.9.
- calculus.chi_squared(x, y)[source]¶
Estimate the Chi-squared distance between two tuples.
- Parameters
x – a tuple of float values
y – a tuple of float values
- Returns
(float)
x and y must have the same length.
>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> round(chi_squared(x, y), 3)
1.414
- calculus.compute_error_for_line_given_points(b, m, points)[source]¶
Error function (also called a cost function).
It measures how “good” a given line is.
This function will take in a (m,b) pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each (x,y) point in our data set and sum the square distances between each point’s y value and the candidate line’s y value (computed at mx + b).
Lines that fit our data better (where better is defined by our error function) will result in lower error values.
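A minimal sketch of such an error function, assuming the points are (x, y) tuples. The name `line_error` is hypothetical, and whether the library normalises the sum (for instance by the number of points) is not stated here, so the sketch returns the plain sum of squared distances.

```python
def line_error(b, m, points):
    """Sum of squared vertical distances between the points and y = m*x + b.

    Hypothetical helper, not the library function itself.
    """
    total = 0.0
    for x, y in points:
        # Difference between the observed y and the y predicted by the line.
        total += (y - (m * x + b)) ** 2
    return total


# Three points lying exactly on y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
perfect = line_error(1.0, 2.0, points)   # the true line
worse = line_error(0.0, 2.0, points)     # same slope, wrong intercept
```

The candidate with the wrong intercept misses every point by 1, so its error is larger than that of the true line.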
- calculus.euclidian(x, y)[source]¶
Estimate the Euclidian distance between two tuples.
- Parameters
x – a tuple of float values
y – a tuple of float values
- Returns
(float)
x and y must have the same length.
>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> round(euclidian(x, y), 3)
1.414
- calculus.fgeometricmean(items)[source]¶
Calculate the geometric mean of the data values.
n-th root of (x1 * x2 * … * xn).
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.fharmonicmean(items)[source]¶
Calculate the harmonic mean of the data values.
n / (1/x1 + 1/x2 + … + 1/xn).
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.find_ngrams(symbols, ngram)[source]¶
Return a list of n-grams from a list of symbols.
- Parameters
symbols – (list)
ngram – (int) n value for the ngrams
- Returns
list of tuples
Example:
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(find_ngrams(symbols, 2))
[(0, 1), (1, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
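One common way to build such n-grams is to zip n shifted views of the list. The sketch below illustrates that idea under the name `ngrams_sketch`, which is hypothetical; it is not necessarily how find_ngrams is written.

```python
def ngrams_sketch(symbols, n):
    """Return the n-grams of a list of symbols as a list of tuples."""
    # symbols[0:], symbols[1:], ..., symbols[n-1:] are zipped together,
    # so each tuple holds n consecutive symbols.
    return list(zip(*[symbols[i:] for i in range(n)]))


bigrams = ngrams_sketch([0, 1, 0, 1, 1, 1, 0], 2)
trigrams = ngrams_sketch(["a", "b", "c", "d"], 3)
```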
- calculus.fmax(items)[source]¶
Return the maximum of the data values.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.fmean(items)[source]¶
Calculate the arithmetic mean of the data values.
sum(items)/len(items)
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.fmedian(items)[source]¶
Calculate the ‘middle’ score of the data values.
If there is an even number of scores, the mean of the 2 middle scores is returned.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.fmin(items)[source]¶
Return the minimum of the data values.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.fmult(items)[source]¶
Estimate the product of a list of data values.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.freq(mylist, item)[source]¶
Return the relative frequency of an item of a list.
- Parameters
mylist – (list) list of elements
item – (any) an element of the list (or not!)
- Returns
frequency (float) of item in mylist
- calculus.fsum(items)[source]¶
Estimate the sum of a list of data values.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.gradient_descent(points, starting_b, starting_m, learning_rate, num_iterations)[source]¶
Gradient descent is an algorithm that minimizes functions.
Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.
- Parameters
points – a list of tuples (x,y) of float values.
starting_b – (float)
starting_m – (float)
learning_rate – (float)
num_iterations – (int)
- Returns
intercept, slope
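The iteration described above can be sketched as follows for a line y = m*x + b. This is an illustration only: the function names, the mean squared error as cost function, and the default learning rate are assumptions, not the library's actual code.

```python
def gradient_step(b, m, points, learning_rate):
    """One gradient descent update of (b, m) for the mean squared error."""
    n = float(len(points))
    grad_b = 0.0
    grad_m = 0.0
    for x, y in points:
        residual = y - (m * x + b)
        # Partial derivatives of (1/n) * sum(residual ** 2).
        grad_b += -(2.0 / n) * residual
        grad_m += -(2.0 / n) * x * residual
    # Move in the negative direction of the gradient.
    return b - learning_rate * grad_b, m - learning_rate * grad_m


def descend(points, b=0.0, m=0.0, learning_rate=0.05, num_iterations=5000):
    """Repeat the update a fixed number of times; return (intercept, slope)."""
    for _ in range(num_iterations):
        b, m = gradient_step(b, m, points, learning_rate)
    return b, m


# Points lying on y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
intercept, slope = descend(points)
```

With a learning rate that is too large the updates diverge, so the value has to be tuned to the scale of the data.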
- calculus.gradient_descent_linear_regression(points, num_iterations=50000)[source]¶
Gradient descent method for linear regression.
adapted from: http://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
- Parameters
points – a list of tuples (x,y) of float values.
num_iterations – (int)
- Returns
intercept, slope
- calculus.intercept(p1, p2)[source]¶
Estimate the intercept of the line passing through 2 points.
- Parameters
p1 – (tuple) first point as (x1, y1)
p2 – (tuple) second point as (x2, y2)
- Returns
float value
- calculus.linear_fct(x, a, b)[source]¶
Return f(x) of the linear function f(x) = ax + b.
- Parameters
x – (float) X-coord
a – (float) slope
b – (float) intercept
- calculus.linear_values(delta, p1, p2, rounded=6)[source]¶
Estimate the values between 2 points, step-by-step.
Two different points p1=(x1,y1) and p2=(x2,y2) determine a line. It is enough to substitute two different values for ‘x’ in the linear function and determine ‘y’ for each of these values.
a = (y2 − y1) / (x2 − x1) <= slope
b = y1 − a * x1 <= intercept
Values for p1 and p2 are added into the result.
- Parameters
delta – (float) Step range between values.
p1 – (tuple) first point as (x1, y1)
p2 – (tuple) second point as (x2, y2)
rounded – (int) round floats
- Returns
list of float values, i.e. all the y, including the ones of p1 and p2
- Raises
MemoryError could be raised if too many values have to be returned.
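A minimal sketch of this step-by-step estimation, assuming that delta divides the distance between x1 and x2 evenly. The name `linear_values_sketch` and the way the number of steps is computed are assumptions; the library may handle the last step differently.

```python
def linear_values_sketch(delta, p1, p2, rounded=6):
    """Return the y values of the line through p1 and p2, every delta on x."""
    x1, y1 = p1
    x2, y2 = p2
    a = (y2 - y1) / (x2 - x1)   # slope
    b = y1 - a * x1             # intercept
    # Number of steps needed to go from x1 to x2 (assumed to be a whole number).
    steps = int(round((x2 - x1) / delta))
    return [round(a * (x1 + i * delta) + b, rounded) for i in range(steps + 1)]


values = linear_values_sketch(0.5, (0.0, 1.0), (2.0, 5.0))
```

The list length grows as (x2 - x1) / delta, which is why a very small delta can exhaust memory.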
- calculus.lkurtosis(items)[source]¶
Return the kurtosis of a distribution.
The kurtosis represents a measure of the “peakedness”: a high kurtosis distribution has a sharper peak and fatter tails, while a low kurtosis distribution has a more rounded peak and thinner tails.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.lmoment(items, moment=1)[source]¶
Calculate the r-th moment about the mean for a sample.
1/n * SUM((items(i)-mean)**r)
- Parameters
items – (list) list of data values
moment – (int) order r of the moment (default: 1)
- Returns
(float)
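The formula above translates almost directly into code. The sketch below uses the hypothetical name `central_moment` and is only meant to make the formula concrete.

```python
def central_moment(items, r=1):
    """r-th moment about the mean: 1/n * SUM((x - mean) ** r)."""
    n = float(len(items))
    mean = sum(items) / n
    return sum((x - mean) ** r for x in items) / n


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m1 = central_moment(data, 1)   # first central moment
m2 = central_moment(data, 2)   # second central moment (population variance)
```

The first central moment is zero for this sample, and the second one is the population variance.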
- calculus.lskew(items)[source]¶
Calculate the skewness of a distribution.
The skewness represents a measure of the asymmetry: an understanding of the skewness of the dataset indicates whether deviations from the mean are going to be positive or negative.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.lstdev(items)[source]¶
Calculate the standard deviation of the data values, for a population.
The standard deviation is the positive square root of the variance.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.lvariance(items)[source]¶
Calculate the variance of the data values, for a population.
That is, the estimation uses N as the denominator. The variance is a measure of dispersion around the mean.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.lvariation(items)[source]¶
Calculate the coefficient of variation of data values.
It shows the extent of variability in relation to the mean. It is a standardized measure of dispersion, stdev / mean, returned as a percentage.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.lz(items, score)[source]¶
Calculate the z-score for a given input score, given the data values from which that score came.
The z-score determines the relative location of a data value.
- Parameters
items – (list) list of data values
score – (float) a score taken from the items
- Returns
(float)
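A z-score is the distance of a score from the mean, expressed in standard deviations. The sketch below assumes the population standard deviation (denominator N), in line with lstdev; the name `z_score` is hypothetical.

```python
import math


def z_score(items, score):
    """Return (score - mean) / stdev, with the population standard deviation."""
    n = float(len(items))
    mean = sum(items) / n
    variance = sum((x - mean) ** 2 for x in items) / n
    return (score - mean) / math.sqrt(variance)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, stdev 2
z = z_score(data, 9.0)
```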
- calculus.manathan(x, y)[source]¶
Estimate the Manhattan distance between two tuples.
- Parameters
x – a tuple of float values
y – a tuple of float values
- Returns
(float)
x and y must have the same length.
>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> manathan(x, y)
2.0
- calculus.minkowski(x, y, p=2)[source]¶
Estimate the Minkowski distance between two tuples.
- Parameters
x – a tuple of float values
y – a tuple of float values
p – power value (p=2 corresponds to the euclidian distance)
- Returns
(float)
x and y must have the same length.
>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> round(minkowski(x, y), 3)
1.414
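The Minkowski distance generalises several of the distances of this package: p=1 gives the Manhattan distance and p=2 the Euclidean one. The sketch below uses the hypothetical name `minkowski_sketch`; raising ValueError on a length mismatch is an assumption, since the exception actually raised by the library is not documented here.

```python
def minkowski_sketch(x, y, p=2):
    """Return (SUM |xi - yi| ** p) ** (1/p) for two tuples of equal length."""
    if len(x) != len(y):
        # Assumption: the library may raise a different exception.
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x = (1.0, 0.0)
y = (0.0, 1.0)
d1 = minkowski_sketch(x, y, p=1)   # Manhattan distance
d2 = minkowski_sketch(x, y, p=2)   # Euclidean distance
```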
- calculus.nPVI(items)[source]¶
Calculate the Normalized Pairwise Variability Index.
- Parameters
items – (list) list of data values
- Returns
(float)
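In its commonly used definition, the nPVI averages the difference between successive values, each normalised by the mean of the pair, and multiplies the result by 100. The sketch below follows that definition; the name `npvi_sketch` is hypothetical and the scaling by 100 is an assumption about the library's output.

```python
def npvi_sketch(items):
    """Normalized Pairwise Variability Index of a list of positive values."""
    # Successive pairs (d1, d2), (d2, d3), ...
    pairs = list(zip(items[:-1], items[1:]))
    total = sum(abs(a - b) / ((a + b) / 2.0) for a, b in pairs)
    return 100.0 * total / len(pairs)


flat = npvi_sketch([0.2, 0.2, 0.2, 0.2])          # no variability
alternating = npvi_sketch([1.0, 3.0, 1.0, 3.0])   # strong alternation
```

A sequence of identical values yields 0, and the index grows as successive values differ more relative to their mean.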
- calculus.percent(mylist, item)[source]¶
Return the percentage of an item of a list.
- Parameters
mylist – (list) list of elements
item – (any) an element of the list (or not!)
- Returns
percentage (float) of item in mylist
- calculus.percentile(mylist, p=(25, 50, 75), sort=True)[source]¶
Return the pth percentile of an unsorted or sorted numeric list.
This is equivalent to calling quantile(mylist, p/100.0).
>>> round(percentile([15, 20, 40, 35, 50], 40), 2)
26.0
>>> for perc in percentile([15, 20, 40, 35, 50], (0, 25, 50, 75, 100)):
...     print("{:.2f}".format(perc))
...
15.00
17.50
35.00
45.00
50.00
- Parameters
mylist – (list) list of elements.
p – (tuple) the percentile we are looking for.
sort – whether to sort the vector.
- Returns
percentile as a float
- calculus.quantile(mylist, q=(0.25, 0.5, 0.75), sort=True)[source]¶
Return the qth quantile of an unsorted or sorted numeric list.
Calculates a rank n as q(N+1), where N is the number of items in mylist, then splits n into its integer component k and decimal component d. If k < 1, returns the first element; if k >= N, returns the last element; otherwise returns the linear interpolation between mylist[k-1] and mylist[k] using a factor d.
>>> round(quantile([15, 20, 40, 35, 50], 0.4), 2)
26.0
- Parameters
mylist – (list) list of elements.
q – (tuple) the quantile we are looking for.
sort – whether to sort the vector.
- Returns
quantile as a float
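The rank-and-interpolate procedure can be sketched for a single q value as follows. The name `quantile_sketch` is hypothetical. The lower bound is tested with k < 1, an assumption chosen because it reproduces the outputs documented for percentile (17.50 for the 25th percentile of the sample data).

```python
def quantile_sketch(values, q, sort=True):
    """Return the q-th quantile (0 <= q <= 1) of a numeric list."""
    xs = sorted(values) if sort else list(values)
    n = q * (len(xs) + 1)   # rank
    k = int(n)              # integer component
    d = n - k               # decimal component
    if k < 1:
        return float(xs[0])
    if k >= len(xs):
        return float(xs[-1])
    # Linear interpolation between the two neighbouring items.
    return xs[k - 1] + d * (xs[k] - xs[k - 1])


data = [15, 20, 40, 35, 50]
q40 = quantile_sketch(data, 0.4)
quartiles = [quantile_sketch(data, q) for q in (0.25, 0.5, 0.75)]
```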
- calculus.rPVI(items)[source]¶
Calculate the Raw Pairwise Variability Index.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.slope(p1, p2)[source]¶
Estimate the slope between 2 points.
- Parameters
p1 – (tuple) first point as (x1, y1)
p2 – (tuple) second point as (x2, y2)
- Returns
float value
- calculus.slope_intercept(p1, p2)[source]¶
Return the slope and the intercept.
- Parameters
p1 – (tuple) first point as (x1, y1)
p2 – (tuple) second point as (x2, y2)
- Returns
tuple(slope,intercept)
- class calculus.sppasDescriptiveStatistics(dict_items)[source]¶
Bases:
object
Descriptive statistics estimator class.
This class estimates descriptive statistics on a set of data values, stored in a dictionary:
the key is the name of the data set;
the value is the list of data values for this data set.
>>> d = {'apples':[1, 2, 3, 4], 'peers':[2, 3, 3, 5]}
>>> s = sppasDescriptiveStatistics(d)
>>> total = s.total()
>>> print(total)
(('peers', 13.0), ('apples', 10.0))
- __init__(dict_items)[source]¶
Descriptive statistics.
- Parameters
dict_items – a dict of tuples (key, [values])
- coefvariation()[source]¶
Estimate the coefficient of variation of data values.
- Returns
(dict) a dictionary of (key, coefvariation) of float values (given as a percentage).
- len()[source]¶
Estimate the number of occurrences of data values.
- Returns
(dict) a dictionary of tuples (key, len)
- max()[source]¶
Estimate the maximum of data values.
- Returns
(dict) a dictionary of (key, max) of float values
- mean()[source]¶
Estimate the arithmetic mean of data values.
- Returns
(dict) a dictionary of (key, mean) of float values
- median()[source]¶
Estimate the ‘middle’ score of the data values.
- Returns
(dict) a dictionary of (key, median) of float values
- min()[source]¶
Estimate the minimum of data values.
- Returns
(dict) a dictionary of (key, min) of float values
- stdev()[source]¶
Estimate the standard deviation of data values.
- Returns
(dict) a dictionary of (key, stddev) of float values
- total()[source]¶
Estimate the sum of data values.
- Returns
(dict) a dictionary of tuples (key, total) of float values
- class calculus.sppasEntropy(symbols, n=1)[source]¶
Bases:
object
Entropy estimation.
Entropy is a measure of unpredictability of information content. Entropy is one of several ways to measure diversity.
If we want to look at the entropy on a large series, we could also compute the entropy for windows to measure the evenness or uncertainties. From the definition, one could predict that areas with a lot of variance would result in a higher entropy, and areas with lower variance in a lower entropy.
- __init__(symbols, n=1)[source]¶
Create a sppasEntropy instance with a list of symbols.
- Parameters
symbols – (list) a vector of symbols of any type.
n – (int) n value for n-gram estimation. n ranges 1..MAX_NGRAM
- eval()[source]¶
Estimate the Shannon entropy of a vector of symbols.
Shannon’s entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable).
- Returns
(float) entropy value
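As an illustration, Shannon entropy over single symbols (n=1) can be sketched as below. The name `shannon_entropy` is hypothetical, and the base-2 logarithm (entropy in bits) is an assumption; the handling of n-grams with n > 1 is not shown.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return -SUM(p * log2(p)) over the distinct symbols of a list."""
    n = float(len(symbols))
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


uniform = shannon_entropy([0, 1, 0, 1])        # two equiprobable symbols
constant = shannon_entropy(["a", "a", "a"])    # fully predictable
```

Two equiprobable symbols carry one bit of information each, while a constant sequence carries none.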
- class calculus.sppasKappa(p=[], q=[])[source]¶
Bases:
object
Inter-observer variation estimation.
The calculation is based on the difference between how much agreement is actually present (“observed” agreement) compared to how much agreement would be expected to be present by chance alone (“expected” agreement).
Imagine a situation in which annotators have to answer Yes or No to 5 questions.
Person “P” answered: Yes, No, No, Yes, Yes
Person “Q” answered: Yes, No, Yes, Yes, Yes
This results in the following vectors of probabilities:
>>> p = [(1., 0.), (0., 1.), (0., 1.), (1., 0.), (1., 0.)]
>>> q = [(1., 0.), (0., 1.), (1., 0.), (1., 0.), (1., 0.)]
Cohen’s Kappa is then evaluated as follows:
>>> sppasKappa.check_vector(p)
True
>>> sppasKappa.check_vector(q)
True
>>> kappa = sppasKappa(p, q)
>>> kappa.evaluate()
0.54545
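The value 0.54545 can be reproduced by hand from the observed and expected agreements. The sketch below is a standalone illustration with the hypothetical name `cohen_kappa`; it is not the code of sppasKappa.

```python
def cohen_kappa(p, q):
    """Cohen's Kappa between two vectors of per-category probability tuples."""
    n = float(len(p))
    categories = range(len(p[0]))
    # Observed agreement: average probability that both give the same answer.
    observed = sum(
        sum(pi[c] * qi[c] for c in categories) for pi, qi in zip(p, q)
    ) / n
    # Expected agreement: product of the marginal proportions per category.
    expected = sum(
        (sum(pi[c] for pi in p) / n) * (sum(qi[c] for qi in q) / n)
        for c in categories
    )
    return (observed - expected) / (1.0 - expected)


p = [(1., 0.), (0., 1.), (0., 1.), (1., 0.), (1., 0.)]
q = [(1., 0.), (0., 1.), (1., 0.), (1., 0.), (1., 0.)]
kappa = cohen_kappa(p, q)
```

Here the two persons agree on 4 answers out of 5 (observed agreement 0.8), the expected agreement is 0.6 * 0.8 + 0.4 * 0.2 = 0.56, and (0.8 - 0.56) / (1 - 0.56) gives the kappa.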
- __init__(p=[], q=[])[source]¶
Create a sppasKappa instance with two lists of tuples p and q.
>>> p=[(1., 0.), (1., 0.), (0.8, 0.2)]
- Parameters
p – a vector of tuples of float values
q – a vector of tuples of float values
- static check_vector(v)[source]¶
Check if the vector is correct to be used.
- Parameters
v – a vector of tuples of probabilities.
- evaluate()[source]¶
Estimate the Cohen’s Kappa between two lists of tuples p and q.
The tuple size corresponds to the number of categories, each value is the score assigned to each category for a given sample.
- Returns
float value
- class calculus.sppasKullbackLeibler(model=None, observations=None)[source]¶
Bases:
object
Kullback-Leibler distance estimator.
In probability theory and information theory, the Kullback–Leibler divergence (also called relative entropy) is a measure of the difference between two probability distributions P and Q. It is not symmetric in P and Q.
Specifically, the Kullback–Leibler divergence of Q from P, denoted DKL(P‖Q), is a measure of the information gained when one revises one’s beliefs from the prior probability distribution Q to the posterior probability distribution P.
However, the sppasKullbackLeibler class estimates the KL distance, i.e. the symmetric Kullback-Leibler divergence.
This sppasKullbackLeibler class implements the distance estimation between a model and the content of a moving window on data, as described in:
Brigitte Bigi, Renato De Mori, Marc El-Bèze, Thierry Spriet (1997). Combined models for topic spotting and topic-dependent language modeling. IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (ASRU), Edited by S. Furui, B. H. Huang and Wu Chu, IEEE Signal Processing Society Publ, NY, pages 535-542.
This KL distance can also be used to estimate the distance between documents for text categorization, as proposed in:
Brigitte Bigi (2003). Using Kullback-Leibler Distance for Text Categorization. Lecture Notes in Computer Science, Advances in Information Retrieval, ISSN 0302-9743, Fabrizio Sebastiani (Editor), Springer-Verlag (Publisher), pages 305–319, Pisa (Italy).
In this class…
A model is a dictionary with:
key is an n-gram,
value is a probability.
The window of observed symbols is represented as a list of n-grams.
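The symmetric divergence sums (P(x) - Q(x)) * log(P(x) / Q(x)) over all items, which equals DKL(P‖Q) + DKL(Q‖P). The sketch below compares two dictionaries of probabilities and is only an illustration: the name `symmetric_kl` is hypothetical, replacing a missing item by a small epsilon is a simplification of the back-off described in the cited papers, and the conversion of an observation window into probabilities is not shown.

```python
import math


def symmetric_kl(p, q, epsilon=1e-6):
    """Symmetric Kullback-Leibler distance between two dicts of probabilities."""
    distance = 0.0
    # Sorted keys keep the summation order deterministic.
    for key in sorted(set(p) | set(q)):
        # An item unknown in one distribution gets the small epsilon value.
        pk = p.get(key, epsilon)
        qk = q.get(key, epsilon)
        distance += (pk - qk) * math.log(pk / qk)
    return distance


model = {"a": 0.5, "b": 0.5}
same = symmetric_kl(model, {"a": 0.5, "b": 0.5})
other = symmetric_kl(model, {"a": 0.9, "b": 0.1})
reverse = symmetric_kl({"a": 0.9, "b": 0.1}, model)
```

Identical distributions are at distance 0, and swapping the two arguments leaves the distance unchanged, which is what makes it symmetric.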
- DEFAULT_EPSILON = 1e-06¶
- __init__(model=None, observations=None)[source]¶
Create a sppasKullbackLeibler instance from a list of symbols.
- Parameters
model – (dict) a dictionary with key=item, value=probability
observations – list of observed items
- set_epsilon(eps)[source]¶
Fix the linear back-off value for unknown observations.
The optimal value for this coefficient is the product of the size of both model and observations to estimate the KL. This value must be significantly lower than the minimum value in the model.
- Parameters
eps – (float) Epsilon value.
If eps is set to 0, a default value will be assigned.
- set_model(model)[source]¶
Set the model.
- Parameters
model – (dict) Probability distribution of the model.
- calculus.squared_euclidian(x, y)[source]¶
Estimate the Squared Euclidian distance between two tuples.
- Parameters
x – a tuple of float values
y – a tuple of float values
- Returns
(float)
x and y must have the same length.
>>> x = (1.0, 0.0)
>>> y = (0.0, 1.0)
>>> squared_euclidian(x, y)
2.0
- calculus.symbols_to_items(symbols, ngram)[source]¶
Convert a list of symbols into a dictionary of items.
Example:
>>> symbols = [0, 1, 0, 1, 1, 1, 0]
>>> print(symbols_to_items(symbols, 2))
{(0, 1): 2, (1, 0): 2, (1, 1): 2}
- Returns
dictionary with key=tuple of symbols, value=number of occurrences
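Counting n-grams in this way can be sketched with collections.Counter. The name `count_ngrams` is hypothetical and the sketch is not necessarily how symbols_to_items is implemented.

```python
from collections import Counter


def count_ngrams(symbols, n):
    """Return a dict mapping each n-gram (tuple) to its number of occurrences."""
    # Build the n-grams from n shifted views of the list, then count them.
    grams = zip(*[symbols[i:] for i in range(n)])
    return dict(Counter(grams))


items = count_ngrams([0, 1, 0, 1, 1, 1, 0], 2)
```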
- calculus.tansey_linear_regression(points)[source]¶
Linear regression, as proposed in AnnotationPro.
Translated from C# code from here: https://gist.github.com/tansey/1375526
- Parameters
points – a list of tuples (x,y) of float values.
- Returns
intercept, slope
- calculus.tga_linear_regression(points)[source]¶
Linear regression as proposed in TGA, by Dafydd Gibbon.
http://wwwhomes.uni-bielefeld.de/gibbon/TGA/
- Parameters
points – a list of tuples (x,y) of float values.
- Returns
intercept, slope
- calculus.ubpa(vector, text, fp=sys.stdout, delta_max=0.04, step=0.01)[source]¶
Estimate the Unit Boundary Positioning Accuracy.
- Parameters
vector – contains the list of the delta values.
text – one of “Duration”, “Position Start”, …
fp – a file pointer
delta_max – Maximum delta duration to print result (default: 40ms)
step – Delta time (default: 10ms)
- Returns
(tab_neg, tab_pos) with number of occurrences of each position