# calculus.stats package

## calculus.stats.central module

filename

sppas.src.calculus.stats.central.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

A collection of basic statistical functions for python.

calculus.stats.central.fgeometricmean(items)[source]

Calculate the geometric mean of the data values.

n-th root of (x1 * x2 * … * xn).

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.central.fharmonicmean(items)[source]

Calculate the harmonic mean of the data values.

n / (1/x1 + 1/x2 + … + 1/xn).

Parameters

items – (list) list of data values

Returns

(float)
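The two formulas above can be sketched in a few lines. This is only an illustration under stated assumptions, not the module's code: the function names `geometric_mean` and `harmonic_mean` are hypothetical, the geometric mean assumes strictly positive values, and the harmonic mean assumes non-zero values.

```python
import math


def geometric_mean(items):
    """n-th root of (x1 * x2 * ... * xn); values are assumed positive."""
    # Summing logarithms gives the same result as the n-th root of the
    # product, but avoids overflow of the raw product on long lists.
    log_sum = sum(math.log(x) for x in items)
    return math.exp(log_sum / len(items))


def harmonic_mean(items):
    """n / (1/x1 + 1/x2 + ... + 1/xn); values are assumed non-zero."""
    return len(items) / sum(1.0 / x for x in items)
```

For example, `geometric_mean([1, 4, 16])` is the cube root of 64, and `harmonic_mean([1, 2, 4])` is 3 / 1.75.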

calculus.stats.central.fmax(items)[source]

Return the maximum of the data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.central.fmean(items)[source]

Calculate the arithmetic mean of the data values.

sum(items)/len(items)

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.central.fmedian(items)[source]

Calculate the ‘middle’ score of the data values.

If there is an even number of scores, the mean of the 2 middle scores is returned.

Parameters

items – (list) list of data values

Returns

(float)
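The even/odd rule described for the median might look like the sketch below. The name `median_sketch` is hypothetical and the code is not taken from the module; it assumes a non-empty list.

```python
def median_sketch(items):
    """Middle value of the sorted data, or the mean of the 2 middle values."""
    ordered = sorted(items)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of scores: a single middle score exists.
        return float(ordered[mid])
    # Even number of scores: average the two middle scores.
    return (ordered[mid - 1] + ordered[mid]) / 2.0
```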

calculus.stats.central.fmin(items)[source]

Return the minimum of the data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.central.fmult(items)[source]

Estimate the product of a list of data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.central.fsum(items)[source]

Estimate the sum of a list of data values.

Parameters

items – (list) list of data values

Returns

(float)

## calculus.stats.descriptivesstats module

filename

sppas.src.calculus.stats.descriptivesstats.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Descriptive statistics.

class calculus.stats.descriptivesstats.sppasDescriptiveStatistics(dict_items)[source]

Bases: `object`

Descriptive statistics estimator class.

This class estimates descriptive statistics on a set of data values, stored in a dictionary:

• the key is the name of the data set;

• the value is the list of data values for this data set.

```
>>> d = {'apples':[1, 2, 3, 4], 'peers':[2, 3, 3, 5]}
>>> s = sppasDescriptiveStatistics(d)
>>> total = s.total()
>>> print(total)
(('peers', 13.0), ('apples', 10.0))
```
__init__(dict_items)[source]

Descriptive statistics.

Parameters

dict_items – a dict of tuples (key, [values])

coefvariation()[source]

Estimate the coefficient of variation of data values.

Returns

(dict) a dictionary of (key, coefvariation) of float values (given as a percentage).

len()[source]

Estimate the number of occurrences of data values.

Returns

(dict) a dictionary of tuples (key, len)

max()[source]

Estimate the maximum of data values.

Returns

(dict) a dictionary of (key, max) of float values

mean()[source]

Estimate the arithmetic mean of data values.

Returns

(dict) a dictionary of (key, mean) of float values

median()[source]

Estimate the ‘middle’ score of the data values.

Returns

(dict) a dictionary of (key, median) of float values

min()[source]

Estimate the minimum of data values.

Returns

(dict) a dictionary of (key, min) of float values

stdev()[source]

Estimate the standard deviation of data values.

Returns

(dict) a dictionary of (key, stdev) of float values

total()[source]

Estimate the sum of data values.

Returns

(dict) a dictionary of tuples (key, total) of float values

variance()[source]

Estimate the unbiased sample variance of data values.

Returns

(dict) a dictionary of (key, variance) of float values

zscore()[source]

Estimate the z-scores of data values.

The z-score determines the relative location of a data value.

Returns

(dict) a dictionary of (key, [z-scores]) of float values

## calculus.stats.frequency module

filename

sppas.src.calculus.stats.frequency.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

A collection of basic frequency functions for python.

calculus.stats.frequency.freq(mylist, item)[source]

Return the relative frequency of an item of a list.

Parameters
• mylist – (list) list of elements

• item – (any) an element of the list (or not!)

Returns

frequency (float) of item in mylist
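A relative frequency is the number of occurrences of the item divided by the size of the list. The sketch below illustrates this; the name `relative_frequency` is hypothetical, and returning 0.0 for an empty list is an assumption, not documented behavior.

```python
def relative_frequency(mylist, item):
    """Share of the entries of mylist that are equal to item."""
    if len(mylist) == 0:
        # Assumption: an empty list yields 0.0 rather than an error.
        return 0.0
    return mylist.count(item) / float(len(mylist))
```

An item that is not in the list simply gets a frequency of 0.0.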

calculus.stats.frequency.hapax(mydict)[source]

Return a list of hapax.

Parameters

mydict – (dict)

Returns

list of keys for which value = 1

calculus.stats.frequency.occranks(mydict)[source]

Return a dictionary with key=occurrence, value=rank.

Parameters

mydict – (dict)

Returns

dict

calculus.stats.frequency.percent(mylist, item)[source]

Return the percentage of an item of a list.

Parameters
• mylist – (list) list of elements

• item – (any) an element of the list (or not!)

Returns

percentage (float) of item in mylist

calculus.stats.frequency.percentile(mylist, p=(25, 50, 75), sort=True)[source]

Return the pth percentile of an unsorted or sorted numeric list.

This is equivalent to calling quantile(mylist, p/100.0).

```
>>> round(percentile([15, 20, 40, 35, 50], 40), 2)
26.0
>>> for perc in percentile([15, 20, 40, 35, 50], (0, 25, 50, 75, 100)):
...     print("{:.2f}".format(perc))
...
15.00
17.50
35.00
45.00
50.00
```
Parameters
• mylist – (list) list of elements.

• p – (tuple) the percentile we are looking for.

• sort – whether to sort the vector.

Returns

percentile as a float

calculus.stats.frequency.quantile(mylist, q=(0.25, 0.5, 0.75), sort=True)[source]

Return the qth quantile of an unsorted or sorted numeric list.

Calculates a rank n as q(N+1), where N is the number of items in mylist, then splits n into its integer component k and decimal component d. If k < 1, returns the first element; if k >= N, returns the last element; otherwise returns the linear interpolation between mylist[k-1] and mylist[k] using a factor d, that is mylist[k-1] + d * (mylist[k] - mylist[k-1]).

```
>>> round(quantile([15, 20, 40, 35, 50], 0.4), 2)
26.0
```
Parameters
• mylist – (list) list of elements.

• q – (tuple) the quantile we are looking for.

• sort – whether to sort the vector.

Returns

quantile as a float
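The rank-and-interpolate procedure can be sketched for a single quantile as follows. The name `quantile_sketch` is hypothetical and the code is not the module's implementation. The lower boundary test is written as `k < 1`, an assumption chosen so that the sketch reproduces the percentile outputs listed earlier (17.50 for the 25th percentile of the five-item example).

```python
import math


def quantile_sketch(values, q, sort=True):
    """q-th quantile (0 <= q <= 1) of a non-empty numeric list."""
    data = sorted(values) if sort else list(values)
    n = len(data)
    rank = q * (n + 1)            # rank n = q(N+1)
    k = int(math.floor(rank))     # integer component
    d = rank - k                  # decimal component
    if k < 1:
        return float(data[0])
    if k >= n:
        return float(data[-1])
    # Linear interpolation between the two neighbouring sorted items.
    return data[k - 1] + d * (data[k] - data[k - 1])
```

With `[15, 20, 40, 35, 50]` and `q=0.4`, the rank is 2.4, so the result lies 40% of the way from 20 to 35.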

calculus.stats.frequency.ranks(counter)[source]

Return a dictionary with key=token, value=rank.

Parameters

counter – (collections.Counter)

Returns

dict

calculus.stats.frequency.tfidf(documents, item)[source]

Return the tf.idf of an item.

Term frequency–inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The tf.idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.

Parameters
• documents – a list of list of entries.

• item

Returns

float
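Many tf.idf variants exist, and the text above does not say which one is used. The sketch below shows one plausible reading, given that the function receives the whole collection and no specific document: tf is taken as the relative frequency of the item over all entries, and idf as the natural logarithm of (number of documents / number of documents containing the item). Both choices, and the name `tfidf_sketch`, are assumptions.

```python
import math


def tfidf_sketch(documents, item):
    """tf.idf of item, for documents given as a list of lists of entries."""
    all_entries = [entry for doc in documents for entry in doc]
    if not all_entries:
        return 0.0
    # Assumed tf: relative frequency of the item in the pooled collection.
    tf = all_entries.count(item) / float(len(all_entries))
    # Number of documents with at least one occurrence of the item.
    containing = sum(1 for doc in documents if item in doc)
    if containing == 0:
        return 0.0
    # Assumed idf: log of the inverse document frequency.
    idf = math.log(len(documents) / float(containing))
    return tf * idf
```

An item present in every document gets an idf of log(1) = 0, which is the offset for generally common words described above.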

calculus.stats.frequency.zipf(dict_ranks, item)[source]

Return the Zipf Law value of an item.

Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Parameters
• dict_ranks – (dict) is a dictionary with key=entry, value=rank.

• item – (any) is an entry of the ranks dictionary

Returns

Zipf value or -1 if the entry is missing

## calculus.stats.linregress module

filename

sppas.src.calculus.stats.linregress.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Linear regression functions for python.

The goal of linear regression is to fit a line to a set of points. The equation of the line is y = mx + b, where m is the slope and b is the y-intercept.

calculus.stats.linregress.compute_error_for_line_given_points(b, m, points)[source]

Error function (also called a cost function).

It measures how “good” a given line is.

This function will take in a (m,b) pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each (x,y) point in our data set and sum the square distances between each point’s y value and the candidate line’s y value (computed at mx + b).

Lines that fit our data better (where better is defined by our error function) will result in lower error values.
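The error computation described above can be sketched as a plain sum of squared vertical distances. The name `squared_error` is hypothetical; whether the module also divides the sum by the number of points is not stated here, so this sketch returns the raw sum.

```python
def squared_error(b, m, points):
    """Sum of squared distances between each y and the line value m*x + b."""
    total = 0.0
    for x, y in points:
        total += (y - (m * x + b)) ** 2
    return total
```

Points that lie exactly on the candidate line contribute nothing, so a perfect fit yields 0.0.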

Gradient descent is an algorithm that minimizes functions.

Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.

Parameters
• points – a list of tuples (x,y) of float values.

• starting_b – (float)

• starting_m – (float)

• learning_rate – (float)

• num_iterations – (int)

Returns

intercept, slope

Gradient descent method for linear regression.

Parameters
• points – a list of tuples (x,y) of float values.

• num_iterations – (int)

Returns

intercept, slope

One step of a gradient linear regression.

To run gradient descent on an error function, we first need to compute its gradient. The gradient will act like a compass and always point us downhill. To compute it, we will need to differentiate our error function. Since our function is defined by two parameters (m and b), we will need to compute a partial derivative for each.

Each iteration will update m and b to a line that yields slightly lower error than the previous iteration.

The learning_rate variable controls how large a step we take downhill during each iteration. If the step is too large, we may step over the minimum. If the step is too small, many iterations are needed to reach the minimum.
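One way to picture a gradient step and the surrounding loop is the sketch below. It assumes the error is the mean of the squared residuals, so the partial derivatives are -(2/N) * sum(y - (m*x + b)) for b and -(2/N) * sum(x * (y - (m*x + b))) for m. The function names and the default values are hypothetical; only the parameter names come from the lists above.

```python
def step_gradient_sketch(b, m, points, learning_rate):
    """Move (b, m) one step against the gradient of the mean squared error."""
    n = float(len(points))
    grad_b = 0.0
    grad_m = 0.0
    for x, y in points:
        residual = y - (m * x + b)
        grad_b += -(2.0 / n) * residual       # partial derivative w.r.t. b
        grad_m += -(2.0 / n) * x * residual   # partial derivative w.r.t. m
    return b - learning_rate * grad_b, m - learning_rate * grad_m


def gradient_descent_sketch(points, starting_b=0.0, starting_m=0.0,
                            learning_rate=0.01, num_iterations=1000):
    """Repeat the step num_iterations times; return (intercept, slope)."""
    b, m = starting_b, starting_m
    for _ in range(num_iterations):
        b, m = step_gradient_sketch(b, m, points, learning_rate)
    return b, m
```

On points taken from y = 2x + 1, the loop should drift toward an intercept near 1 and a slope near 2, provided the learning rate is small enough for the data at hand.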

calculus.stats.linregress.tansey_linear_regression(points)[source]

Linear regression, as proposed in AnnotationPro.

http://annotationpro.org/

Translated from C# code from here: https://gist.github.com/tansey/1375526

Parameters

points – a list of tuples (x,y) of float values.

Returns

intercept, slope

calculus.stats.linregress.tga_linear_regression(points)[source]

Linear regression as proposed in TGA, by Dafydd Gibbon.

http://wwwhomes.uni-bielefeld.de/gibbon/TGA/

Parameters

points – a list of tuples (x,y) of float values.

Returns

intercept, slope

## calculus.stats.moment module

filename

sppas.src.calculus.stats.moment.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

A collection of basic statistical functions for python.

calculus.stats.moment.lkurtosis(items)[source]

Return the kurtosis of a distribution.

The kurtosis represents a measure of the “peakedness”: a high kurtosis distribution has a sharper peak and fatter tails, while a low kurtosis distribution has a more rounded peak and thinner tails.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.moment.lmoment(items, moment=1)[source]

Calculate the r-th moment about the mean for a sample.

1/n * SUM((items(i)-mean)**r)

Parameters
• items – (list) list of data values

• moment – the order r of the moment to calculate (default: 1)

Returns

(float)
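The formula 1/n * SUM((items(i)-mean)**r) translates almost directly into code. The sketch below uses the hypothetical name `central_moment` and assumes a non-empty list.

```python
def central_moment(items, r=1):
    """r-th moment about the mean: 1/n * SUM((x - mean)**r)."""
    n = float(len(items))
    mean = sum(items) / n
    return sum((x - mean) ** r for x in items) / n
```

The first moment about the mean is always 0, and the second one is the population variance.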

calculus.stats.moment.lskew(items)[source]

Calculate the skewness of a distribution.

The skewness represents a measure of the asymmetry: an understanding of the skewness of the dataset indicates whether deviations from the mean are going to be positive or negative.

Parameters

items – (list) list of data values

Returns

(float)
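Skewness and kurtosis are commonly expressed with moments about the mean: m3 / m2**1.5 for the skewness and m4 / m2**2 for the kurtosis. The sketch below uses these textbook forms as an assumption; whether the module applies a bias correction or subtracts 3 from the kurtosis is not stated here. All names are hypothetical, and the data must not be constant (m2 would be 0).

```python
def _moment(items, r):
    """r-th moment about the mean."""
    n = float(len(items))
    mean = sum(items) / n
    return sum((x - mean) ** r for x in items) / n


def skew_sketch(items):
    """Moment-based skewness: m3 / m2**1.5."""
    return _moment(items, 3) / _moment(items, 2) ** 1.5


def kurtosis_sketch(items):
    """Moment-based kurtosis (not excess kurtosis): m4 / m2**2."""
    return _moment(items, 4) / _moment(items, 2) ** 2
```

A symmetric sample has a skewness of 0, while a sample with a long right tail, such as `[1, 1, 1, 5]`, has a positive skewness.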

calculus.stats.moment.lvariation(items)[source]

Calculate the coefficient of variation of data values.

It shows the extent of variability in relation to the mean. It is a standardized measure of dispersion, computed as stdev / mean and returned as a percentage.

Parameters

items – (list) list of data values

Returns

(float)

## calculus.stats.variability module

filename

sppas.src.calculus.stats.variability.py

author

Brigitte Bigi

contact

develop@sppas.org

summary

Variance estimators.

calculus.stats.variability.lstdev(items)[source]

Calculate the standard deviation of the data values, for a population.

The standard deviation is the positive square root of the variance.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.variability.lsterr(items)[source]

Calculate the standard error of the data values.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.variability.lunbiasedstdev(items)[source]

Calculate the standard deviation of the data values, for a sample.

The standard deviation is the positive square root of the variance.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.variability.lunbiasedvariance(items)[source]

Calculate the unbiased variance of the data values, for a sample.

The estimate uses N-1 as the denominator. The variance is a measure of dispersion around the mean.

Parameters

items – (list) list of data values

Returns

(float)

calculus.stats.variability.lvariance(items)[source]

Calculate the variance of the data values, for a population.

The estimate uses N as the denominator. The variance is a measure of dispersion around the mean.

Parameters

items – (list) list of data values

Returns

(float)
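The population and sample estimators differ only in the denominator, as the following sketch shows. The names `variance_sketch` and `stdev_sketch` and the `unbiased` flag are hypothetical; the module exposes separate functions instead.

```python
import math


def variance_sketch(items, unbiased=False):
    """Variance with N (population) or N-1 (sample) as the denominator."""
    n = len(items)
    mean = sum(items) / float(n)
    squared_deviations = sum((x - mean) ** 2 for x in items)
    denominator = n - 1 if unbiased else n
    return squared_deviations / denominator


def stdev_sketch(items, unbiased=False):
    """Standard deviation: positive square root of the variance."""
    return math.sqrt(variance_sketch(items, unbiased))
```

For `[2, 4, 4, 4, 5, 5, 7, 9]` the squared deviations sum to 32, which gives 32 / 8 for a population and 32 / 7 for a sample.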

calculus.stats.variability.lz(items, score)[source]

Calculate the z-score for a given input score, given that score and the data values from which it came.

The z-score determines the relative location of a data value.

Parameters
• items – (list) list of data values

• score – (float) a score from the data values

Returns

(float)

calculus.stats.variability.lzs(items)[source]

Calculate a list of z-scores, one for each score in the data values.

Parameters

items – (list) list of data values

Returns

(list)
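A z-score is the distance of a score from the mean, expressed in standard deviations: (score - mean) / stdev. The sketch below assumes the population standard deviation (N as the denominator); which estimator the module uses is not stated here, and the function names are hypothetical.

```python
import math


def z_score(items, score):
    """(score - mean) / stdev, with the population standard deviation."""
    n = float(len(items))
    mean = sum(items) / n
    stdev = math.sqrt(sum((x - mean) ** 2 for x in items) / n)
    return (score - mean) / stdev


def z_scores(items):
    """One z-score per data value, in the original order."""
    return [z_score(items, x) for x in items]
```

Values above the mean get a positive z-score, and values below it get a negative one.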

calculus.stats.variability.nPVI(items)[source]

Calculate the Normalized Pairwise Variability Index.

Parameters

items – (list) list of data values

Returns

(float)
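In its usual formulation, the nPVI is 100 times the mean, over successive pairs of values, of the absolute difference divided by the pair's mean. The sketch below follows that standard definition as an assumption about what is computed here; the name `npvi_sketch` and the 0.0 result for fewer than two values are hypothetical choices, and values are assumed positive (durations, for instance).

```python
def npvi_sketch(items):
    """100 * mean of |a - b| / ((a + b) / 2) over successive pairs (a, b)."""
    if len(items) < 2:
        # Assumption: no pair to compare means no variability.
        return 0.0
    total = 0.0
    for a, b in zip(items, items[1:]):
        total += abs(a - b) / ((a + b) / 2.0)
    return 100.0 * total / (len(items) - 1)
```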

calculus.stats.variability.rPVI(items)[source]

Calculate the Raw Pairwise Variability Index.

Parameters

items – (list) list of data values

Returns

(float)
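The raw index is usually defined as the mean absolute difference between successive values, without normalization. The sketch below follows that standard definition as an assumption; the name `rpvi_sketch` and the 0.0 result for fewer than two values are hypothetical choices.

```python
def rpvi_sketch(items):
    """Mean of |a - b| over successive pairs (a, b) of the data values."""
    if len(items) < 2:
        # Assumption: no pair to compare means no variability.
        return 0.0
    total = sum(abs(a - b) for a, b in zip(items, items[1:]))
    return total / float(len(items) - 1)
```

Unlike the normalized index, the result is expressed in the same unit as the data values.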