calculus.stats package
Submodules
calculus.stats.central module
- filename
sppas.src.calculus.stats.central.py
- author
Brigitte Bigi
- contact
- summary
A collection of basic statistical functions for Python.
- calculus.stats.central.fgeometricmean(items)
Calculate the geometric mean of the data values.
n-th root of (x1 * x2 * … * xn).
- Parameters
items – (list) list of data values
- Returns
(float)
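As an illustration of the formula above, here is a minimal sketch that is not the SPPAS implementation. It computes the n-th root of the product through logarithms, which avoids overflow on long lists. The name geometric_mean_sketch is hypothetical, and the sketch assumes every value is strictly positive.

```python
import math


def geometric_mean_sketch(items):
    """Return the n-th root of (x1 * x2 * ... * xn).

    Hypothetical helper: uses exp(mean(log(x))), which equals the
    n-th root of the product when every value is > 0.
    """
    if len(items) == 0:
        raise ValueError("items must not be empty")
    if any(x <= 0 for x in items):
        raise ValueError("this sketch requires strictly positive values")
    log_sum = sum(math.log(x) for x in items)
    return math.exp(log_sum / len(items))


print(geometric_mean_sketch([1, 2, 4]))  # approximately 2.0
```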
- calculus.stats.central.fharmonicmean(items)
Calculate the harmonic mean of the data values.
n / (1/x1 + 1/x2 + … + 1/xn).
- Parameters
items – (list) list of data values
- Returns
(float)
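The harmonic mean formula can be sketched as follows. This is an illustrative helper with a hypothetical name, not the library code, and it assumes that no value is zero.

```python
def harmonic_mean_sketch(items):
    """Return n / (1/x1 + 1/x2 + ... + 1/xn).

    Hypothetical helper; a zero value would make a reciprocal undefined,
    so it is rejected explicitly.
    """
    if len(items) == 0:
        raise ValueError("items must not be empty")
    if any(x == 0 for x in items):
        raise ValueError("this sketch requires non-zero values")
    return len(items) / sum(1.0 / x for x in items)


print(harmonic_mean_sketch([40, 60]))  # approximately 48.0
```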
- calculus.stats.central.fmax(items)
Return the maximum of the data values.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.central.fmean(items)
Calculate the arithmetic mean of the data values.
sum(items)/len(items)
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.central.fmedian(items)
Calculate the ‘middle’ score of the data values.
If there is an even number of scores, the mean of the 2 middle scores is returned.
- Parameters
items – (list) list of data values
- Returns
(float)
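The rule for odd and even numbers of scores can be sketched as below. The name median_sketch is hypothetical and the code only illustrates the rule; it is not taken from the module.

```python
def median_sketch(items):
    """Return the middle value, or the mean of the 2 middle values."""
    if len(items) == 0:
        raise ValueError("items must not be empty")
    values = sorted(items)
    middle = len(values) // 2
    if len(values) % 2 == 1:
        # Odd number of scores: a single middle score exists.
        return float(values[middle])
    # Even number of scores: average the 2 middle scores.
    return (values[middle - 1] + values[middle]) / 2.0


print(median_sketch([3, 1, 2]))     # 2.0
print(median_sketch([4, 1, 3, 2]))  # 2.5
```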
- calculus.stats.central.fmin(items)
Return the minimum of the data values.
- Parameters
items – (list) list of data values
- Returns
(float)
calculus.stats.descriptivesstats module
- filename
sppas.src.calculus.stats.descriptivesstats.py
- author
Brigitte Bigi
- contact
- summary
Descriptive statistics.
- class calculus.stats.descriptivesstats.sppasDescriptiveStatistics(dict_items)
Bases:
object
Descriptive statistics estimator class.
This class estimates descriptive statistics on a set of data values, stored in a dictionary:
the key is the name of the data set;
the value is the list of data values for this data set.
>>> d = {'apples':[1, 2, 3, 4], 'peers':[2, 3, 3, 5]}
>>> s = sppasDescriptiveStatistics(d)
>>> total = s.total()
>>> print(total)
(('peers', 13.0), ('apples', 10.0))
- __init__(dict_items)
Descriptive statistics.
- Parameters
dict_items – a dict of tuples (key, [values])
- coefvariation()
Estimate the coefficient of variation of data values.
- Returns
(dict) a dictionary of (key, coefvariation) of float
values (given as a percentage).
- len()
Estimate the number of occurrences of data values.
- Returns
(dict) a dictionary of tuples (key, len)
- max()
Estimate the maximum of data values.
- Returns
(dict) a dictionary of (key, max) of float values
- mean()
Estimate the arithmetic mean of data values.
- Returns
(dict) a dictionary of (key, mean) of float values
- median()
Estimate the ‘middle’ score of the data values.
- Returns
(dict) a dictionary of (key, median) of float values
- min()
Estimate the minimum of data values.
- Returns
(dict) a dictionary of (key, min) of float values
- stdev()
Estimate the standard deviation of data values.
- Returns
(dict) a dictionary of (key, stdev) of float values
- total()
Estimate the sum of data values.
- Returns
(dict) a dictionary of tuples (key, total) of float values
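The class applies one statistic per data set, so its behaviour can be approximated by mapping a function over the dictionary. The helper per_data_set below is hypothetical and does not use the class itself. It returns a plain dict, whereas the container actually returned by the class may differ (the example above prints pairs).

```python
def per_data_set(dict_items, func):
    """Apply func to the list of values of each data set.

    Hypothetical helper: returns a dict {name: statistic}.
    """
    return {key: func(values) for key, values in dict_items.items()}


data = {'apples': [1, 2, 3, 4], 'peers': [2, 3, 3, 5]}

totals = per_data_set(data, lambda values: float(sum(values)))
means = per_data_set(data, lambda values: sum(values) / float(len(values)))

print(totals)
print(means)
```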
calculus.stats.frequency module
- filename
sppas.src.calculus.stats.frequency.py
- author
Brigitte Bigi
- contact
- summary
A collection of basic frequency functions for Python.
- calculus.stats.frequency.freq(mylist, item)
Return the relative frequency of an item of a list.
- Parameters
mylist – (list) list of elements
item – (any) an element of the list (or not!)
- Returns
frequency (float) of item in mylist
- calculus.stats.frequency.hapax(mydict)
Return a list of hapax.
- Parameters
mydict – (dict)
- Returns
list of keys for which value = 1
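A hapax is an entry that occurs exactly once. The selection can be sketched as below. It assumes the dictionary maps each entry to its number of occurrences (here built with collections.Counter), and the name hapax_sketch is hypothetical.

```python
from collections import Counter


def hapax_sketch(mydict):
    """Return the keys whose value is exactly 1."""
    return [key for key, count in mydict.items() if count == 1]


tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(sorted(hapax_sketch(Counter(tokens))))  # ['cat', 'mat', 'on', 'sat']
```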
- calculus.stats.frequency.occranks(mydict)
Return a dictionary with key=occurrence, value=rank.
- Parameters
mydict – (dict)
- Returns
dict
- calculus.stats.frequency.percent(mylist, item)
Return the percentage of an item of a list.
- Parameters
mylist – (list) list of elements
item – (any) an element of the list (or not!)
- Returns
percentage (float) of item in mylist
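The relative frequency and the percentage of an item differ only by a factor of 100. The sketch below uses hypothetical names. It assumes that an empty list yields 0.0 and applies no rounding, which may differ from the library behaviour.

```python
def relative_frequency_sketch(mylist, item):
    """Return the number of occurrences of item divided by the list length."""
    if len(mylist) == 0:
        return 0.0  # assumption: an empty list yields 0.0
    return mylist.count(item) / float(len(mylist))


def percent_sketch(mylist, item):
    """Return the relative frequency expressed as a percentage."""
    return 100.0 * relative_frequency_sketch(mylist, item)


entries = ["a", "b", "a", "c"]
print(relative_frequency_sketch(entries, "a"))  # 0.5
print(percent_sketch(entries, "a"))             # 50.0
print(percent_sketch(entries, "z"))             # 0.0, "z" is not in the list
```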
- calculus.stats.frequency.percentile(mylist, p=(25, 50, 75), sort=True)
Return the pth percentile of an unsorted or sorted numeric list.
This is equivalent to calling quantile(mylist, p/100.0).
>>> round(percentile([15, 20, 40, 35, 50], 40), 2)
26.0
>>> for perc in percentile([15, 20, 40, 35, 50], (0, 25, 50, 75, 100)):
...     print("{:.2f}".format(perc))
...
15.00
17.50
35.00
45.00
50.00
- Parameters
mylist – (list) list of elements.
p – (number or tuple) the percentile(s) we are looking for.
sort – whether to sort the vector.
- Returns
percentile as a float
- calculus.stats.frequency.quantile(mylist, q=(0.25, 0.5, 0.75), sort=True)
Return the qth quantile of an unsorted or sorted numeric list.
Calculates a rank n as q(N+1), where N is the number of items in mylist, then splits n into its integer component k and decimal component d. If k < 1, returns the first element; if k >= N, returns the last element; otherwise returns the linear interpolation between mylist[k-1] and mylist[k] using a factor d.
>>> round(quantile([15, 20, 40, 35, 50], 0.4), 2)
26.0
- Parameters
mylist – (list) list of elements.
q – (number or tuple) the quantile(s) we are looking for.
sort – whether to sort the vector.
- Returns
quantile as a float
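The rank-and-interpolate procedure described for quantile can be sketched for a single value of q as follows. The name quantile_sketch is hypothetical. The first-element case is taken here as k < 1, because that is what the documented percentile outputs imply (17.50 for the 25th percentile of the example list).

```python
def quantile_sketch(mylist, q, sort=True):
    """Return the q-th quantile (0 <= q <= 1) of a numeric list."""
    if len(mylist) == 0:
        raise ValueError("mylist must not be empty")
    values = sorted(mylist) if sort else list(mylist)
    rank = q * (len(values) + 1)  # rank n = q(N+1)
    k = int(rank)                 # integer component
    d = rank - k                  # decimal component
    if k < 1:
        return float(values[0])
    if k >= len(values):
        return float(values[-1])
    # Linear interpolation between the two neighbouring values.
    return (1 - d) * values[k - 1] + d * values[k]


data = [15, 20, 40, 35, 50]
print(round(quantile_sketch(data, 0.4), 2))  # 26.0
for p in (0, 25, 50, 75, 100):
    print("{:.2f}".format(quantile_sketch(data, p / 100.0)))
```

The loop reproduces the five percentile values listed in the percentile example (15.00, 17.50, 35.00, 45.00, 50.00).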
- calculus.stats.frequency.ranks(counter)
Return a dictionary with key=token, value=rank.
- Parameters
counter – (collections.Counter)
- Returns
dict
- calculus.stats.frequency.tfidf(documents, item)
Return the tf.idf of an item.
Term frequency–inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The tf.idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
- Parameters
documents – a list of list of entries.
item –
- Returns
float
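The exact weighting used by tfidf is not stated here, so the sketch below makes assumptions. The term frequency is computed over all documents pooled together, the inverse document frequency is the natural logarithm of (number of documents / number of documents containing the item), and an unseen item yields 0.0. The name tfidf_sketch is hypothetical.

```python
import math


def tfidf_sketch(documents, item):
    """Return tf * idf for item, under the assumptions stated above."""
    all_entries = [entry for doc in documents for entry in doc]
    if len(all_entries) == 0:
        return 0.0
    # Term frequency of the item in the pooled corpus (assumption).
    tf = all_entries.count(item) / float(len(all_entries))
    # Number of documents with at least one occurrence of the item.
    docs_with_item = sum(1 for doc in documents if item in doc)
    if docs_with_item == 0:
        return 0.0
    idf = math.log(len(documents) / float(docs_with_item))
    return tf * idf


docs = [["a", "b"], ["a", "c"], ["a", "c", "c"]]
print(tfidf_sketch(docs, "a"))  # 0.0: "a" occurs in every document
print(tfidf_sketch(docs, "c"))  # positive: frequent but not in every document
```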
- calculus.stats.frequency.zipf(dict_ranks, item)
Return the Zipf Law value of an item.
Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
- Parameters
dict_ranks – (dict) is a dictionary with key=entry, value=rank.
item – (any) is an entry of the ranks dictionary
- Returns
Zipf value or -1 if the entry is missing
calculus.stats.linregress module
- filename
sppas.src.calculus.stats.linregress.py
- author
Brigitte Bigi
- contact
- summary
Linear regression functions for Python.
The goal of linear regression is to fit a line to a set of points. The equation of the line is y = mx + b, where m is the slope and b is the y-intercept.
- calculus.stats.linregress.compute_error_for_line_given_points(b, m, points)
Error function (also called a cost function).
It measures how “good” a given line is.
This function will take in a (m,b) pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each (x,y) point in our data set and sum the square distances between each point’s y value and the candidate line’s y value (computed at mx + b).
Lines that fit our data better (where better is defined by our error function) will result in lower error values.
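The error described above can be sketched as a sum of squared vertical distances. The name squared_error_sketch is hypothetical. Whether the library also divides by the number of points is not stated here, so no normalisation is applied.

```python
def squared_error_sketch(b, m, points):
    """Sum the squared distances between each y and the line value m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


sample_points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points on y = 2x + 1
print(squared_error_sketch(1.0, 2.0, sample_points))  # 0.0, a perfect fit
print(squared_error_sketch(0.0, 0.0, sample_points))  # 35.0, a poor fit
```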
- calculus.stats.linregress.gradient_descent(points, starting_b, starting_m, learning_rate, num_iterations)
Gradient descent is an algorithm that minimizes functions.
Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.
- Parameters
points – a list of tuples (x,y) of float values.
starting_b – (float)
starting_m – (float)
learning_rate – (float)
num_iterations – (int)
- Returns
intercept, slope
- calculus.stats.linregress.gradient_descent_linear_regression(points, num_iterations=50000)
Gradient descent method for linear regression.
Adapted from: http://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/
- Parameters
points – a list of tuples (x,y) of float values.
num_iterations – (int)
- Returns
intercept, slope
- calculus.stats.linregress.step_gradient(b_current, m_current, points, learning_rate)
One step of gradient descent for linear regression.
To run gradient descent on an error function, we first need to compute its gradient. The gradient will act like a compass and always point us downhill. To compute it, we will need to differentiate our error function. Since our function is defined by two parameters (m and b), we will need to compute a partial derivative for each.
Each iteration will update m and b to a line that yields slightly lower error than the previous iteration.
The learning_rate variable controls how large of a step we take downhill during each iteration. If we take too large of a step, we may step over the minimum. However, if we take small steps, it will require many iterations to arrive at the minimum.
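A single step and the surrounding loop can be sketched as follows. The function names are hypothetical. The squared error is averaged over the points here, which is an assumption: averaging instead of summing only rescales the effective learning rate. The two partial derivatives are those of the averaged squared error with respect to b and m.

```python
def step_gradient_sketch(b_current, m_current, points, learning_rate):
    """Move (b, m) one step against the gradient of the mean squared error."""
    n = float(len(points))
    b_gradient = 0.0
    m_gradient = 0.0
    for x, y in points:
        residual = y - (m_current * x + b_current)
        b_gradient += -(2.0 / n) * residual      # partial derivative w.r.t. b
        m_gradient += -(2.0 / n) * x * residual  # partial derivative w.r.t. m
    new_b = b_current - learning_rate * b_gradient
    new_m = m_current - learning_rate * m_gradient
    return new_b, new_m


def gradient_descent_sketch(points, starting_b, starting_m,
                            learning_rate, num_iterations):
    """Repeat the step num_iterations times and return (intercept, slope)."""
    b, m = starting_b, starting_m
    for _ in range(num_iterations):
        b, m = step_gradient_sketch(b, m, points, learning_rate)
    return b, m


line_points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # points on y = 2x + 1
intercept, slope = gradient_descent_sketch(line_points, 0.0, 0.0, 0.05, 5000)
print(round(intercept, 3), round(slope, 3))  # 1.0 2.0
```

With a learning rate that is too large for the data (for these three points, above roughly 0.4), the same loop diverges instead of converging, which is the overshooting described above.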
- calculus.stats.linregress.tansey_linear_regression(points)
Linear regression, as proposed in AnnotationPro.
Translated from C# code from here: https://gist.github.com/tansey/1375526
- Parameters
points – a list of tuples (x,y) of float values.
- Returns
intercept, slope
- calculus.stats.linregress.tga_linear_regression(points)
Linear regression as proposed in TGA, by Dafydd Gibbon.
http://wwwhomes.uni-bielefeld.de/gibbon/TGA/
- Parameters
points – a list of tuples (x,y) of float values.
- Returns
intercept, slope
calculus.stats.moment module
- filename
sppas.src.calculus.stats.moment.py
- author
Brigitte Bigi
- contact
- summary
A collection of basic statistical functions for Python.
- calculus.stats.moment.lkurtosis(items)
Return the kurtosis of a distribution.
The kurtosis represents a measure of the “peakedness”: a high kurtosis distribution has a sharper peak and fatter tails, while a low kurtosis distribution has a more rounded peak and thinner tails.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.moment.lmoment(items, moment=1)
Calculate the r-th moment about the mean for a sample.
1/n * SUM((items(i)-mean)**r), where r is the value of the moment parameter.
- Parameters
items – (list) list of data values
moment – the order r of the moment
- Returns
(float)
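The formula 1/n * SUM((items(i)-mean)**r) can be sketched directly. The name central_moment_sketch is hypothetical and the helper is only an illustration of the formula.

```python
def central_moment_sketch(items, moment=1):
    """Return the moment-th moment about the mean of the data values."""
    if len(items) == 0:
        raise ValueError("items must not be empty")
    n = float(len(items))
    mean = sum(items) / n
    return sum((x - mean) ** moment for x in items) / n


values = [1, 2, 3, 4]
print(central_moment_sketch(values, 1))  # 0.0: deviations from the mean cancel out
print(central_moment_sketch(values, 2))  # 1.25: the population variance
```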
- calculus.stats.moment.lskew(items)
Calculate the skewness of a distribution.
The skewness is a measure of asymmetry: its sign indicates whether deviations from the mean tend to be positive or negative.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.moment.lvariation(items)
Calculate the coefficient of variation of data values.
It shows the extent of variability in relation to the mean. It is a standardized measure of dispersion, stdev / mean, returned as a percentage.
- Parameters
items – (list) list of data values
- Returns
(float)
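The ratio stdev / mean, expressed as a percentage, can be sketched as below. The name is hypothetical. The sketch assumes the population standard deviation (denominator N); the library may use the sample estimator instead.

```python
import math


def coefficient_of_variation_sketch(items):
    """Return 100 * stdev / mean, using the population stdev (assumption)."""
    n = float(len(items))
    mean = sum(items) / n
    if mean == 0:
        raise ValueError("the mean must not be zero")
    variance = sum((x - mean) ** 2 for x in items) / n
    return 100.0 * math.sqrt(variance) / mean


print(coefficient_of_variation_sketch([2, 4, 4, 4, 5, 5, 7, 9]))  # approximately 40.0
```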
calculus.stats.variability module
- filename
sppas.src.calculus.stats.variability.py
- author
Brigitte Bigi
- contact
- summary
Variance estimators.
- calculus.stats.variability.lstdev(items)
Calculate the standard deviation of the data values, for a population.
The standard deviation is the positive square root of the variance.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.variability.lsterr(items)
Calculate the standard error of the data values.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.variability.lunbiasedstdev(items)
Calculate the standard deviation of the data values, for a sample.
The standard deviation is the positive square root of the variance.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.variability.lunbiasedvariance(items)
Calculate the unbiased variance of the data values, for a sample.
The estimate uses N-1 as the denominator. The variance is a measure of dispersion around the mean.
- Parameters
items – (list) list of data values
- Returns
(float)
- calculus.stats.variability.lvariance(items)
Calculate the variance of the data values, for a population.
The estimate uses N as the denominator. The variance is a measure of dispersion around the mean.
- Parameters
items – (list) list of data values
- Returns
(float)
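The only difference between the population and the sample estimators is the denominator, N versus N-1. The sketch below uses hypothetical names and an unbiased flag that is not part of the module. The standard deviation is the positive square root in both cases.

```python
import math


def variance_sketch(items, unbiased=False):
    """Return the variance with denominator N, or N-1 if unbiased is True."""
    n = len(items)
    if n == 0 or (unbiased and n < 2):
        raise ValueError("not enough data values")
    mean = sum(items) / float(n)
    squared_deviations = sum((x - mean) ** 2 for x in items)
    return squared_deviations / float(n - 1 if unbiased else n)


def stdev_sketch(items, unbiased=False):
    """Return the positive square root of the variance."""
    return math.sqrt(variance_sketch(items, unbiased))


scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_sketch(scores))                 # 4.0 (population, N)
print(variance_sketch(scores, unbiased=True))  # 32/7, about 4.571 (sample, N-1)
print(stdev_sketch(scores))                    # 2.0
```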
- calculus.stats.variability.lz(items, score)
Calculate the z-score for a given input score, given that score and the data values from which that score came.
The z-score determines the relative location of a data value.
- Parameters
items – (list) list of data values
score – (float) one of the data values
- Returns
(float)
- calculus.stats.variability.lzs(items)
Calculate a list of z-scores, one for each score in the data values.
- Parameters
items – (list) list of data values
- Returns
(list)
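A z-score expresses a value as a number of standard deviations from the mean. The sketch below uses hypothetical names and assumes the population standard deviation (denominator N), which may differ from the estimator used by the module.

```python
import math


def z_score_sketch(items, score):
    """Return (score - mean) / stdev, using the population stdev (assumption)."""
    n = float(len(items))
    mean = sum(items) / n
    stdev = math.sqrt(sum((x - mean) ** 2 for x in items) / n)
    if stdev == 0:
        raise ValueError("all data values are identical")
    return (score - mean) / stdev


def z_scores_sketch(items):
    """Return one z-score per data value."""
    return [z_score_sketch(items, x) for x in items]


observations = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score_sketch(observations, 9))  # 2.0: two standard deviations above the mean
print(z_scores_sketch(observations))
```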