Which method does Pandas use for computing the variance of a Series?
For example, using Pandas (v0.14.1):
pandas.Series(numpy.repeat(500111,2000000)).var()
12.579462289731145
Obviously due to some numeric instability. However, in R we get:
var(rep(500111,2000000))
0
I wasn't able to make enough sense of the Pandas source-code to figure out what algorithm it uses.
This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Update: To summarize the comments below: if the Python bottleneck package (fast NumPy array functions) is installed, a stabler two-pass algorithm along the lines of ((arr - arr.mean())**2).mean() is used and gives 0.0 (as indicated by @Jeff); whereas if it is not installed, the naive implementation indicated by @BrenBarn is used.
The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:
return np.fabs((XX - X ** 2 / count) / d)
This is the "naive" implementation at the beginning of the Wikipedia article you mention. (d will be set to N-1 in the default case.)
The behavior you're seeing appears to be due to the sum of squared values overflowing or losing precision in the NumPy datatypes used. It's not an issue of how the variance is mathematically defined per se.
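To make the difference concrete, here is a minimal sketch (not pandas' actual code) contrasting the naive one-pass formula with the two-pass formula on the data from the question; the exact one-pass output will vary, but it is typically nonzero:
import numpy as np

arr = np.repeat(500111.0, 2000000)
n = arr.size

# One-pass ("naive") formula: accumulate the sum and sum of squares,
# then compute E[X^2] - E[X]^2. The two huge, nearly equal terms
# cancel catastrophically in float64.
s = 0.0
ss = 0.0
for x in arr:
    s += x
    ss += x * x
naive_var = (ss - s * s / n) / (n - 1)

# Two-pass formula: subtract the mean first, so every squared
# deviation is exactly 0.0 for this data.
two_pass_var = ((arr - arr.mean()) ** 2).sum() / (n - 1)

print(naive_var)     # typically a nonzero value from rounding error
print(two_pass_var)  # 0.0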
I don't know the answer, but it seems related to how Series are stored, not necessarily the var function.
np.var(pd.Series(np.repeat(100000000, 100000)))
26848.788479999999
np.var(np.repeat(100000000, 100000))
0.0
Using Pandas 0.11.0.
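One way to probe that hypothesis (a sketch only; I have not verified it against Pandas 0.11.0): check the dtype the Series actually stores, and force the reduction through a float64 ndarray, which uses NumPy's two-pass implementation:
import numpy as np
import pandas as pd

s = pd.Series(np.repeat(100000000, 100000))
print(s.dtype)  # int64 on most platforms; the sum of squares exceeds the int64 range

# Going through a float64 ndarray avoids both the pandas dispatch and
# any integer overflow:
print(np.var(s.values.astype(np.float64)))  # 0.0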
Related
I want to translate this MATLAB code into Python. I think I did everything right, yet I don't get the same results.
MATLAB script:
n = 2                      % Filter order
Wn = [0.4 0.6]             % Normalized cutoff frequencies
[b, a] = butter(n, Wn, 'bandpass')  % Transfer function coefficients of the filter
Python script:
import numpy as np
from scipy import signal
n = 2                          # Filter order
Wn = np.array([0.4, 0.6])      # Normalized cutoff frequencies
b, a = signal.butter(n, Wn, btype='band')  # Transfer function coefficients of the filter
a coefficients in MATLAB: 1, -5.55e-16, 1.14, -1.66e-16, 0.41
a coefficients in Python: 1, -2.77e-16, 1.14, -1.94e-16, 0.41
Could it just be a question of precision, since the two values that differ (the 2nd and 4th) are both on the order of 10^(-16)?
The b coefficients are the same on the other hand.
Your machine precision is about 1e-16 (in MATLAB this can be checked easily with eps(); it is about the same in Python). The 'error' you are dealing with is thus on the order of machine precision, i.e. below anything that can be meaningfully resolved at this precision.
Also of note is that MATLAB ~= Python (or != in Python): the implementations of butter() on one hand and signal.butter() on the other will be slightly different, even if you feed them the exact same numbers, because they do not perform identical sequences of floating-point operations.
It rarely matters when coefficients differ by 16 orders of magnitude; the smaller ones are essentially negligible. In case you do need exact values, consider using either symbolic math or some kind of variable-precision arithmetic (vpa() in MATLAB), but I suspect that in your case the difference is irrelevant.
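For reference, the Python/NumPy analogue of MATLAB's eps() is a one-liner:
import numpy as np

# Machine epsilon for float64, the default floating-point type:
print(np.finfo(float).eps)  # about 2.22e-16
Coefficients around 1e-16 sit at this noise floor relative to the O(1) coefficients, so they are numerically indistinguishable from zero.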
I plan to use Python to solve the following task.
There is an equation:
E = n[1]*W[1] + n[2]*W[2] + ... + n[N]*W[N]
where W[i] and E are known, fixed values and the n[i] are integer variables.
I need to find all combinations of n[i] and write them.
How can I do it using NumPy in Python?
Looks like a Diophantine equation.
There is no support for this in numpy/scipy, and the usual suspect, integer programming (which can be used to solve this), is not available within scipy either!
The general case is NP-hard!
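That said, if N is small and you can bound each n[i], a brute-force search is trivial to write. A minimal sketch; the weights W, target E, and the bound are made-up illustrations, not values from the question:
import itertools
import numpy as np

W = np.array([2.0, 3.0, 5.0])  # hypothetical weights
E = 19.0                       # hypothetical target
BOUND = 10                     # assume 0 <= n[i] <= BOUND

# Enumerate every candidate tuple and keep those that hit the target
# (np.isclose guards against floating-point round-off in the dot product).
solutions = [n for n in itertools.product(range(BOUND + 1), repeat=len(W))
             if np.isclose(np.dot(n, W), E)]
print(solutions)
The search space grows as BOUND**N, which is exactly why the general problem is hard.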
Scipy's ttest_1samp returns a tuple, with the t-statistic, and the two-tailed p-value.
Example:
ttest_1samp([0,1,2], 0) = (array(1.7320508075688774), 0.22540333075851657)
But I'm only interested in the float of the t-test (the t-statistic), which I have only been able to get by using [0].ravel()[0]
Example:
ttest_1samp([0,1,2], 0)[0].ravel()[0] = 1.732
However, I'm quite sure there must be a more pythonic way to do this. What is the best way to get the float from this output?
To expound on @miradulo's answer: if you use a newer version of scipy (release 0.14.0 or later), you can reference the statistic field of the returned namedtuple. Referencing this way is Pythonic and simplifies the code, as there is no need to remember specific indices.
Code
from scipy.stats import ttest_1samp

res = ttest_1samp(range(3), 0)
print(res.statistic)
print(res.pvalue)
Output
1.73205080757
0.225403330759
From the source code, scipy.stats.ttest_1samp returns nothing more than a namedtuple Ttest_1sampResult with the statistic and p-value. Hence, you do not need to use .ravel - you can simply use
scipy.stats.ttest_1samp([0,1,2], 0)[0]
to access the statistic.
Note:
From a further look at the source code, it is clear that this namedtuple only began being returned in release 0.14.0. In release 0.13.0 and earlier, it appears that a zero dim array is returned (source code), which for all intents and purposes can act just like a plain number as mentioned by BrenBarn.
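If your code has to run on both old and new releases, plain tuple unpacking is a version-agnostic option, since it works on the plain tuple, the zero-dim-array variant, and the namedtuple alike:
from scipy.stats import ttest_1samp

t_stat, p_value = ttest_1samp([0, 1, 2], 0)
print(t_stat)  # 1.7320508075688774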
You can get the results with desired formats this way:
print ("The t-statistic is %.3f and the p-value is %.3f." % stats.ttest_1samp([0,1,2], 0))
Output:
The t-statistic is 1.732 and the p-value is 0.225.
I have a vector of floats (coming from an operation on an array) and a float value (which is actually an element of the array, but that's unimportant), and I need to find the smallest float out of them all.
I'd love to be able to find the minimum between them in one line in a 'Pythony' way.
MinVec = N[i,:] + N[:,j]
Answer = min(min(MinVec),N[i,j])
Clearly I'm performing two minimisation calls, and I'd love to be able to replace this with one call. Perhaps I could eliminate the vector MinVec as well.
As an aside, this is for a short program in Dynamic Programming.
TIA.
EDIT: My apologies, I didn't specify I was using numpy. The variable N is an array.
You can append the value, then minimize. I'm not sure what the relative time considerations of the two approaches are, though - I wouldn't necessarily assume this is faster:
Answer = min(np.append(MinVec, N[i, j]))
This is the same idea as the answer above but without using numpy (assuming MinVec is a plain Python list). Note that list.append returns None, so it cannot be nested inside the min call:
MinVec.append(N[i, j])
Answer = min(MinVec)
cost = 0
for i in range(12):
    cost = cost + math.pow(float(float(q[i]) - float(w[i])), 2)
cost = math.sqrt(cost)
Is there any faster alternative to this? I need to improve my entire code, so I am trying to improve the performance of each statement.
Thank you.
In addition to the general optimization remarks that are already made (and to which I subscribe), there is a more "optimized" way of doing what you want: you manipulate arrays of values and combine them mathematically. This is a job for the very useful and widely used NumPy package!
Here is how you would do it:
import math
import numpy

q_array = numpy.array(q, dtype=float)
w_array = numpy.array(w, dtype=float)
cost = math.sqrt(((q_array - w_array) ** 2).sum())
(If your arrays q and w already contain floats, you can remove the dtype=float.)
This is almost as fast as it can get, since NumPy's operations are optimized for arrays. It is also much more legible than a loop, because it is both simple and short.
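As a side note, assuming the q_array and w_array from the snippet above, the same quantity can be written with NumPy's built-in Euclidean norm (the default for numpy.linalg.norm on a 1-D array):
import numpy

cost = numpy.linalg.norm(q_array - w_array)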
Just a hint, but usually real performance improvements come when you evaluate the code at a function or even higher level.
During a good evaluation, you may find whole blocks that could be thrown away or rewritten to simplify the process.
Profilers are useful AFTER you've cleaned up crufty, not-very-legible code. Irrespective of whether it's to be run once or N zillion times, you should not write code like that.
Why are you doing float(q[i]) and float(w[i])? What type(s) are the elements of q and w?
If x and y are floats, then x - y will be a float too, so that's 3 apparently redundant occurrences of float() already.
Calling math.pow() instead of using the ** operator bears the overhead of lookups on 'math' and 'pow'.
Etc etc
See if the following code gives the same answers and reads better and is faster:
costsq = 0.0
for i in xrange(12):
    costsq += (q[i] - w[i]) ** 2
cost = math.sqrt(costsq)
After you've tested that and understood why the changes were made, you can apply the lessons to other Python code. Then if you have a lot more array or matrix work to do, consider using numpy.
Assuming q and w contain numbers, the conversions to float are not necessary; otherwise, you should convert the lists to a usable representation earlier (and separately from your calculation).
Given that your function seems to only be doing the equivalent of this:
cost = sum((qi - wi) ** 2 for qi, wi in zip(q[:12], w)) ** 0.5
Perhaps this form would execute faster.
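When micro-optimizing like this, it pays to measure rather than guess. A minimal timeit sketch with made-up 12-element data (absolute numbers will vary by machine):
import math
import timeit

q = [float(i) for i in range(12)]
w = [float(i) + 0.5 for i in range(12)]

def loop_version():
    cost = 0.0
    for i in range(12):
        cost += (q[i] - w[i]) ** 2
    return math.sqrt(cost)

def genexp_version():
    return sum((qi - wi) ** 2 for qi, wi in zip(q[:12], w)) ** 0.5

print(timeit.timeit(loop_version, number=100000))
print(timeit.timeit(genexp_version, number=100000))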