SciPy's ttest_1samp returns a tuple containing the t-statistic and the two-tailed p-value.
Example:
ttest_1samp([0,1,2], 0) = (array(1.7320508075688774), 0.22540333075851657)
But I'm only interested in the t-statistic as a plain float, which I have only been able to get by using [0].ravel()[0]
Example:
ttest_1samp([0,1,2], 0)[0].ravel()[0] = 1.732
However, I'm quite sure there must be a more pythonic way to do this. What is the best way to get the float from this output?
To expound on #miradulo's answer, if you use a newer version of scipy (release 0.14.0 or later), you can reference the statistic field of the returned namedtuple. Referencing this way is Pythonic and simplifies the code as there is no need to remember specific indices.
Code
from scipy.stats import ttest_1samp

res = ttest_1samp(range(3), 0)
print(res.statistic)
print(res.pvalue)
Output
1.73205080757
0.225403330759
From the source code, scipy.stats.ttest_1samp returns nothing more than a namedtuple Ttest_1sampResult with the statistic and p-value. Hence, you do not need to use .ravel - you can simply use
scipy.stats.ttest_1samp([0,1,2], 0)[0]
to access the statistic.
Note:
From a further look at the source code, it is clear that this namedtuple only began being returned in release 0.14.0. In release 0.13.0 and earlier, it appears that a zero-dimensional array is returned (source code), which for all intents and purposes acts just like a plain number, as mentioned by BrenBarn.
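If you want a plain Python float with code that works on both older and newer releases, you can also unpack the returned pair directly and convert explicitly. A minimal sketch (the float() conversion is just a suggestion, not required by the API):

from scipy.stats import ttest_1samp

# Works whether the result is a plain tuple (pre-0.14.0) or the
# Ttest_1sampResult namedtuple (0.14.0 and later).
t_stat, p_value = ttest_1samp([0, 1, 2], 0)
print(float(t_stat))   # 1.7320508075688774
print(float(p_value))  # 0.22540333075851657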
You can format the results as desired like this:
from scipy import stats

print("The t-statistic is %.3f and the p-value is %.3f." % stats.ttest_1samp([0,1,2], 0))
Output:
The t-statistic is 1.732 and the p-value is 0.225.
scipy.stats.spearmanr([1,2,3,4,1],[1,2,2,1,np.nan],nan_policy='omit')
It gives a Spearman correlation of 0.349999.
My understanding is that nan_policy='omit' will discard all pairs that contain a NaN. If that's the case, the result should be the same as scipy.stats.spearmanr([1,2,3,4],[1,2,2,1]).
However, that gives a correlation of 0.235702.
Why are they different? Is my understanding of nan_policy='omit' correct?
I tried to run your code, and it gives me zero correlation (R=0.0).
I use this function, and your understanding of nan_policy='omit' is correct.
If you don't need the p-value of the correlation, I would suggest using .corr(method='spearman') from the pandas library. By default it excludes NA/null values, as shown in the sketch below.
Official Documentation
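As a quick illustration of that pandas approach, using the same data as the question:

import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 1])
s2 = pd.Series([1, 2, 2, 1, np.nan])

# Series.corr drops NA/null pairs by default, so this computes the
# Spearman correlation over the four complete pairs only.
print(s1.corr(s2, method='spearman'))  # 0.0 for this data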
nan_policy='omit' should completely omit those pairs for which one or both values are nan. When I run the two commands you pasted above, I get the same correlation value, not different ones.
I'm generating a series of random floats using this line:
random.random()*(maxval-minval) + minval
I'm using it to add variable noise to a given variable, and the amount of noise added depends on a series of factors. In some cases, the noise should be so high that in practice the original value is lost, and I have a completely random value.
In this context, the code works with finite values, but if I use "inf" it returns NaN. Is there a workaround to allow a continuous random range that might include infinity? I don't want to tamper with os.random() as it is machine-specific.
If you define a uniform random distribution over an infinite domain, the probability of any value in the domain being chosen is infinitesimal. What you're asking for doesn't make any mathematical sense.
As was said before, you can't have a uniform distribution over the whole real line, but you can use other random distributions whose support is the whole real line. Consider the Cauchy distribution. It has 'heavy tails', which simply means that there is a decent probability of getting very big numbers.
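For example, a minimal sketch of drawing Cauchy-distributed noise with NumPy (shifting and scaling it around your original value is up to you):

import numpy as np

# standard_cauchy is supported on the whole real line and has heavy tails,
# so samples of very large magnitude occur with non-negligible probability.
samples = np.random.standard_cauchy(5)
print(samples)

# To centre the noise on an existing value, shift and scale it, e.g.
# noisy = value + scale * np.random.standard_cauchy()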
After the discussion in the comments, I suggest the following:
>>> import sys
>>> import numpy as np
>>> m = sys.maxint
>>> np.random.uniform(-m, m, 5)
array([ -5.32362215e+18,  -2.90131323e+18,   5.14492175e+18,
        -5.64238742e+18,  -3.49640768e+18])
As said, you can get the maximum integer with sys.maxint and then use np.random.uniform to get random floats between -maxint and maxint.
As #Asad says, what you are trying is mathematically not quite sound. But what you could do is the following:
define a very big number (maybe this post helps: What is the range of values a float can have in Python?)
use random.uniform(0, biggestValue) as an approximation for random values according to your needs.
Maybe this is what you are looking for.
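A concrete sketch of that suggestion, using sys.float_info.max (the largest representable float, roughly 1.8e308) as the "very big number":

import random
import sys

# Largest finite float; using it as the upper bound approximates an
# "unbounded" noise range without ever producing inf or NaN.
biggest_value = sys.float_info.max

noise = random.uniform(0.0, biggest_value)
print(noise)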
Which method does Pandas use for computing the variance of a Series?
For example, using Pandas (v0.14.1):
pandas.Series(numpy.repeat(500111,2000000)).var()
12.579462289731145
Obviously this is due to some numeric instability. However, in R we get:
var(rep(500111,2000000))
0
I wasn't able to make enough sense of the Pandas source-code to figure out what algorithm it uses.
This link may be useful: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Update: To summarize the comments below - If the Python bottleneck package for fast NumPy array functions is installed, a stabler two-pass algorithm similar to np.sqrt(((arr - arr.mean())**2).mean()) is used and gives 0.0 (as indicated by #Jeff); whereas if it is not installed, the naive implementation indicated by #BrenBarn is used.
The algorithm can be seen in nanops.py, in the function nanvar, the last line of which is:
return np.fabs((XX - X ** 2 / count) / d)
This is the "naive" implementation at the beginning of the Wikipedia article you mention. (d will be set to N-1 in the default case.)
The behavior you're seeing appears to be due to the sum of squared values overflowing the numpy datatypes. It's not an issue of how the variance is calculated per se.
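For reference, here is a small sketch contrasting the one-pass formula from the nanvar source quoted above with the textbook two-pass formula on the same constant data (true variance 0). Whether the one-pass version comes out exactly 0.0 depends on dtype and accumulation order; the two-pass version subtracts the mean first, so the squared deviations are exactly zero:

import numpy as np

arr = np.repeat(500111, 2000000).astype('f8')
n = arr.size
d = n - 1

# One-pass ("naive") formula, as in the nanvar source quoted above:
# it subtracts two huge intermediate sums from each other, so precision
# can be lost (and with integer dtypes X ** 2 can overflow int64).
X = arr.sum()
XX = (arr ** 2).sum()
one_pass = np.fabs((XX - X ** 2 / n) / d)

# Two-pass formula: compute the mean first, then sum squared deviations.
# For this constant array the deviations are exactly 0.0.
two_pass = ((arr - arr.mean()) ** 2).sum() / d

print(one_pass, two_pass)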
I don't know the answer, but it seems related to how Series are stored, not necessarily the var function.
np.var(pd.Series(np.repeat(100000000, 100000)))
26848.788479999999
np.var(np.repeat(100000000, 100000))
0.0
Using Pandas 0.11.0.
I have a vector of floats (coming from an operation on an array) and a float value (which is actually an element of the array, but that's unimportant), and I need to find the smallest float out of them all.
I'd love to be able to find the minimum between them in one line in a 'Pythony' way.
MinVec = N[i,:] + N[:,j]
Answer = min(min(MinVec),N[i,j])
Clearly I'm performing two minimisation calls, and I'd love to be able to replace this with one call. Perhaps I could eliminate the vector MinVec as well.
As an aside, this is for a short program in Dynamic Programming.
TIA.
EDIT: My apologies, I didn't specify I was using numpy. The variable N is an array.
You can append the value, then minimize. I'm not sure what the relative time considerations of the two approaches are, though - I wouldn't necessarily assume this is faster:
Answer = min(np.append(MinVec, N[i, j]))
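A self-contained version of that approach, with made-up data for N, i, and j just to make it runnable:

import numpy as np

# Hypothetical example data; in the original problem N comes from the
# dynamic-programming table.
N = np.array([[4.0, 2.5, 7.0],
              [3.0, 9.0, 1.5],
              [6.0, 0.5, 8.0]])
i, j = 0, 2

MinVec = N[i, :] + N[:, j]

# Append the scalar, then take a single min over the combined values.
Answer = min(np.append(MinVec, N[i, j]))
print(Answer)  # 4.0 here: min of MinVec == [11.0, 4.0, 15.0] and N[0, 2] == 7.0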
This is the same thing as the answer above but without using numpy, assuming MinVec is a plain list (note that list.append returns None, so you can't pass it straight to min):
Answer = min(MinVec + [N[i, j]])
Besides the correct language ID, langid.py returns a certain value - "The value returned is a score for the language. It is not a probability estimate, as it is not normalized by the document probability since this is unnecessary for classification."
But what does the value mean?
I'm actually the author of langid.py. Unfortunately, I've only just spotted this question now, almost a year after it was asked. I've tidied up the handling of the normalization since this question was asked, so all the README examples have been updated to show actual probabilities.
The value that you see there (and that you can still get by turning normalization off) is the un-normalized log-probability of the document. Because log/exp are monotonic, we don't actually need to compute the probability to decide the most likely class. The actual value of this log-prob is not actually of any use to the user. I should probably have never included it, and I may remove its output in the future.
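If you do want a normalized confidence rather than the raw log-probability, recent langid.py versions expose it through the LanguageIdentifier class. The snippet below follows my recollection of the README, so treat the exact names and arguments as an assumption and check your installed version:

import langid
from langid.langid import LanguageIdentifier, model

# Default classify(): returns a (language, score) pair.
print(langid.classify("This is a test sentence."))

# Explicitly request normalized (0-1) probabilities; the norm_probs flag
# is an assumption based on the README and may differ between releases.
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print(identifier.classify("This is a test sentence."))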
I think this is the important chunk of langid.py code:
def nb_classify(fv):
    # compute the log-factorial of each element of the vector
    logfv = logfac(fv).astype(float)
    # compute the probability of the document given each class
    pdc = np.dot(fv,nb_ptc) - logfv.sum()
    # compute the probability of the document in each class
    pd = pdc + nb_pc
    # select the most likely class
    cl = np.argmax(pd)
    # turn the pd into a probability distribution
    pd /= pd.sum()
    return cl, pd[cl]
It looks to me that the author is calculating something like the multinomial log-posterior of the data for each of the possible languages. logfv calculates the logarithm of the denominator of the PMF (x_1!...x_k!). np.dot(fv,nb_ptc) calculates the logarithm of the p_1^x_1...p_k^x_k term. So, pdc looks like the list of language conditional log-likelihoods (except that it's missing the n! term). nb_pc looks like the prior probabilities, so pd would be the log-posteriors. The normalization line, pd /= pd.sum() confuses me, since one usually normalizes probability-like values (not log-probability values); also, the examples in the documentation (('en', -55.106250761034801)) don't look like they've been normalized---maybe they were generated before the normalization line was added?
Anyway, the short answer is that this value, pd[cl] is a confidence score. My understanding based on the current code is that they should be values between 0 and 1/97 (since there are 97 languages), with a smaller value indicating higher confidence.
Looks like a value that tells you how certain the engine is that it guessed the correct language for the document. I think generally the closer to 0 the number, the more sure it is, but you should be able to test that by mixing languages together and passing them in to see what values you get out. It allows you to fine tune your program when using langid depending upon what you consider 'close enough' to count as a match.