R to Python t.test function conversion

Can someone help me transform this R t.test call to Python?
R code:
t.test(y, mu = 85, paired = FALSE, var.equal = TRUE, alternative = "greater")

You are testing a single sample y against a population mean mu, so the corresponding function from SciPy is scipy.stats.ttest_1samp. When a second sample is not given to t.test, var.equal and paired are irrelevant, so the only other parameter to deal with is alternative, and the SciPy function also takes an alternative parameter. So the Python code is
from scipy.stats import ttest_1samp
result = ttest_1samp(y, mu, alternative='greater')
Note that ttest_1samp returns the t statistic (result.statistic) and the p-value (result.pvalue).
For example, here is a calculation in R:
> x = c(3, 1, 4, 1, 5, 9)
> result = t.test(x, mu=2, alternative='greater')
> result$statistic
t
1.49969
> result$p.value
[1] 0.09699043
Here's the corresponding calculation in Python:
In [14]: x = [3, 1, 4, 1, 5, 9]
In [15]: result = ttest_1samp(x, 2, alternative='greater')
In [16]: result.statistic
Out[16]: 1.499690178660333
In [17]: result.pvalue
Out[17]: 0.0969904256712105
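R's t.test also prints a confidence interval. In sufficiently recent SciPy (1.10 or later; an assumption worth checking against your installed version), the result object from ttest_1samp exposes a confidence_interval() method that respects the chosen alternative, so that part of the R output can be recovered too:
from scipy.stats import ttest_1samp

x = [3, 1, 4, 1, 5, 9]
result = ttest_1samp(x, 2, alternative='greater')
ci = result.confidence_interval(confidence_level=0.95)
# for alternative='greater' the interval is one-sided: (ci.low, inf)
print(ci.low, ci.high)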

You may find this blog useful: https://www.reneshbedre.com/blog/ttest.html
Below is an example of the conversion using the bioinfokit package, but you can use the SciPy one.
# Perform one sample t-test using bioinfokit
# Doc: https://github.com/reneshbedre/bioinfokit
from bioinfokit.analys import stat, get_data

df = get_data("t_one_samp").data  # replace this with your data file
res = stat()
res.ttest(df=df, test_type=1, res='size', mu=5, evar=True)
print(res.summary)
Output:
One Sample t-test
------------------  --------
Sample size         50
Mean                5.05128
t                   0.36789
Df                  49
P-value (one-tail)  0.35727
P-value (two-tail)  0.71454
Lower 95.0%         4.77116
Upper 95.0%         5.3314
------------------  --------
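The two-tailed numbers in that table can be cross-checked against SciPy directly (assuming the example frame df and its 'size' column from the snippet above); for a t-test, the one-tail p-value is simply half the two-tail value when the sample mean lies on the hypothesized side:
from scipy.stats import ttest_1samp

# df and the 'size' column come from the bioinfokit snippet above
res = ttest_1samp(df['size'], popmean=5)  # two-sided by default
print(res.statistic, res.pvalue)          # should match t and P-value (two-tail)
print(res.pvalue / 2)                     # one-tail p, given the sign matches the hypothesis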

Related

Python Jackknife Approach: Apply a function to a dataframe with one row removed each time

Trying to create a simple jackknife of some regression coefficients. For simplicity, I am including sample data and the code I am using. My issue is that the regression coefficients produced by the jackknife are all identical, with no variation. Not sure what I am missing.
# the libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm

# generate dataset
df = np.random.randint(5, size=(20, 4))
df[:10]

# the regression function
def smREG(data):
    X = df[:, 1:]
    y = df[:, 0]
    mod = sm.OLS(y, sm.add_constant(X)).fit()
    return mod.params

# call reg function
smREG(df)
# the jackknife function
def simple_jackknife(data, fn):
    jack_sample = {}
    jack_reps = {}
    for i in range(len(data)):
        # delete row 'i' from the data
        jack_sample[i] = np.delete(arr=data, obj=i, axis=0)
        # run the function with row 'i' deleted
        jack_reps[i] = fn(jack_sample[i])
    return jack_reps
# call jackknife function
out = simple_jackknife(data=df, fn=smREG)
out
# just showing the first three of sample output
{0: array([ 2.33483583, -0.06541933, -0.00764364, -0.16846914]),
1: array([ 2.33483583, -0.06541933, -0.00764364, -0.16846914]),
2: array([ 2.33483583, -0.06541933, -0.00764364, -0.16846914])}
My issue is that I am not sure why the regression coefficients are not changing, even infinitesimally, after a row is deleted.
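One likely culprit, judging from the code as posted: smREG never uses its data argument; it closes over the global df, so every jackknife replicate fits the identical full dataset. A minimal corrected sketch:
import numpy as np
import statsmodels.api as sm

def smREG(data):
    # use the function argument, not the global df
    X = data[:, 1:]
    y = data[:, 0]
    return sm.OLS(y, sm.add_constant(X)).fit().params
With this one change, each call inside simple_jackknife sees a different leave-one-out sample, and the coefficients vary from replicate to replicate.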

Equivalent of numpy.digitize in tensorflow

I am working on a customised loss function that uses numpy.digitize() internally. The loss is minimised over a set of parameters that are the bin values used in the digitize method. In order to use the TensorFlow optimisers, I would like to know whether there is an equivalent implementation of digitize in TensorFlow, and if not, whether there is a good way to implement a workaround.
Here is a NumPy version:
import numpy as np

def fom_func(b, n):
    return np.where((b > 0) & (n > 0),
                    np.sqrt(2*(n*np.log(np.divide(n, b)) + b - n)), 0)

def loss(param, X, y):
    param = np.sort(np.asarray(param))
    nbins = param.shape[0]
    score = 0
    y_pred = np.digitize(X, param)
    for c in np.arange(nbins):
        b = np.where((y == 0) & (y_pred == c), 1, 0).sum()
        n = np.where((y_pred == c), 1, 0).sum()
        score += fom_func(b, n)**2
    return -np.sqrt(score)
The equivalent of the np.digitize method is called bucketize in TensorFlow; quoting from this API doc:
Bucketizes 'input' based on 'boundaries'.
Summary
For example, if the inputs are
boundaries = [0, 10, 100]
input = [[-5, 10000], [150, 10], [5, 100]]
then the output will be
output = [[0, 3], [3, 2], [1, 3]]
Arguments:
scope: A Scope object
input: A Tensor of any shape, with int or float type.
boundaries: A sorted list of floats giving the boundaries of the buckets.
Returns:
Output: Same shape with 'input', each value of input replaced with bucket index.
(numpy) Equivalent to np.digitize.
I'm not sure why, but this method is hidden in TensorFlow (see the hidden_ops.txt file), so I wouldn't count on it, even though you can import it by doing:
from tensorflow.python.ops import math_ops
math_ops._bucketize
This has helped me; just be aware that values are assigned not with respect to the left or right edge, but to the intervals between the bin edges:
import tensorflow_probability as tfp
tfp.stats.find_bins()
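On current TensorFlow there is also a public op that avoids the hidden _bucketize: tf.searchsorted. With side='right' it reproduces np.digitize's default behavior for monotonically increasing bins (an equivalence I'm relying on here, so verify it against your use case):
import tensorflow as tf

bins = tf.constant([0.0, 10.0, 100.0])
x = tf.constant([-5.0, 5.0, 10.0, 150.0])

# side='right' matches np.digitize(x, bins) for increasing bins
idx = tf.searchsorted(bins, x, side='right')
print(idx.numpy())  # expected [0 1 2 3], as np.digitize would give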

Obtaining Legendre polynomial form once Legendre coefficients are determined

I have obtained the coefficients for the Legendre polynomial that best fits my data. Now I need to determine the value of that polynomial at each time-step of my data, so that I can subtract the fit from the data. I have looked at the documentation for the Legendre module, and I'm not sure whether I just don't understand my options or whether there isn't a native tool in place for what I want. If my data points were evenly spaced, linspace would be a good option, but that's not the case here. Does anyone have a suggestion for what to try?
For those who would like a minimal working example: just use a random array, get the coefficients, and tell me from there how you would proceed. The values themselves don't matter; it's the technique that I'm asking about here. Thanks.
To simplify Ahmed's example:
In [1]: from numpy.polynomial import Polynomial, Legendre
In [2]: p = Polynomial([0.5, 0.3, 0.1])
In [3]: x = np.random.rand(10) * 10
In [4]: y = p(x)
In [5]: pfit = Legendre.fit(x, y, 2)
In [6]: plot(*pfit.linspace())
Out[6]: [<matplotlib.lines.Line2D at 0x7f815364f310>]
In [7]: plot(x, y, 'o')
Out[7]: [<matplotlib.lines.Line2D at 0x7f81535d8bd0>]
The Legendre functions are scaled and offset, as the data should be confined to the interval [-1, 1] to get any advantage over the usual power basis. If you want the coefficients for plain old Legendre functions:
In [8]: pfit.convert()
Out[8]: Legendre([ 0.53333333, 0.3 , 0.06666667], [-1., 1.], [-1., 1.])
But that isn't recommended.
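Note, for the original question about unevenly spaced timepoints: the fitted object returned by Legendre.fit is callable, so you can evaluate it directly at your own sample points; no linspace needed. A minimal sketch continuing from pfit above:
t = np.array([0.3, 1.7, 4.2, 9.9])  # arbitrary, unevenly spaced points
fit_values = pfit(t)                # the domain mapping is handled internally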
Once you have a function, you can just generate a numpy array for the timepoints:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal):  # pfinal is just the estimate of the final array (I'll do quadratic)
...     a, b, c = pfinal  # obviously, for a*x^2 + b*x + c
...     return (a*bins**2) + b*bins + c
>>> mypolynomial(myarray, (1, 1, 0))
array([  2,  12,  56, 240, 272, 306, 380])
It automatically evaluates the polynomial for each timepoint in the numpy array.
Now all you have to do is rewrite mypolynomial to go from this simple quadratic example to a proper one for a Legendre polynomial. Treat the function as if it were evaluating a float to return the value; when called on a numpy array, it will automatically be evaluated for each element.
EDIT:
Let's say I wanted to generalize this to all standard polynomials:
>>> import numpy as np
>>> timepoints = [1,3,7,15,16,17,19]
>>> myarray = np.array(timepoints)
>>> def mypolynomial(bins, pfinal):  # pfinal is just the estimate of the final array
...     hist = np.zeros(len(bins))  # define blank return
...     for i in range(len(pfinal)):
...         # negative index to go from the 0 exponent up to the highest exponent
...         # (pfinal[-i] would give -0 rather than -1, since negative indexing starts at -1)
...         const = pfinal[-i-1]
...         hist += const*(bins**i)
...     return hist
>>> mypolynomial(myarray, (1, 1, 0))
array([   2.,   12.,   56.,  240.,  272.,  306.,  380.])
EDIT2: Typo fix
EDIT3:
@Ahmed is perfectly right when he states that Horner's rule is good for numerical stability. The implementation here would be as follows:
>>> def horner(coeffs, x):
... acc = 0
... for c in coeffs:
... acc = acc * x + c
... return acc
>>> horner((1,1,0), myarray)
array([ 2, 12, 56, 240, 272, 306, 380])
Slightly modified to keep the same argument order as before, from the code here:
http://rosettacode.org/wiki/Horner%27s_rule_for_polynomial_evaluation#Python
When you're using a nice library to fit polynomials, the library will in my experience usually have a function to evaluate them. So I think it is useful to know how you're generating these coefficients.
In the example below, I used two functions in numpy, legfit and legval which made it trivial to both fit and evaluate the Legendre polynomials without any need to invoke Horner's rule or do the bookkeeping yourself. (Though I do use Horner's rule to generate some example data.)
Here's a complete example where I generate some sparse data from a known polynomial, fit a Legendre polynomial to it, evaluate that polynomial on a dense grid, and plot. Note that the fitting and evaluating part takes three lines thanks to the numpy library doing all the heavy lifting.
It produces a figure showing the sparse data points together with the fitted curve (figure not reproduced here).
import numpy as np

### Setup code
def horner(coeffs, x):
    """Evaluate a polynomial (coefficients from lowest to highest degree) at a point or array"""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

x = np.random.rand(10) * 10
true_coefs = [0.1, 0.3, 0.5]
y = horner(true_coefs, x)

### Fit and evaluate
legendre_coefs = np.polynomial.legendre.legfit(x, y, 2)
new_x = np.linspace(0, 10)
new_y = np.polynomial.legendre.legval(new_x, legendre_coefs)

### Plotting only
try:
    import pylab
    pylab.ion()  # turn on interactive plotting
    pylab.figure()
    pylab.plot(x, y, 'o', new_x, new_y, '-')
    pylab.xlabel('x')
    pylab.ylabel('y')
    pylab.title('Fitting Legendre polynomials and evaluating them')
    pylab.legend(['original sparse data', 'fit'])
except Exception:
    print("Can't start plots.")

Python/Numpy/Scipy: Draw Poisson random values with different lambda

My problem is to extract, in the most efficient way, N Poisson random values (RV), each with a different mean/rate Lam. Basically, size(RV) == size(Lam).
Here is a naive (very slow) implementation:
import numpy as NP

def multi_rate_poisson(Lam):
    rv = NP.zeros(NP.size(Lam))
    for i, lam in enumerate(Lam):
        rv[i] = NP.random.poisson(lam=lam, size=1)
    return rv
That, on my laptop, with 1e6 samples gives:
Lam = NP.random.rand(10**6) + 1
timeit multi_rate_poisson(Lam)
1 loops, best of 3: 4.82 s per loop
Is it possible to improve from this?
Although the docstrings don't document this functionality, the source indicates it is possible to pass an array to the numpy.random.poisson function.
>>> import numpy
>>> # 1-D array of 1e6 random variates uniformly distributed between 1 and 2
>>> numpyarray = numpy.random.rand(10**6) + 1
>>> # pass it to poisson
>>> poissonarray = numpy.random.poisson(lam=numpyarray)
>>> poissonarray
array([4, 2, 3, ..., 1, 0, 0])
The Poisson random variable takes discrete non-negative integer values, and its distribution approximates a bell curve as lambda grows large.
>>> import matplotlib.pyplot
>>> count, bins, ignored = matplotlib.pyplot.hist(
...     numpy.random.poisson(lam=numpy.random.rand(10**6) + 10),
...     14, density=True)
>>> matplotlib.pyplot.show()
This method of passing the array to the poisson generator appears to be quite efficient.
>>> timeit.Timer("numpy.random.poisson(lam=numpy.random.rand(10**6) + 1)",
...              'import numpy').repeat(3, 1)
[0.13525915145874023, 0.12136101722717285, 0.12127304077148438]
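On current NumPy, the same vectorized draw is available through the Generator API, which is now the recommended interface:
import numpy as np

rng = np.random.default_rng()
lam = rng.random(10**6) + 1  # one rate per sample, uniform on [1, 2)
samples = rng.poisson(lam)   # one Poisson draw per rate, fully vectorized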

Strange chi-square result using scikit-learn with feature matrix

I am using scikit-learn to calculate the basic chi-square statistic (sklearn.feature_selection.chi2(X, y)):
def chi_square(feat, target):
    """Compute chi-square statistics and p-values for each feature."""
    from sklearn.feature_selection import chi2
    ch, pval = chi2(feat, target)
    return ch, pval

chisq, p = chi_square(feat_mat, target_sc)
print(chisq)
print("**********************")
print(p)
I have 1500 samples, 45 features, and 4 classes. The input is a 1500x45 feature matrix and a target array with 1500 components. The feature matrix is not sparse. When I run the program and print the array chisq (45 components), I see that component 13 has a negative value with p = 1. How is that possible? What does it mean, or what big mistake am I making?
I am attaching the printouts of chisq and p:
[ 9.17099260e-01 3.77439701e+00 5.35004211e+01 2.17843312e+03
4.27047184e+04 2.23204883e+01 6.49985540e-01 2.02132664e-01
1.57324454e-03 2.16322638e-01 1.85592258e+00 5.70455805e+00
1.34911126e-02 -1.71834753e+01 1.05112366e+00 3.07383691e-01
5.55694752e-02 7.52801686e-01 9.74807972e-01 9.30619466e-02
4.52669897e-02 1.08348058e-01 9.88146259e-03 2.26292358e-01
5.08579194e-02 4.46232554e-02 1.22740419e-02 6.84545170e-02
6.71339545e-03 1.33252061e-02 1.69296016e-02 3.81318236e-02
4.74945604e-02 1.59313146e-01 9.73037448e-03 9.95771327e-03
6.93777954e-02 3.87738690e-02 1.53693158e-01 9.24603716e-04
1.22473138e-01 2.73347277e-01 1.69060817e-02 1.10868365e-02
8.62029628e+00]
**********************
[ 8.21299526e-01 2.86878266e-01 1.43400668e-11 0.00000000e+00
0.00000000e+00 5.59436980e-05 8.84899894e-01 9.77244281e-01
9.99983411e-01 9.74912223e-01 6.02841813e-01 1.26903019e-01
9.99584918e-01 1.00000000e+00 7.88884155e-01 9.58633878e-01
9.96573548e-01 8.60719653e-01 8.07347364e-01 9.92656816e-01
9.97473024e-01 9.90817144e-01 9.99739526e-01 9.73237195e-01
9.96995722e-01 9.97526259e-01 9.99639669e-01 9.95333185e-01
9.99853998e-01 9.99592531e-01 9.99417113e-01 9.98042114e-01
9.97286030e-01 9.83873717e-01 9.99745466e-01 9.99736512e-01
9.95239765e-01 9.97992843e-01 9.84693908e-01 9.99992525e-01
9.89010468e-01 9.64960636e-01 9.99418323e-01 9.99690553e-01
3.47893682e-02]
If you put some print statements in the code defining chi2:
def chi2(X, y):
    X = atleast2d_or_csr(X)
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    observed = safe_sparse_dot(Y.T, X)  # n_classes * n_features
    print(repr(observed))

    feature_count = array2d(X.sum(axis=0))
    class_prob = array2d(Y.mean(axis=0))
    expected = safe_sparse_dot(class_prob.T, feature_count)
    print(repr(expected))

    return stats.chisquare(observed, expected)
you'll see that expected ends up having some negative values.
import numpy as np
import sklearn.feature_selection as FS

x = np.array([-0.23918515, -0.29967287, -0.33007592,  0.07383528, -0.09205183,
              -0.12548226,  0.04770942, -0.54318463, -0.16833203, -0.00332341,
               0.0179646 , -0.0526383 ,  0.04288736, -0.27427317, -0.16136621,
              -0.09228812, -0.2255725 , -0.03744027,  0.02953499, -0.17387492])
y = np.array([1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1],
             dtype='int64')
FS.chi2(x.reshape(-1, 1), y)
yields
observed:
array([[-1.31238179],
       [-0.76922812],
       [-0.52522003]])
expected:
array([[-1.56409796],
       [-0.78204898],
       [-0.26068299]])
stats.chisquare(observed, expected) is then called. There, observed and expected are assumed to be frequencies of categories, so they should all be non-negative numbers.
I'm not familiar enough with scikit-learn to suggest how your problem should be fixed, but it appears that the kind of data you are sending to chi2 is of the wrong sort, since expected should be non-negative.
(E.g., could it be that the x values above should all be positive and represent frequencies of observations?)
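If the features are genuinely meant to act like counts or frequencies, one common workaround (a pragmatic fix, not a statistical endorsement) is to rescale each feature to be non-negative before calling chi2, for example with MinMaxScaler:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import chi2

X = np.array([[-0.24, 0.50], [0.30, -0.10], [0.10, 0.20], [-0.05, 0.40]])
y = np.array([1, 2, 1, 2])

# chi2 requires non-negative features; map each column onto [0, 1]
X_nonneg = MinMaxScaler().fit_transform(X)
scores, pvals = chi2(X_nonneg, y)
Whether the resulting scores are meaningful depends on what the features represent.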
