I am using scikit-learn to calculate the basic chi-square statistics (sklearn.feature_selection.chi2(X, y)):
def chi_square(feat, target):
    """Return the chi-square statistic and p-value for each feature."""
    from sklearn.feature_selection import chi2
    ch, pval = chi2(feat, target)
    return ch, pval

chisq, p = chi_square(feat_mat, target_sc)
print(chisq)
print("**********************")
print(p)
I have 1500 samples, 45 features, and 4 classes. The input is a 1500x45 feature matrix and a target array with 1500 components. The feature matrix is not sparse. When I run the program and print the array "chisq" with 45 components, I can see that component 13 has a negative value and p = 1. How is that possible? What does it mean, or what big mistake am I making?
I am attaching the printouts of chisq and p:
[ 9.17099260e-01 3.77439701e+00 5.35004211e+01 2.17843312e+03
4.27047184e+04 2.23204883e+01 6.49985540e-01 2.02132664e-01
1.57324454e-03 2.16322638e-01 1.85592258e+00 5.70455805e+00
1.34911126e-02 -1.71834753e+01 1.05112366e+00 3.07383691e-01
5.55694752e-02 7.52801686e-01 9.74807972e-01 9.30619466e-02
4.52669897e-02 1.08348058e-01 9.88146259e-03 2.26292358e-01
5.08579194e-02 4.46232554e-02 1.22740419e-02 6.84545170e-02
6.71339545e-03 1.33252061e-02 1.69296016e-02 3.81318236e-02
4.74945604e-02 1.59313146e-01 9.73037448e-03 9.95771327e-03
6.93777954e-02 3.87738690e-02 1.53693158e-01 9.24603716e-04
1.22473138e-01 2.73347277e-01 1.69060817e-02 1.10868365e-02
8.62029628e+00]
**********************
[ 8.21299526e-01 2.86878266e-01 1.43400668e-11 0.00000000e+00
0.00000000e+00 5.59436980e-05 8.84899894e-01 9.77244281e-01
9.99983411e-01 9.74912223e-01 6.02841813e-01 1.26903019e-01
9.99584918e-01 1.00000000e+00 7.88884155e-01 9.58633878e-01
9.96573548e-01 8.60719653e-01 8.07347364e-01 9.92656816e-01
9.97473024e-01 9.90817144e-01 9.99739526e-01 9.73237195e-01
9.96995722e-01 9.97526259e-01 9.99639669e-01 9.95333185e-01
9.99853998e-01 9.99592531e-01 9.99417113e-01 9.98042114e-01
9.97286030e-01 9.83873717e-01 9.99745466e-01 9.99736512e-01
9.95239765e-01 9.97992843e-01 9.84693908e-01 9.99992525e-01
9.89010468e-01 9.64960636e-01 9.99418323e-01 9.99690553e-01
3.47893682e-02]
If you put some print statements in the code defining chi2,

def chi2(X, y):
    X = atleast2d_or_csr(X)
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    observed = safe_sparse_dot(Y.T, X)      # n_classes * n_features
    print(repr(observed))

    feature_count = array2d(X.sum(axis=0))
    class_prob = array2d(Y.mean(axis=0))
    expected = safe_sparse_dot(class_prob.T, feature_count)
    print(repr(expected))

    return stats.chisquare(observed, expected)

you'll see that expected ends up having some negative values.
For example, with a small reproduction:

import numpy as np
import sklearn.feature_selection as FS

x = np.array([-0.23918515, -0.29967287, -0.33007592,  0.07383528, -0.09205183,
              -0.12548226,  0.04770942, -0.54318463, -0.16833203, -0.00332341,
              0.0179646,  -0.0526383,   0.04288736, -0.27427317, -0.16136621,
              -0.09228812, -0.2255725,  -0.03744027,  0.02953499, -0.17387492])
y = np.array([1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1],
             dtype='int64')

FS.chi2(x.reshape(-1, 1), y)
yields

observed:
array([[-1.31238179],
       [-0.76922812],
       [-0.52522003]])

expected:
array([[-1.56409796],
       [-0.78204898],
       [-0.26068299]])
stats.chisquare(observed, expected) is then called. There, observed and expected are assumed to be frequencies of categories, and they should all be non-negative numbers, since frequencies are non-negative.
I'm not familiar enough with scikit-learn to suggest how your problem should be fixed, but it appears that the kind of data you are sending to chi2 is of the wrong sort, since expected should be non-negative. (For example, could it be that the x values above should all be positive and represent frequencies of observations?)
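As a possible fix (my suggestion, not something the original answer prescribes): since sklearn's chi2 is meant for non-negative features such as counts or frequencies, you could rescale each feature to [0, 1] before calling it, e.g. with MinMaxScaler:

# Hedged sketch: rescale features to be non-negative before chi2.
# feat_mat and target_sc are placeholders standing in for the asker's data.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import chi2

feat_mat = np.random.randn(1500, 45)            # placeholder data
target_sc = np.random.randint(0, 4, size=1500)  # 4 classes

feat_scaled = MinMaxScaler().fit_transform(feat_mat)  # every value now >= 0
chisq, p = chi2(feat_scaled, target_sc)
assert (chisq >= 0).all()   # no negative statistics with non-negative input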
Can someone help me translate this R t.test call to Python?

R code:

t.test(y, mu = 85, paired = FALSE, var.equal = TRUE, alternative = "greater")
You are testing a single sample y against a population mean mu, so the corresponding function from SciPy is scipy.stats.ttest_1samp. When t.test is not given a second sample, var.equal and paired are not relevant, so the only other parameter to deal with is alternative, and the SciPy function also takes an alternative parameter. So the Python code is
from scipy.stats import ttest_1samp
result = ttest_1samp(y, mu, alternative='greater')
Note that ttest_1samp returns only the t statistic (result.statistic) and the p-value (result.pvalue).
For example, here is a calculation in R:
> x = c(3, 1, 4, 1, 5, 9)
> result = t.test(x, mu=2, alternative='greater')
> result$statistic
t
1.49969
> result$p.value
[1] 0.09699043
Here's the corresponding calculation in Python:
In [14]: x = [3, 1, 4, 1, 5, 9]
In [15]: result = ttest_1samp(x, 2, alternative='greater')
In [16]: result.statistic
Out[16]: 1.499690178660333
In [17]: result.pvalue
Out[17]: 0.0969904256712105
You may find this blog useful: https://www.reneshbedre.com/blog/ttest.html

Below is an example of the conversion with the bioinfokit package, but you can use the SciPy one.
# Perform one sample t-test using bioinfokit,
# Doc: https://github.com/reneshbedre/bioinfokit
from bioinfokit.analys import stat
from bioinfokit.analys import get_data
df = get_data("t_one_samp").data #replace this with your data file
res = stat()
res.ttest(df=df, test_type=1, res='size', mu=5, evar=True)
print(res.summary)
Output:
One Sample t-test
------------------ --------
Sample size 50
Mean 5.05128
t 0.36789
Df 49
P-value (one-tail) 0.35727
P-value (two-tail) 0.71454
Lower 95.0% 4.77116
Upper 95.0% 5.3314
------------------ --------
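For comparison, here is my sketch of the same one-sample test with SciPy, assuming the sample sits in a 'size' column as the bioinfokit example suggests:

from scipy.stats import ttest_1samp

# df["size"] is assumed to hold the sample used above
res = ttest_1samp(df["size"], 5)        # two-tailed by default
print(res.statistic, res.pvalue)        # t and two-tail p
# one-tailed p-value (requires scipy >= 1.6):
print(ttest_1samp(df["size"], 5, alternative="greater").pvalue)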
I've been able to calculate the coefficients of a linear regression. But is there a way to get the associated errors of the coefficients? My code is shown below.
import numpy as np

x = np.array([4, 12, 56, 58.6, 67, 89])
y = np.array([5, 6, 7, 16, 18, 19])

degrees = [0, 1]  # list of degrees of x to use
matrix = np.stack([x**d for d in degrees], axis=-1)
coeff = np.linalg.lstsq(matrix, y, rcond=None)[0]
print("Coefficients", coeff)
fit = np.dot(matrix, coeff)
print("Linear regression", fit)

p1 = np.polyfit(x, y, 1)
Output:
Coefficients for y=a +bx [3.70720668 0.17012128]
Linear fit [ 4.38769182 5.74866209 13.23399857 13.67631391 15.10533269 18.84800093]
The errors are not shown! How do I calculate the errors?
You can generate the "predicted" values for y, call them y_pred, and compare them to y to get the errors (the residuals).

import numpy as np

# coeff from lstsq is ordered [a, b] (constant term first), while
# np.poly1d expects the highest degree first, so reverse it.
predicted_line = np.poly1d(coeff[::-1])
y_pred = predicted_line(x)
errors = y - y_pred
Although I like the solution of David Moseler, if you want an error to evaluate the goodness of your regression, you could use the R2 score (which uses the squared error), already implemented in sklearn:
from sklearn.linear_model import LinearRegression
import numpy as np
x = np.array([4, 12, 56, 58.6, 67, 89]).reshape(-1, 1)
y = np.array([5, 6, 7, 16, 18, 19])
reg = LinearRegression().fit(x, y)
reg.score(x, y) # R2 score
# 0.7481301984276703
If the R2 value is near 1, the model is a good one.
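If by "errors" you mean the standard errors of the fitted coefficients themselves (rather than residuals or a goodness-of-fit score), one sketch is to ask np.polyfit for the covariance matrix of the estimates:

import numpy as np

x = np.array([4, 12, 56, 58.6, 67, 89])
y = np.array([5, 6, 7, 16, 18, 19])

# cov=True also returns the covariance matrix of the coefficients;
# the square roots of its diagonal are the standard errors.
p, cov = np.polyfit(x, y, 1, cov=True)   # p = [b, a], highest degree first
perr = np.sqrt(np.diag(cov))
print("slope     = %g +/- %g" % (p[0], perr[0]))
print("intercept = %g +/- %g" % (p[1], perr[1]))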
I'm doing a robust linear regression on only a constant (a column of 1s) and no exogenous variable. I'm able to calculate the model just fine by inputting a list of 1's equal to the size of the 'xi_list' from the code snippet below.
def sigma_and_miu(gvkey, statevar_dict):
    statevar_list = statevar_dict[gvkey]
    xi_list = [np.log(statevar_list[i]) - np.log(statevar_list[i-1])
               for i in range(1, len(statevar_list))]
    x = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
    y = np.array(xi_list)
    rlm_model = sm.RLM(y, x, M=sm.robust.norms.HuberT())
    rlm_results = rlm_model.fit()
    sigma = np.std(rlm_results.resid * rlm_results.weights)
    miudelta = rlm_results.params[0] + (0.5 * sigma ** 2)
    return miudelta, sigma
This function is run with the following inputs.
dict = {1004:[1796.6, 1938.6, 2085.4, 2009.4, 1906.1, 2002.2, 2164.9, 2478.8, 2357.4, 2662.1, 2911.2, 2400.4, 2535.9, 2812.3, 2873.1, 2775.5, 3374.2, 3345.5, 3466.3, 2409.4]}
key = 1004
miu, sigma = sigma_and_miu(key,dict)
However, I'm looking for a more scalable approach. I was thinking that one solution could be a loop that appends as many 1's as the length of the xi_list variable, but this does not seem very efficient.

I know there is sm.add_constant(), and I tried adding this constant to my 'y' variable while leaving 'x' blank in the sm.RLM() call, but then the model cannot run.

So my question is whether there is a better way to create the list of 1s, or should I just go for the loop?
Use basic numpy vectorized computation, e.g.:
statevar = np.asarray(statevar_list)
y = np.log(statevar[1:]) - np.log(statevar[:-1])
x = np.ones(len(y))
Aside: The rlm_results should have the robust estimate of the standard deviation that is used in the estimation as a scale attribute.
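Put together, a sketch of the original function using this vectorized construction (reusing the question's statsmodels calls) might look like:

import numpy as np
import statsmodels.api as sm

def sigma_and_miu(gvkey, statevar_dict):
    statevar = np.asarray(statevar_dict[gvkey])
    y = np.log(statevar[1:]) - np.log(statevar[:-1])   # log differences
    x = np.ones(len(y))                                # constant-only design
    rlm_results = sm.RLM(y, x, M=sm.robust.norms.HuberT()).fit()
    sigma = np.std(rlm_results.resid * rlm_results.weights)
    miudelta = rlm_results.params[0] + 0.5 * sigma ** 2
    return miudelta, sigma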
Very similar to "RBF interpolation fails: LinAlgError: singular matrix", but I think the problem is different, as I have no duplicated coordinates.
Toy example:
import numpy as np
import scipy.interpolate as interp
coords = (np.array([-1, 0, 1]), np.array([-2, 0, 2]), np.array([-1, 0, 1]))
coords_mesh = np.meshgrid(*coords, indexing="ij")
fn_value = np.power(coords_mesh[0], 2) + coords_mesh[1]*coords_mesh[2] # F(x, y, z)
coords_array = np.vstack([x.flatten() for x in coords_mesh]).T # Columns are x, y, z
unique_coords_array = np.vstack({tuple(row) for row in coords_array})
unique_coords_array.shape == coords_array.shape # True, i.e. no duplicate coords
my_grid_interp = interp.RegularGridInterpolator(points=coords, values=fn_value)
my_grid_interp(np.array([0, 0, 0])) # Runs without error
my_rbf_interp = interp.Rbf(*[x.flatten() for x in coords_mesh], d=fn_value.flatten())
## Error: numpy.linalg.linalg.LinAlgError: singular matrix -- why?
What am I missing? The example above uses the function F(x, y, z) = x^2 + y*z. I'd like to use Rbf to approximate that function. As far as I can tell there are no duplicate coordinates: compare unique_coords_array to coords_array.
I believe the problem is your input:

my_rbf_interp = interp.Rbf(*[x.flatten() for x in coords_mesh], d=fn_value.flatten())

If you change it to:

x, y, z = [x.flatten() for x in coords_mesh]
my_rbf_interp = interp.Rbf(x, y, z, fn_value.flatten())

it should work. I think your original formulation ends up repeating rows in the matrix that is sent to the solver, which causes a problem very similar to having duplicates (i.e., a singular matrix).
This should also work:

d = fn_value.flatten()
my_rbf_interp = interp.Rbf(*(x, y, z, d))
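As a quick sanity check (my addition), both interpolants can be evaluated at the origin, a grid point where F(0, 0, 0) = 0:

# RBF interpolation passes through the data points, so the value at the
# grid node (0, 0, 0) should match the exact one.
print(my_grid_interp(np.array([0.0, 0.0, 0.0])))  # [ 0.]
print(my_rbf_interp(0.0, 0.0, 0.0))               # ~0 up to floating point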
On the numpy page they give the example of
s = np.random.dirichlet((10, 5, 3), 20)
which is all fine and great; but what if you want to generate random samples from a 2D array of alphas?
alphas = np.random.randint(10, size=(20, 3))
If you try np.random.dirichlet(alphas), np.random.dirichlet([x for x in alphas]), or np.random.dirichlet((x for x in alphas)), it results in a
ValueError: object too deep for desired array. The only thing that seems to work is:
y = np.empty(alphas.shape)
for i in xrange(np.alen(alphas)):
    y[i] = np.random.dirichlet(alphas[i])
print y
...which is far from ideal for my code structure. Why is this the case, and can anyone think of a more "numpy-like" way of doing this?
Thanks in advance.
np.random.dirichlet is written to generate samples for a single Dirichlet distribution. That code is implemented in terms of the Gamma distribution, and that implementation can be used as the basis for a vectorized code to generate samples from different distributions. In the following, dirichlet_sample takes an array alphas with shape (n, k), where each row is an alpha vector for a Dirichlet distribution. It returns an array also with shape (n, k), each row being a sample of the corresponding distribution from alphas. When run as a script, it generates samples using dirichlet_sample and np.random.dirichlet to verify that they are generating the same samples (up to normal floating point differences).
import numpy as np

def dirichlet_sample(alphas):
    """
    Generate samples from an array of alpha distributions.
    """
    r = np.random.standard_gamma(alphas)
    return r / r.sum(-1, keepdims=True)

if __name__ == "__main__":
    alphas = 2 ** np.random.randint(0, 4, size=(6, 3))

    np.random.seed(1234)
    d1 = dirichlet_sample(alphas)
    print "dirichlet_sample:"
    print d1

    np.random.seed(1234)
    d2 = np.empty(alphas.shape)
    for k in range(len(alphas)):
        d2[k] = np.random.dirichlet(alphas[k])
    print "np.random.dirichlet:"
    print d2

    # Compare d1 and d2:
    err = np.abs(d1 - d2).max()
    print "max difference:", err
Sample run:
dirichlet_sample:
[[ 0.38980834 0.4043844 0.20580726]
[ 0.14076375 0.26906604 0.59017021]
[ 0.64223074 0.26099934 0.09676991]
[ 0.21880145 0.33775249 0.44344606]
[ 0.39879859 0.40984454 0.19135688]
[ 0.73976425 0.21467288 0.04556287]]
np.random.dirichlet:
[[ 0.38980834 0.4043844 0.20580726]
[ 0.14076375 0.26906604 0.59017021]
[ 0.64223074 0.26099934 0.09676991]
[ 0.21880145 0.33775249 0.44344606]
[ 0.39879859 0.40984454 0.19135688]
[ 0.73976425 0.21467288 0.04556287]]
max difference: 5.55111512313e-17
I think you're looking for

y = np.array([np.random.dirichlet(x) for x in alphas])

for your list comprehension. Otherwise you're simply passing a Python list or tuple. I imagine the reason numpy.random.dirichlet does not accept your list of alpha values is that it's not set up to - it already accepts an array, which it expects to have a dimension of k, as per the documentation.
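For example (my illustration; note that the alpha parameters must be strictly positive, which the randint(10, ...) call in the question does not guarantee):

alphas = np.random.randint(1, 10, size=(20, 3))         # all alphas > 0
y = np.array([np.random.dirichlet(a) for a in alphas])
print(y.shape)   # (20, 3), one Dirichlet sample per row of alphas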
for your list comprehension. Otherwise you're simply passing a python list or tuple. I imagine the reason numpy.random.dirichlet does not accept your list of alpha values is because it's not set up to - it already accepts an array, which it expects to have a dimension of k, as per the documentation.