Getting ValueError when expanding my GMMHMM from 2 to 3 states - Python

I am trying to expand my GMMHMM model from two to three states, but I get the error below:
"ValueError: startprob_ must sum to 1 (got nan)"
It says that my initial state distribution does not sum to one, but it does (see Pi). I also get the following warning, which might have something to do with it:
"UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1."
Furthermore, if I look into it I can see that my state transition matrix contains nan values.
import numpy as np
from hmmlearn.hmm import GMMHMM
import pandas as pd

Pi = np.array([0.24, 0.37, 0.39])
A = np.array([[0.74, 0.20, 0.06],
              [0.20, 0.53, 0.27],
              [0.05, 0.40, 0.54]])

model = GMMHMM(n_components=3, n_mix=1, startprob_prior=Pi, transmat_prior=A,
               min_covar=0.001, tol=0.0001, n_iter=10000)

Obs = df[['gdp', 'un', 'inf', 'inx', 'itr']].to_numpy()
print(Obs)

model.fit(Obs)
print(model.transmat_)

seq = model.decode(Obs)
print(seq)
I am not a very experienced Python programmer, so this might be an easy fix, but unfortunately I do not see how. Any help would be highly appreciated!
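As a first diagnostic (an added suggestion, not from the original question), it is worth checking whether the observation matrix itself contains NaN or infinite values, since NaNs in the training data propagate into the fitted startprob_ and transmat_:

# Hypothetical sanity check: NaNs or infs in Obs are a common cause of
# nan parameters after fitting
print(np.isnan(Obs).any(), np.isinf(Obs).any())
print(df[['gdp', 'un', 'inf', 'inx', 'itr']].isna().sum())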

Related

Pandas: Create a binary column randomly but with specific proportions

I am trying to create a new random binary column in my table, and it needs to have 60% of the values as 1 and 40% as 0. I have tried to use the np.random.choice function from the numpy package as follows; however, the proportion changes every time I run my code.
np.random.choice(a = [0,1], size = len(df), p = [0.4, 0.6])
I need to have these proportions fixed. Can anyone help with how this can be done? Thank you!
This is how you create a numpy array of size 100 with the distribution of 1s and 0s that you wanted, stored in the variable m:
import numpy as np
m = np.random.choice(a = [0,1], size = 100, p = [0.4, 0.6])
I don't know anything about your pandas data frame, because you didn't post your source code here. Therefore I can't tell you why len(df) is different each time.
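If the proportions really must be exact on every run (np.random.choice with p only matches them in expectation), one common alternative, sketched here as an assumption about what the question is after rather than as part of the original answer, is to build the column with fixed counts and then shuffle it:

import numpy as np

n = 100                       # use len(df) when assigning to a data frame column
n_ones = int(round(0.6 * n))  # exactly 60% ones
m = np.concatenate([np.ones(n_ones, dtype=int), np.zeros(n - n_ones, dtype=int)])
np.random.shuffle(m)          # random order, fixed proportion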

Python, seaborn, statistic analysis using statannot doesn't look right

I used statannot to perform a statistical test on some basic data, but the results from the statistical test don't seem correct. For example, a couple of my comparisons come up with "P_val=0.000e+00 U_stat=0.000e+00", which I think should not be possible. Is there something wrong with my data frame and/or code?
Here is the data frame I am using:
and here is my code:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from statannot import add_stat_annotation
import scipy.stats as sp
data = pd.read_excel('Z:/DMF/GROUPS/gr_Veening/Users/Vik/scRNA-seq/FACSAria/Adherence-invasion assays/adherence_invasion_assay_a549-RFP 4-6-21.xlsx',sheet_name="Sheet2", header = 0)
sns.set_theme(style="darkgrid")
ax1 = sns.boxplot(x="Strain", y="adherence_counts", data=data)
x = "Strain"
y = "adherence_counts"
order = ["D39", "D39 Δcps", "19F", "19F ΔcomCDE"]
ax1 = sns.boxplot(data=data, x=x, y=y, order=order)
plt.title("Adherence Assay")
plt.ylabel('CFU/ml')
plt.xlabel('')
ax1.set(xticklabels=["D39", "D39 Δ$\it{cps}$", "19F", "19F Δ$\it{comCDE}$"])
add_stat_annotation(ax1, data=data, x=x, y=y, order=order,
                    box_pairs=[("D39", "19F"), ("D39", "D39 Δcps"), ("D39 Δcps", "19F"), ("19F", "19F ΔcomCDE")],
                    test='Mann-Whitney', text_format='star', loc='inside', verbose=2)
Finally, here are the results from this statistical test:
D39 v.s. D39 Δcps: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=0.000e+00 U_stat=0.000e+00
D39 Δcps v.s. 19F: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=2.000e+00
19F v.s. 19F ΔcomCDE: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=7.617e-01 U_stat=8.000e+00
D39 v.s. 19F: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=0.000e+00 U_stat=0.000e+00
C:\Users\Vik\anaconda3\lib\site-packages\scipy\stats\stats.py:7171: RuntimeWarning: divide by zero encountered in double_scalars
z = (bigu - meanrank) / sd
Any help would be greatly appreciated, thanks!
Your problems come from two parts:
Statistically, in some of your cases (such as "D39" vs "19F"), all items are larger/smaller in one group than in the other, hence the 0 U statistic and the extreme p-value. It is very much possible to have these results: the test examines only the ranks of the values provided, which has both advantages and limitations (and the Mann-Whitney test is not well suited to such small sample sizes either, especially with scipy assuming equivariance).
Now, that line z = (bigu - meanrank) / sd failing means that sd = np.sqrt(T * n1 * n2 * (n1 + n2 + 1) / 12.0) is 0, so in this case n1 and/or n2 are 0 (these are len(x) and len(y); see the scipy source). So:
There is a bug in statannot, because this can happen silently if order and box_pairs both refer to a group which does not exist in the dataframe; I'll correct it in statannotations. Thank you for reporting it.
However, I cannot reproduce your Warning with a copy of your dataframe.
If this were the only bug, you should see a missing box in your plot at the point you showed us.
If not, is it possible you updated some of the code but did not copy the latest output here? Otherwise, there may be something more to uncover; please let us know.
EDIT: As discovered in the discussion, the second problem can happen in statannot if there is a mismatch between a label in order or box_pairs and the labels in the dataset. This has been patched in statannotations, a fork of statannot.
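To illustrate the first point, here is a small, self-contained check (with made-up numbers, not the poster's data) showing that scipy itself returns U = 0 when every value in one group is smaller than every value in the other:

import scipy.stats as sp

group_a = [1, 2, 3]      # hypothetical counts, all smaller
group_b = [10, 11, 12]   # hypothetical counts, all larger
print(sp.mannwhitneyu(group_a, group_b, alternative='two-sided'))
# The U statistic is 0 because the groups do not overlap at all; the p-value
# is as small as these sample sizes allow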

Summarise the posterior of a single parameter from an array with arviz

I am estimating a model using the pyMC3 library in python. In my "real" model, there are four parameter arrays, two of which have over 170,000 parameters in them. Summarising this array of parameters is too computationally intensive on my computer. I have been trying to figure out if the summary function in arviz will allow me to only summarise one (or a small number) of parameters in the array. Below is a reprex where the same problem is present, though the model is a lot simpler. In the linear regression model below, the parameter array b has three parameters in it b[0], b[1], b[2]. I would like to know how to get the summary for just b[0] and b[1] or alternatively for just a single parameter, e.g., b[0].
import pandas as pd
import pymc3 as pm
import arviz as az
d = pd.read_csv("https://quantoid.net/files/mtcars.csv")
mpg = d['mpg'].values
hp = d['hp'].values
weight = d['wt'].values
with pm.Model() as model:
    b = pm.Normal("b", mu=0, sigma=10, shape=3)
    sig = pm.HalfCauchy("sig", beta=2)
    mu = pm.Deterministic('mu', b[0] + b[1]*hp + b[2]*weight)
    like = pm.Normal('like', mu=mu, sigma=sig, observed=mpg)
    fit = pm.fit(10000, method='advi')
    samp = fit.sample(1500)

with model:
    smry = az.summary(samp, var_names = ["b"])
It looked like the coords argument to the summary() function would do it, but after googling around and finding a few examples, like the one here with plot_posterior() instead of summary(), I was unable to get something to work. In particular, I tried the following in the hopes that it would return the summary for b[0] and b[1].
with model:
    smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": range(1)})
or this to return the summary of b[0]:
with model:
    smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": [0]})
I suspect I am missing something simple (I'm an R user who dabbles occasionally with Python). Any help is greatly appreciated.
(BTW, I am using Python 3.8.0, pyMC3 3.9.3, arviz 0.10.0)
To use coords for this, you need to update to the development version of ArviZ (which will still show 0.11.2 but has the code from GitHub), or to any release newer than 0.11.2. Up to and including 0.11.2, the coords argument in summary was not used to subset the data (as it is in all plotting functions); it was only taken into account if the input was not already InferenceData, in which case it was passed to the converter.
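With such a version, the calls already attempted in the question should work essentially as written, for example (a sketch reusing the names from the question):

with model:
    smry = az.summary(samp, var_names=["b"], coords={"b_dim_0": [0, 1]})  # summary for b[0] and b[1] only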
With older versions, you need to use xarray to subset the data before passing it to summary. Therefore you need to explicitly convert the trace to InferenceData beforehand. In the example above it would look like:
with model:
    ...
    samp = fit.sample(1500)
    idata = az.from_pymc3(samp)

az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0]}))
Moreover, you may also want to tell summary to compute only a subset of the stats/diagnostics, as shown in the docstring examples.
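For example (a sketch assuming the idata object created above):

# kind="stats" returns only the summary statistics, skipping the diagnostics columns
az.summary(idata, var_names=["b"], kind="stats")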

Sklearn Lasso regularization not dropping out random variables?

I've been using SelectFromModel from sklearn to reduce features using LASSO regularization, and I'm finding that even when I set max_features quite low (low enough to negatively impact performance), the random variables are often kept.
I generated an example with fake data to illustrate, but I'm seeing similar behaviour with real data, and I am trying to understand why.
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from numpy import random
data = datasets.make_classification(n_features=20, n_informative=20, n_redundant=0,
                                    n_samples=1000, random_state=3)
X = pd.DataFrame(data[0])
y = data[1]
X['rand_feat1'] = random.randint(100, size=(X.shape[0]))
X['rand_feat2'] = random.randint(100, size=(X.shape[0]))/100
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver='liblinear', random_state=3),
                                      max_features=10)
embeded_lr_selector.fit(X, y)
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')
print('Features kept', embeded_lr_feature)
Even though I've set 20 variables to be informative and added 2 completely random ones, in many cases this will keep rand_feat2 when selecting the top 10 or even the top 5. On a side note, I get different results even with the random state set... not sure why? But the point is that fairly often a random variable will be included as a top-5 feature. I am seeing similar behaviour with real-world data, where I have to get rid of a huge chunk of the variables before the random feature drops out, which makes me seriously doubt how reliable this is. How do I explain this?
EDIT:
Adding a screenshot along with the sklearn/pandas versions printed. I'm not always getting the random features included, but if I run it a few times one will be there. On my real-world dataset at least one is almost always included, even after removing about half the variables.
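One way to see why a random feature survives the selection (an inspection snippet added here, not part of the original post) is to look at the fitted L1 coefficients that SelectFromModel thresholds on:

# estimator_ is the LogisticRegression fitted inside SelectFromModel
coefs = pd.Series(abs(embeded_lr_selector.estimator_.coef_[0]), index=X.columns)
print(coefs.sort_values(ascending=False))
# If rand_feat2's coefficient is larger than those of some informative features,
# the selector will keep it when max_features trims the list by magnitude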

Why are LASSO results in sklearn (Python) and the MATLAB statistical package different?

I am using LassoCV from sklearn to select the best model by cross-validation. I found that the cross-validation gives different results depending on whether I use sklearn or the MATLAB statistical toolbox.
I used MATLAB and replicated the example given in
http://www.mathworks.se/help/stats/lasso-and-elastic-net.html
to get a figure like this
Then I saved the MATLAB data and tried to replicate the figure with lasso_path from sklearn; I got
Although there are some similarities between these two figures, there are also certain differences. As far as I understand, the parameter lambda in MATLAB and alpha in sklearn are the same; however, in this figure it seems that there are some differences. Can somebody point out which is the correct one, or am I missing something? Furthermore, the coefficients obtained are also different (which is my main concern).
Matlab Code:
rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
    X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
save randomData.mat % To be used in python code
[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');
disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictors')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)
Python Code:
import scipy.io
import numpy as np
import pylab as pl
from sklearn.linear_model import lasso_path, LassoCV
data=scipy.io.loadmat('randomData.mat')
X=data['X']
Y=data['Y'].flatten()
model = LassoCV(cv=10,max_iter=1000).fit(X, Y)
print 'alpha', model.alpha_
print 'coef', model.coef_
eps = 1e-2 # the smaller it is the longer is the path
models = lasso_path(X, Y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])
pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.semilogx(alphas_lasso,coefs_lasso)
pl.gca().invert_xaxis()
pl.xlabel('alpha')
pl.show()
I do not have MATLAB, but be careful: the value obtained with cross-validation can be unstable, because it is influenced by the way you subdivide the samples. Even if you run the cross-validation twice in Python, you can obtain two different results.
Consider this example:
import sklearn.cross_validation
import sklearn.linear_model

kf = sklearn.cross_validation.KFold(len(y), n_folds=10, shuffle=True)
cv = sklearn.linear_model.LassoCV(cv=kf, normalize=True).fit(x, y)
print cv.alpha_

kf = sklearn.cross_validation.KFold(len(y), n_folds=10, shuffle=True)
cv = sklearn.linear_model.LassoCV(cv=kf, normalize=True).fit(x, y)
print cv.alpha_
0.00645093258722
0.00691712356467
It's possible that alpha = lambda / n_samples, where n_samples = X.shape[0] in scikit-learn.
Another remark is that your path is not as piecewise linear as it could/should be. Consider reducing tol and increasing max_iter, for instance as sketched below.
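A hedged sketch applying the same idea to the LassoCV fit from the question (the parameter values here are illustrative, not prescriptive):

# Tighter tolerance and more iterations so the coordinate descent converges fully
model = LassoCV(cv=10, max_iter=100000, tol=1e-6).fit(X, Y)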
Hope this helps.
I know this is an old thread, but:
I'm actually working on piping over to LassoCV from glmnet (in R), and I found that LassoCV doesn't do too well with normalizing the X matrix first (even if you specify the parameter normalize = True).
Try normalizing the X matrix first when using LassoCV.
If it is a pandas object,
(X - X.mean())/X.std()
It seems you also need to multiply alpha by 2.
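A minimal sketch of the normalization suggestion above, assuming the numpy arrays X and Y loaded from randomData.mat in the question (for a numpy array the per-column mean and std need axis=0):

from sklearn.linear_model import LassoCV

# Standardize each column before fitting, as suggested above
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
model = LassoCV(cv=10).fit(X_std, Y)
print(model.alpha_)
print(model.coef_)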
Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.
These are the facts:
Mathworks have selected an example and decided to include it in their documentation.
Your MATLAB code produces exactly the same result as the example.
The alternative does not match the result, and has provided inaccurate results in the past.
This is my assumption:
The chance that Mathworks have chosen to put an incorrect example in their documentation is negligible compared to the chance that a reproduction of this example in an alternative way does not give the correct result.
The logical conclusion: your MATLAB implementation of this example is reliable and the other is not.
This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion is that you should continue with MATLAB to select your model.
