I am having a strange issue, i am using a RandomizedSearchCV to optimize my parameters.
para_RS = {"max_depth": randint(1,70),
"max_features": ["log2", "sqrt"],
"min_samples_leaf": randint(5, 50),
"criterion": ["entropy","gini"],
"class_weight":['balanced'],
"max_leaf_nodes":randint(2,20)
}
dt = DecisionTreeClassifier()
if i include all these parameters, the output comes in 2-3 mins, however if i remove all the parameters and keeping only the below parameter, its taking forever to run and i have to kill the notebook
para_RS = {
"max_depth": randint(1,70)
}
and also if i remove fewer its take long time to run(5-10 mins).
below is the code:
if (randomsearch == True):
tick = time.time()
print("Random_Search_begin")
rs= RandomizedSearchCV(estimator=dt, cv=5, param_distributions=para_RS,
n_jobs=4,n_iter =30, scoring="roc_auc",return_train_score=True)
rs.fit(trainx_outer,trainy_outer)
# other code irrelevant to the issue...
print("Random_Search_end")
This is due to random nature of the following:
"max_depth": randint(1,70)
"max_leaf_nodes":randint(2,20)
randint(1, 70) will return an integer between 1, 70. So during different runs, a different value of max_depth is generated.
So it may happen that during a certain run, the value generated is very high. The speed of DecisionTreeClassifier is impacted by the value of max_depth and it max_leaf_nodes. If these are very large, the time will be very large.
Also, I am not sure how you are able to run this code. Because RandomizedSearchCV takes a parameter grid of dictionary of iterables. But your code will generate a single int for "max_depth", "max_leaf_nodes" instead of an array or iterable. So it should throw an error. Which version of sklearn are you using? Or is the code you shown here different than actual?
you can close this, as it seems the problem went away when i started using the random seed in both the classifier and RandomSearchCV. Thanks for all the help.
Related
i'm working in a machine learning project and i'm stuck with this warning when i try to use cross validation to know how many neighbours do i need to achieve the best accuracy in knn; here's the warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
The dataset i'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance
In this dataset we have several attributes, but we'll be using only "G1", "G2", "G3", "studytime","freetime","health","famrel". all the instances in those columns are integers.
https://i.stack.imgur.com/sirSl.png <-dataset example
Next,here's my first chunk of code where i assign the train and test groups:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/gdrive')
import sklearn
data=pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")
#print(data.head())
data = data[["G1", "G2", "G3", "studytime","freetime","health","famrel"]]
print(data)
predict = "G3"
x = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
print(y)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))
That's how i assign x and y. with len, i can see that x and y have 649 rows both, representing 649 students.
Here's the second chunk of code when i do the cross_val:
#CROSSVALIDATION
from sklearn.neighbors import KNeighborsClassifier
neighbors = list (range(2,30))
cv_scores=[]
#print(y_train)
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
for k in neighbors:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn,x_train,y_train,cv=11,scoring='accuracy')
cv_scores.append(scores.mean())
plt.plot(cv_scores)
plt.show()```
the code is pretty self explanatory as you can tell
The warning:
The least populated class in y has only 1 members, which is less than n_splits=10.
happens in every iteration of the for-loop
Although this warning happens every time, plt.show() is still able to plot a graph regarding which amount of neighbours is best to achieve a good accuracy, i dont know if the plot, or the readings in cv_scores are accurate.
my question is :
How my "class in y" has only 1 members, len(y) clearly says y have 649 instances, more than enough to be splitted in 59 groups of 11 members each one?, By members is it referring to "instances" in my dataset, or colums/labels in the y group?
I'm not using stratify=y when i do the train/test splits, it's seems to be the 1# solution to this warning but its useless in my case.
I've tried everything i've seen on google/stack overflow and nothing helped me, the dataset seems to be the problem, but i can´t understand whats wrong.
I think your main mistake is that your are using KNeighborsClassifier, and your feature to predict seems to be continuous (G3 - final grade (numeric: from 0 to 20, output target)) and not categorical.
In this case, every single value of the "y" is taken as a different possible class or label. The message you obtain is saying that in your dataset (on the "y"), there are values that only appears one time. For example, the values 3, appears only one time inside your dataset. This is not an error, but indicates that the model won't work correctly or accurate.
After all, I strongly reccomend you to use the sklearn.neighbors.KNeighborsRegressor.
This is the Knn used for "continuous" variables and not classes. Using this model, you shouldn't have this problem anymore. The output value will be the mean between the number of nearest neighbors you defined.
With this simple changes, your problem will be solved.
I've been using SelectFromModel from sklearn to reduce features using LASSO regularization and I'm finding that even when I set max_features to quite low (low enough to negatively impact performance) the random variables are often kept.
I generated an example with fake data to illustrate, but I'm seeing similar with actual real data and I am trying to understand why.
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from numpy import random
data = datasets.make_classification(n_features = 20, n_informative = 20, n_redundant = 0,n_samples= 1000, random_state = 3)
X = pd.DataFrame(data[0] )
y = data[1]
X['rand_feat1'] = random.randint(100, size=(X.shape[0]))
X['rand_feat2'] = random.randint(100, size=(X.shape[0]))/100
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1",
solver='liblinear',random_state = 3),max_features=10)
embeded_lr_selector.fit(X, y)
embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')
print('Features kept', embeded_lr_feature)
Even though I've set 20 variables to be informative, and added 2 completely random ones, in many cases this will keep rand_feat2 when selecting the top 10 or even top 5. On a side note I get different results even with random state set....not sure why? But the point is fairly often a random variable will be included as a top 5 feature. I am seeing similar with real world data, where I have to get rid of a huge chunk of the variables to get rid of the random feature....makes me seriously doubt how reliable it is? How do I explain this?
EDIT:
Adding a screenshot along with sklearn/pandas versions printed... I'm not always getting the random features included, but if I run it a few times it will be there. On my real world dataset at least one is almost always included even after removing about half the variables.
I have fit a Kmeans model on document embeddings from a Doc2Vec model to cluster the embeddings and get a visualization as well as the most frequent terms per cluster. I have been able to do this fine and get the same visualization each time.
When I run the kmeans.fit_predict on the model it gives me a list of cluster labels according to the clusters I have specified of the same length as the number of document embeddings I have. The issue comes when running the model multiple times it gives a similar spread per cluster each time but the cluster labels will change after running it multiple times. For example,
Run 1 - 0:100, 1:100, 2:10
Run 2 - 0:99 , 1:101, 2:10
Run 3 - 2:100, 0:100, 1:10
Run 4 - 0:100, 1:100, 2:10
I tried saving the model and using the same model multiple times but encountered the same issue. This causes the most frequent terms per cluster and position of the cluster in the visualization to change, which changes the way it is interpreted. I was planning to use the labels as a classification method but doesn't this make that impossible? I'm not sure if its an issue with my code or if this is normal behavior if anyone can help it would be much appreciated.
df = pd.read_csv("data.csv")
d2v_model = Doc2Vec.load("d2vmodel")
clusters = 3
iterations = 100
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations)
X = kmeans_model.fit(d2v_model.docvecs.vectors_docs)
l = kmeans_model.fit_predict(d2v_model.docvecs.vectors_docs)
labels = kmeans_model.labels_.tolist()
pca = PCA(n_components=2).fit(d2v_model.docvecs.vectors_docs)
datapoint = pca.transform(d2v_model.docvecs.vectors_docs)
df["clusters"] = labels
cluster_list = []
cluster_colors = ["#FFFF00", "#008000", "#0000FF"]
plt.figure
color = [cluster_colors[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker="^", s=150, c="#000000")
plt.show()
for i in range(clusters):
df_temp = df[df["clusters"]==i]
cluster_words = Counter(" ".join(df_temp["Body"].str.lower()).split()).most_common(25)
[cluster_list.append(x[0]) for x in cluster_words]
cluster_list.clear()
for Kmeans, when you run fit for multiple time, every time centroid will be initialized randomly. To make it deterministic you can use random_state parameters. you can refer to the docs https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmeans_model = KMeans(n_clusters=clusters, init='k-means++', max_iter=iterations, random_state = 'int number need to given')
Stabilizing the initialization randomization by specifying a random_state (per #qaiser's answer) may help – perhaps by ensuring similar-ish sets of doc-vectors, against same starting KMeans state, tends to find the 'same' clusters in the same named slots.
But there could be situations, where the doc-vectors have a different distribution, or where initialized state is (by bad luck) highly sensitive to doc-vector distribution, where even this repeated-initialization doesn't maintain coherent clusters.
You might want to also consider one or both of:
(1) initializing the KMeans clusters to match the prior run's centroids, to bias the later analysis towards creating compatibly named/centered clusters;
(2) after the second run finishes, rename the clusters according to which (of all possible 3! arbitrary naming permutations of 3 clusters) leaves the smallest possible total distances between each 'new' cluster of the same name to the 'prior' cluster of the same name.
I think the issue might be use of .fit_predict. Try just .predict see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
try:
l = kmeans_model.predict(d2v_model.docvecs.vectors_docs)
similar worked for me
In the following code I create a random sample set of size 50, with 20 features each. I then generate a random target vector composed of half True and half False values.
All of the values are stored in Pandas objects, since this simulates a real scenario in which the data will be given in that way.
I then perform a manual leave-one-out inside a loop, each time selecting an index, dropping its respective data, fitting the rest of the data using a default SVC, and finally running a prediction on the left-out data.
import random
import numpy as np
import pandas as pd
from sklearn.svm import SVC
n_samp = 50
m_features = 20
X_val = np.random.rand(n_samp, m_features)
X = pd.DataFrame(X_val, index=range(n_samp))
# print X_val
y_val = [True] * (n_samp/2) + [False] * (n_samp/2)
random.shuffle(y_val)
y = pd.Series(y_val, index=range(n_samp))
# print y_val
seccess_count = 0
for idx in y.index:
clf = SVC() # Can be inside or outside loop. Result is the same.
# Leave-one-out for the fitting phase
loo_X = X.drop(idx)
loo_y = y.drop(idx)
clf.fit(loo_X.values, loo_y.values)
# Make a prediction on the sample that was left out
pred_X = X.loc[idx:idx]
pred_result = clf.predict(pred_X.values)
print y.loc[idx], pred_result[0] # Actual value vs. predicted value - always opposite!
is_success = y.loc[idx] == pred_result[0]
seccess_count += 1 if is_success else 0
print '\nSeccess Count:', seccess_count # Almost always 0!
Now here's the strange part - I expect to get an accuracy of about 50%, since this is random data, but instead I almost always get exactly 0! I say almost always, since every about 10 runs of this exact code I get a few correct hits.
What's really crazy to me is that if I choose the answers opposite to those predicted, I will get 100% accuracy. On random data!
What am I missing here?
Ok, I think I just figured it out! It all comes down to our old machine learning foe - the majority class.
In more detail: I chose a target comprising 25 True and 25 False values - perfectly balanced. When performing the leave-one-out, this caused a class imbalance, say 24 True and 25 False. Since the SVC was set to default parameters, and run on random data, it probably couldn't find any way to predict the result other than choosing the majority class, which in this iteration would be False! So in every iteration the imbalance was turned against the currently-left-out sample.
All in all - a good lesson in machine learning, and an excelent mathematical riddle to share with your friends :)
I am using LaasoCV from sklearn to select the best model is selected by cross-validation. I found that the cross validation gives different result if I use sklearn or matlab statistical toolbox.
I used matlab and replicate the example given in
http://www.mathworks.se/help/stats/lasso-and-elastic-net.html
to get a figure like this
Then I saved the matlab data, and tried to replicate the figure with laaso_path from sklearn, I got
Although there are some similarity between these two figures, there are also certain differences. As far as I understand parameter lambda in matlab and alpha in sklearn are same, however in this figure it seems that there are some differences. Can somebody point out which is the correct one or am I missing something? Further the coefficient obtained are also different (which is my main concern).
Matlab Code:
rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
save randomData.mat % To be used in python code
[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');
disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictos')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)
Python Code:
import scipy.io
import numpy as np
import pylab as pl
from sklearn.linear_model import lasso_path, LassoCV
data=scipy.io.loadmat('randomData.mat')
X=data['X']
Y=data['Y'].flatten()
model = LassoCV(cv=10,max_iter=1000).fit(X, Y)
print 'alpha', model.alpha_
print 'coef', model.coef_
eps = 1e-2 # the smaller it is the longer is the path
models = lasso_path(X, Y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])
pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.semilogx(alphas_lasso,coefs_lasso)
pl.gca().invert_xaxis()
pl.xlabel('alpha')
pl.show()
I do not have matlab but be careful that the value obtained with the cross--validation can be unstable. This is because it influenced by the way you subdivide the samples.
Even if you run 2 times the cross-validation in python you can obtain 2 different results.
consider this example :
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
0.00645093258722
0.00691712356467
it's possible that alpha = lambda / n_samples
where n_samples = X.shape[0] in scikit-learn
another remark is that your path is not very piecewise linear as it could/should be. Consider reducing the tol and increasing max_iter.
hope this helps
I know this is an old thread, but:
I'm actually working on piping over to LassoCV from glmnet (in R), and I found that LassoCV doesn't do too well with normalizing the X matrix first (even if you specify the parameter normalize = True).
Try normalizing the X matrix first when using LassoCV.
If it is a pandas object,
(X - X.mean())/X.std()
It seems you also need to multiple alpha by 2
Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.
These are the facts:
Mathworks have selected an example and decided to include it in their documentation
Your matlab code produces exactly the result as the example.
The alternative does not match the result, and has provided inaccurate results in the past
This is my assumption:
The chance that mathworks have chosen to put an incorrect example in their documentation is neglectable compared to the chance that a reproduction of this example in an alternate way does not give the correct result.
The logical conclusion: Your matlab implementation of this example is reliable and the other is not.
This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion would be that you should continue with Matlab to select your model.