I want to use the BernoulliRBM implementation in scikit-learn for Restricted Boltzmann Machines, but I can't find a way or a parameter anywhere to set the number of Gibbs steps k for the PCD sampling. Should I assume that k=1 and can't be modified?
Yes, the training algorithm uses a hard-wired k=1. You can see this in the _fit method, which draws a single sample and then updates the parameters.
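If you do need more Gibbs steps, a hedged workaround (it does not change training itself) is to chain the public gibbs() method, which performs a single sampling sweep, on an already fitted model. A minimal sketch with toy data:

import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
X = (rng.rand(200, 16) > 0.5).astype(float)   # toy binary data

rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)                                    # training still uses a single Gibbs step internally

v = X[:5]
for _ in range(10):                           # 10 chained Gibbs sweeps for sampling
    v = rbm.gibbs(v)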
I'm using the locally linear embedding (LLE) method in scikit-learn for dimensionality reduction. The only examples that I could find belong to the scikit-learn documentation here and here, but I'm not sure how I should choose the parameters of the method. In particular, is there any relation between the dimension of the data points or the number of samples and the number of neighbors (n_neighbors) and the number of components (n_components)? All of the examples in scikit-learn use n_components=2; is this always the case? Finally, is there any other parameter that is critical to tune, or should I use the default settings for the rest of the parameters?
Is there any relation between the dimension of data points or the number of samples and the number of neighbors (n_neighbors) and number of components (n_components)?
Generally speaking, they are not related. n_neighbors is usually decided by the distances among samples; in particular, if you know the classes of your samples, it is better to set n_neighbors a little greater than the number of samples in each class. n_components, i.e. the reduced dimensionality, is determined by how redundant the data are in the original dimension; based on the specific data distribution and your own requirements, you can choose a proper dimension for the projected space.
n_components=2 maps the original high-dimensional space to a 2-D space; it is actually just a special case.
Is there any other parameter that is critical to tune, or I should use the default setting for the rest of parameters?
Here are several other parameters you should take care of.
reg is the weight-regularization term, which is not used in the original LLE paper. If you don't want to use it, simply set it to zero; note, however, that the default value of reg is 1e-3, which is already quite small.
eigen_solver. If your data set is small, it is recommended to use 'dense' for accuracy. You can do more research on this.
max_iter. The default value of max_iter is only 100, which often leaves the results unconverged (it only applies to the 'arpack' solver). If the results are not stable, choose a larger integer.
You can use GridSearchCV (scikit-learn) to choose the best values for you.
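Note that LocallyLinearEmbedding has no score method of its own, so GridSearchCV would need a custom scorer; a rough manual sweep over n_neighbors using the reconstruction_error_ attribute is a simpler sketch (the dataset and the candidate values below are just placeholders):

from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, _ = load_digits(return_X_y=True)

# Compare a few neighborhood sizes via the reconstruction error LLE reports.
for k in (5, 10, 20, 30):
    lle = LocallyLinearEmbedding(n_neighbors=k, n_components=2, eigen_solver='dense')
    lle.fit(X)
    print(k, lle.reconstruction_error_)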
I'm using scikit-learn to perform classification using SVM. I'm performing a binary classification task.
0: Does not belong to class A
1: Belongs to class A
Now, I want to optimize the parameters so that I get high recall. I don't care much about a few false positives, but objects belonging to class A should rarely be labelled as not belonging to A.
I use a SVM with linear kernel.
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X,Y)
clf.predict(...)
How should I choose other SVM parameters like C? Also, what is the difference between SVC with a linear kernel and LinearSVC?
The choice of kernel really depends on the data, so picking the kernel based on a plot of the data might be the way to go. This could be automated by running through all kernel types and picking the one that gives you the recall (or bias) you are looking for; you can then see the visual difference between the kernels for yourself.
Depending on the kernel, different arguments of the SVC constructor matter, but in general C is probably the most influential, as it is the penalty for getting it wrong. Decreasing C would increase the recall.
Other than that, there are more ways to get a better fit, for example by adding more features to the n_features of the X matrix passed to svm.fit(X, y).
And of course it can always be useful to plot the precision/recall curve to get a better feel for what the parameters are doing.
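A small sketch of such a plot, assuming the X and Y from the question and matplotlib being available:

import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)
clf = svm.SVC(kernel='linear').fit(X_train, y_train)

# Precision and recall as a function of the decision threshold on a held-out split
precision, recall, _ = precision_recall_curve(y_test, clf.decision_function(X_test))
plt.plot(recall, precision)
plt.xlabel('recall')
plt.ylabel('precision')
plt.show()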
Generally speaking, you can tackle this problem by penalizing the two types of errors differently during the learning procedure. If you look at the loss function, in particular in the primal/parametric setting, you can scale the penalty of false negatives by alpha and the penalty of false positives by (1 - alpha), where alpha is in [0, 1]. (A similar effect could be achieved by duplicating the positive instances in your training set, but that makes the problem unnecessarily larger and should be avoided for efficiency.)
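In scikit-learn, the closest built-in knob for this kind of asymmetric penalty is class_weight, which rescales C per class; the weights below are only illustrative values, not recommendations:

from sklearn import svm

# Weighting class 1 (class A) more heavily makes mistakes on A costlier, which
# typically pushes recall on A up at the expense of more false positives.
clf = svm.SVC(kernel='linear', C=1.0, class_weight={0: 1.0, 1: 5.0})
clf.fit(X, Y)   # X, Y as in the question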
You can choose the SVM parameter C, which is basically your penalty term, by cross-validation; K-fold cross-validation works here. You can also use the scikit-learn class GridSearchCV, to which you pass your model and which then performs cross-validation on it according to the cv parameter.
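A minimal sketch of that, scoring on recall since that is what the question wants to maximize (the grid values are placeholders, and X, Y are as in the question):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10, 100]}          # placeholder grid
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, scoring='recall', cv=5)
search.fit(X, Y)
print(search.best_params_, search.best_score_)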
According to the LinearSVC documentation:
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
I am trying to do logistic regression on a huge data set using scikit-learn's SGDClassifier (I am using partial_fit, to be precise). The coefficients I obtain have different signs, whereas I would like to force the classifier to use only positive coefficients (I know it may not be the best approach methodologically, but it is what is needed for now).
My question is:
Is there any way to impose constraints on coefficients using SGDClassifier?
Thanks for your time
This is not possible with SGDClassifier in its current implementation.
If you wanted to implement this, you would have to add a penalty, call it e.g. 'positivity', that enforces the constraint by placing infinite cost on negative values.
It may be possible to implement this following e.g. Duchi 2009 (but I think there are follow-ups in newer literature that could be more up to the job). What you need to do at every mini-batch is project onto the positive orthant, which simply means setting to 0 every coefficient that is negative after a gradient step on the logistic loss.
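A rough sketch of that projection trick, under the assumption that clipping coef_ after each partial_fit call is an acceptable approximation (batches stands in for your own mini-batch iterator, and the loss name is 'log_loss' in recent scikit-learn versions, 'log' in older ones):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=0.01)

for X_batch, y_batch in batches:                     # hypothetical mini-batch iterator
    clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
    np.clip(clf.coef_, 0, None, out=clf.coef_)       # project negative weights back to 0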
What is strange is that the code for fit and partial_fit seems to be exactly the same.
You can see the code at the following link:
https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L478
Not exactly the same code; partial_fit uses total_samples:
"total_samples : int, optional (default=1e6)
Total number of documents. Only used in the partial_fit method."
https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L184
(partial fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L472
(fit) https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/decomposition/online_lda.py#L510
Just in case it is of interest: partial_fit is a good candidate whenever your dataset is really big. Instead of running into possible memory problems, you perform the fitting in smaller batches, which is called incremental learning.
So, in your case you should take into account that the default value of total_samples is 1,000,000. Therefore, if you don't change this number and your real number of samples is bigger, you'll get different results from fit and partial_fit. Or it could be that you are using mini-batches in partial_fit and not covering all the samples that you provide to fit. And even if you do this right, you could still get different results, as stated in the documentation:
"the incremental learner itself may be unable to cope with new/unseen targets classes. In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter."
"[...] choosing a proper algorithm is that all of them don’t put the same importance on each example over time [...]"
sklearn documentation: https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning
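As a hedged illustration of the total_samples point (toy counts and placeholder sizes, not the asker's data), an incremental LDA fit might look like this:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(2000, 100))              # toy document-term counts

n_total = X.shape[0]
lda = LatentDirichletAllocation(n_components=10, total_samples=n_total, random_state=0)

# Feed the corpus in chunks; total_samples should reflect the full corpus size,
# otherwise the online updates are scaled as if there were 1e6 documents.
for start in range(0, n_total, 500):
    lda.partial_fit(X[start:start + 500])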
I have almost 900,000 rows of information that I want to run through scikit-learn's Random Forest Classifier algorithm. The problem is that when I try to create the model my computer freezes completely, so what I want to try is fitting the model 50,000 rows at a time, but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
rfc.fit(X,Y)
#what I want is
model = rfc.fit(X.iloc[0:50000],Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000],Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000],Y.iloc[0:150000])
#... and so on
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine, and that you're using n_jobs=-1 (or some combination of the three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
from sklearn.ensemble import RandomForestClassifier

# First build 100 trees on the first 50,000 rows
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])
# add another 100 trees, fitted on the first 100,000 rows
clf.set_params(n_estimators=200)
clf.fit(X.iloc[0:100000], Y.iloc[0:100000])
# and so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[0:150000], Y.iloc[0:150000])
Another possibility is to fit a new classifier on each chunk and then either average the predictions from all classifiers or merge the trees into one big random forest, as described here.
Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically, I trained a number of DecisionTreeClassifiers one at a time on different partitions of the training data. I saved each model via pickling and afterwards loaded them into a list, which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately; a sketch follows below.
I ran into memory issues when I built all the trees in a single Python script. If you use this method and run into that issue, there's a work-around; I posted it in the linked question.
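A rough sketch of that idea, under the assumptions that chunks is your own iterator over data partitions, that X_small, y_small is a small sample containing every class, and that the task is single-output classification:

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Train one tree per partition of the data.
trees = []
for X_chunk, y_chunk in chunks:                      # hypothetical chunk iterator
    tree = DecisionTreeClassifier(max_features='sqrt')
    tree.fit(X_chunk, y_chunk)
    trees.append(tree)

# Fit a tiny forest once so its bookkeeping attributes (classes_, n_outputs_, ...)
# get populated, then graft the separately trained trees onto it.
forest = RandomForestClassifier(n_estimators=1)
forest.fit(X_small, y_small)                         # small sample containing all classes
forest.estimators_ = trees
forest.n_estimators = len(trees)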
from sklearn.datasets import load_iris
from sklearn.utils import shuffle
# Shuffle so that every chunk below contains samples from every class
# (see the note on warm_start and classes further down).
iris = load_iris()
X, y = shuffle(iris.data, iris.target, random_state=0)
### RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
# 10 trees on the first 50 samples
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
# 10 more trees on the next 50 samples
rfc.n_estimators += 10
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))
# 10 more trees on the final 50 samples
rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))
Below is the difference between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance, as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
Some algorithms in scikit-learn implement partial_fit() methods, which is what you are looking for. There are random forest algorithms that support this; however, I believe scikit-learn's random forest is not one of them.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets, and assemble a really big forest at the end:
Combining random forest models in scikit learn