I'm using the sklearn SVC implementation for multiclass SVM.
My model is supposed to have multiple outputs, so I'm using one-hot encoding on my labels (MultiLabelBinarizer).
mlb = MultiLabelBinarizer(classes=classes, sparse_output=True)
mlb.fit(y_train)
y_train = mlb.transform(y_train)
This gives me a vector of labels per sample: y_train is a csr_matrix of shape (n_samples, n_classes), specifically (18171, 17).
My training set is a scipy csc_matrix of shape (n_samples, n_features), specifically (18171, 1001).
m_o_SVC = MultiOutputClassifier(SVC(C=0.1, kernel='linear', probability=True), n_jobs=-1)
m_o_SVC.fit(X_train, y_train)
This trains several classifiers, one per output label.
But I get this Warning:
"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()."
And this error:
"Found input variables with inconsistent numbers of samples: [18171, 1]"
If I don't use a sparse matrix for the labels, everything works, but I am not sure whether a dense label representation will force the algorithm to work with dense matrices internally (with a loss of performance).
Also, I don't understand the problem since the shapes are consistent.
Is this a problem with sklearn?
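For reference, here is a minimal sketch of the dense-label workaround mentioned above (assuming classes, X_train and y_train are defined as in the snippets above):

from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Same binarization, but without sparse_output, so the label matrix stays dense
mlb = MultiLabelBinarizer(classes=classes)
Y_train = mlb.fit_transform(y_train)  # ndarray of shape (n_samples, n_classes)

# X_train can remain sparse; only the labels are dense here
m_o_SVC = MultiOutputClassifier(SVC(C=0.1, kernel='linear', probability=True), n_jobs=-1)
m_o_SVC.fit(X_train, Y_train)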
Related
I want to fit these two lists with sklearn, but at the end it says: "could not convert string to float..." Can you help me with that?
from sklearn import tree
x = ['BMW', '20000miles', '2010']
y = ['12000']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
A number of things.
From the documentation:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels) as integers or strings.
Your input to fit should be an array of shape (n_samples, n_features). Do you have 1 sample with 3 features? I suppose that is ok, but fitting 1 sample doesn't make much sense.
But your model can't interpret "BMW"; it expects a float. So if you have 3 types of cars (BMW, AUDI, MERCEDES), convert them to numbers, e.g. 1, 2, 3, to represent them, as sketched below.
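A minimal sketch of that idea, with a second made-up sample added so the fit has at least two rows (the brand mapping and the extra values are placeholders):

from sklearn import tree

# Hypothetical mapping from brand strings to numbers
brand_codes = {'BMW': 1, 'AUDI': 2, 'MERCEDES': 3}

# Each row is one car: (brand code, mileage, year); the second row is made up
x = [[brand_codes['BMW'], 20000, 2010],
     [brand_codes['AUDI'], 35000, 2012]]
y = [12000, 9000]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)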
I have BoW vectors of shape (100000, 56000) and I want to use MultinomialNB from scikit-learn for a classification task.
Does MultinomialNB take a sparse matrix for fitting the data?
I can't convert it into a dense matrix with toarray() due to a memory error. If the NB classifier doesn't take a sparse matrix, are there any alternatives I could use for fitting the data without converting it into a dense matrix?
From the documentation of MultinomialNB.fit (emphasis added):
fit(X, y, sample_weight=None)
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
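So yes, a sparse matrix can be passed to fit directly. A minimal sketch with a made-up toy corpus (documents and labels are placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free offer now", "meeting at noon", "special free offer"]  # placeholder corpus
labels = [1, 0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # scipy sparse matrix, never densified

clf = MultinomialNB()
clf.fit(X, labels)  # sparse input is accepted as-is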
I have applied CountVectorizer() on my X_train and it returned a sparse matrix.
Usually, if we want to standardize a sparse matrix, we pass the with_mean=False parameter.
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
But in my case, after applying CountVectorizer on my X_train, I have also performed PCA (TruncatedSVD) to reduce dimensions. Now my data is not a sparse matrix.
So can I now apply StandardScaler() directly without passing with_mean=False (i.e. with_mean=True)?
If you take a look at what the with_mean parameter does, you'll find that it simply centers your data before scaling. The reason you don't center a sparse matrix is that centering would turn it into a dense matrix, which destroys its sparsity and occupies much more memory.
After you perform PCA (TruncatedSVD), your data is a dense array with reduced dimensions and can be centered before scaling. So yes, you can apply StandardScaler() directly.
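A minimal sketch of that sequence, on a placeholder corpus and with an arbitrary component count:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

docs = ["some text here", "more text here", "and yet more text"]  # placeholder corpus

X_sparse = CountVectorizer().fit_transform(docs)                   # sparse counts
X_reduced = TruncatedSVD(n_components=2).fit_transform(X_sparse)   # dense ndarray

# Dense data, so the default with_mean=True is fine
X_scaled = StandardScaler().fit_transform(X_reduced)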
I have a VotingClassifier made up of 200 individual SVM classifiers. By default, this classifier uses majority-rule voting. I want to set a custom threshold, so that a classification is only made if 60% or more of the SVM classifiers predict the same class.
If 59% of SVM classifiers have the same classification, I do not want the ensemble model to make a classification.
I don't see a parameter to do this for the VotingClassifier object, but I assume it must be possible somewhere in scikit-learn. Is there a different ensemble class I should be using?
Based on the methods listed at the end of the documentation page, the simplest solution is to use the transform method:
def transform(self, X):
"""Return class labels or probabilities for X for each estimator.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
Returns
-------
If `voting='soft'` and `flatten_transform=True`:
array-like = (n_classifiers, n_samples * n_classes)
otherwise array-like = (n_classifiers, n_samples, n_classes)
Class probabilities calculated by each classifier.
If `voting='hard'`:
array-like = [n_samples, n_classifiers]
Class labels predicted by each classifier.
"""
Then just write a simple function that takes the sum of each row divided by the number of SVMs and applies your threshold:
def vote_with_threshold(row, threshold=0.6):
    # row: hard-vote labels (0/1) from each SVM for one sample
    ratio = sum(row) / len(row)
    if ratio > threshold:
        return 1
    elif ratio < (1 - threshold):
        return 0
    else:
        # we don't make the prediction
        return -1
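For example, assuming a fitted hard-voting ensemble named voting_clf and a test matrix X_test (both names are placeholders), the per-classifier votes from transform can be thresholded row by row:

import numpy as np

votes = voting_clf.transform(X_test)  # shape (n_samples, n_classifiers) when voting='hard'

# -1 marks the samples where the ensemble does not commit to a prediction
predictions = np.array([vote_with_threshold(row, threshold=0.6) for row in votes])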
How can I do classification or regression in sklearn if I want to weight each sample differently? Is there a way to do it with a custom loss function? If so, what does that loss function look like in general? Is there an easier way?
To weigh individual samples, feed a sample_weight array to the estimator's fit method. This should be a 1-d array of length n_samples (i.e. the same dimension as y in most tasks):
estimator.fit(X, y, sample_weight=some_array)
Not all models support this; check the documentation.
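A minimal sketch with a toy dataset (the data and weights are arbitrary, just to show the call):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
weights = np.array([1.0, 1.0, 5.0, 5.0])  # give the last two samples more influence

clf = LogisticRegression()
clf.fit(X, y, sample_weight=weights)  # LogisticRegression.fit accepts sample_weight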