I have BoW vectors of shape (100000, 56000), and I want to use MultinomialNB from scikit-learn for a classification task.
Does MultinomialNB take sparse matrix for fitting the data?
I can't seem to convert it into a dense matrix with toarray() due to a memory error. If the NB classifier doesn't take a sparse matrix, are there any alternatives I could use for fitting the data without converting it into a dense matrix?
From the documentation of MultinomialNB.fit (emphasis added):
fit(X, y, sample_weight=None)
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
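In other words, MultinomialNB accepts a scipy sparse matrix directly, so there is no need to call toarray(). A minimal sketch, with a tiny toy matrix standing in for the real (100000, 56000) BoW data:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import MultinomialNB

# Tiny sparse count matrix standing in for the real BoW features
X_sparse = csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 0]]))
y = np.array([0, 1, 0])

clf = MultinomialNB()
clf.fit(X_sparse, y)          # the sparse matrix is used as-is, no toarray() needed
print(clf.predict(X_sparse))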
I want to fit these two lists with sklearn, but at the end it says: could not convert string to float. Can you help me with that?
from sklearn import tree
x = ['BMW', '20000miles', '2010']
y = ['12000']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
A number of things.
From the documentation:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels) as integers or strings.
Your input to fit should be an array of shape (n_samples, n_features). Do you have 1 sample with 3 features? I suppose that is ok, but fitting 1 sample doesn't make much sense.
But your model can't interpret "BMW"; it expects a float. So if you have 3 types of cars, BMW, AUDI, MERCEDES, convert them to numbers, e.g. 1, 2, 3, to represent them.
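As a rough illustration (the extra rows and the 1/2/3 mapping are made up for the example; in practice you would use an encoder such as OrdinalEncoder or OneHotEncoder):

from sklearn import tree

# Each row is one sample: [brand_code, miles, year]
# BMW -> 1, AUDI -> 2, MERCEDES -> 3 (a hand-made mapping, for illustration only)
x = [[1, 20000, 2010],
     [2, 35000, 2008],
     [3, 12000, 2015]]
y = [12000, 9000, 20000]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
print(clf.predict([[1, 20000, 2010]]))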
I have applied CountVectorizer() on my X_train and it returned a sparse matrix.
Usually, if we want to standardize a sparse matrix, we pass the with_mean=False parameter:
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
But in my case, after applying CountVectorizer to my X_train, I have also performed PCA (TruncatedSVD) to reduce the dimensions. Now my data is no longer a sparse matrix.
So can I now apply StandardScaler() directly, without passing with_mean=False (i.e. with with_mean=True)?
If you take a look at what the with_mean parameter does, you'll find that it simply centers your data before scaling. The reason you don't center a sparse matrix is that centering turns it into a dense matrix, which occupies much more memory and destroys the sparsity you were relying on in the first place.
After you perform PCA your data has reduced dimensions and can now be centered before scaling. So yes, you can apply StandardScaler() directly.
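A minimal sketch of that sequence, with a toy corpus and an arbitrary component count (your own vectorizer, SVD settings and data will differ):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets",
          "logs and mats are things"]

X_counts = CountVectorizer().fit_transform(corpus)                 # sparse matrix
X_reduced = TruncatedSVD(n_components=3).fit_transform(X_counts)   # dense array now
X_scaled = StandardScaler().fit_transform(X_reduced)               # with_mean=True is the default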
I'm using the sklearn SVC implementation for multiclass SVM.
My model is supposed to have multiple outputs, so I'm using one-hot encoding on my labels (MultiLabelBinarizer).
mlb = MultiLabelBinarizer(classes=classes, sparse_output=True)
mlb.fit(y_train)
y_train = mlb.transform(y_train)
This gives me a vector of labels per sample; y_train is a csr_matrix of shape (n_samples, n_classes), specifically (18171, 17).
My training set is a scipy csc_matrix of shape (n_samples, n_features), specifically (18171, 1001).
m_o_SVC = MultiOutputClassifier(SVC(C=0.1, kernel='linear', probability=True), n_jobs=-1)
m_o_SVC.fit(X_train, y_train)
This trains several classifiers each with a slice of the labels.
But I get this Warning:
"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()."
And this error:
"Found input variables with inconsistent numbers of samples: [18171, 1]"
If I don't use a sparse matrix for the labels, everything works, but I am not sure whether using a dense label representation will cause the algorithm to work with dense matrices internally (with a loss of performance).
Also, I don't understand the problem since the shapes are consistent.
Is this a problem with sklearn?
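For reference, the "dense labels" variant mentioned above looks roughly like this; the toy matrices below only stand in for the question's data, and the point is that y is passed as a 2-d dense array of shape (n_samples, n_classes) while X stays sparse:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Toy stand-ins: 20 samples, 5 features, 3 binary labels (each column has both classes)
X_train = csr_matrix(np.random.RandomState(0).rand(20, 5))
y_train_dense = np.tile(np.array([[1, 0, 1],
                                  [0, 1, 0]]), (10, 1))

m_o_SVC = MultiOutputClassifier(SVC(C=0.1, kernel='linear', probability=True), n_jobs=-1)
m_o_SVC.fit(X_train, y_train_dense)   # dense 2-d y, sparse X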
I am trying to do a dimension reduction using PCA from scikit-learn. My data set has around 300 samples and 4096 features. I want to reduce the dimensions to 400 and 40. But when I call the algorithm, the resulting data has at most "number of samples" features.
from sklearn.decomposition import PCA
pca = PCA(n_components = 400)
trainData = pca.fit_transform(trainData)
testData = pca.transform(testData)
The initial shape of trainData is 300x4096, and the resulting shape is 300x300. Is there any way to perform this operation on this kind of data (many features, few samples)?
The maximum number of principal components that can be extracted from an M x N dataset is min(M, N). It's not an algorithm issue; fundamentally, that is the maximum number there are.
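A quick illustration with random data of the same shape (the numbers are made up; requesting 400 components from 300 samples exceeds min(300, 4096), so only up to 300 are available, while 40 works fine):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
trainData = rng.rand(300, 4096)        # 300 samples, 4096 features, as in the question

pca = PCA(n_components=40)             # 40 <= min(300, 4096), so this works
reduced = pca.fit_transform(trainData)
print(reduced.shape)                   # (300, 40)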
How can I do classification or regression in sklearn if I want to weight each sample differently? Is there a way to do it with a custom loss function? If so, what does that loss function look like in general? Is there an easier way?
To weigh individual samples, feed a sample_weight array to the estimator's fit method. This should be a 1-d array of length n_samples (i.e. the same dimension as y in most tasks):
estimator.fit(X, y, sample_weight=some_array)
Not all models support this; check the documentation.
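For example, LogisticRegression accepts sample_weight; the data and weights below are arbitrary, chosen only to show the call:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
weights = np.array([1.0, 1.0, 5.0, 5.0])   # give the class-1 samples 5x more influence

clf = LogisticRegression()
clf.fit(X, y, sample_weight=weights)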