How can I fit two lists in Python?

I want to fit these two lists with sklearn, but at the end it says: could not convert string to float… Can you help me with that?
from sklearn import tree
x = ['BMW', '20000miles', '2010']
y = ['12000']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)

A number of things.
From the documentation:
X : {array-like, sparse matrix} of shape (n_samples, n_features)
The training input samples. Internally, it will be converted to dtype=np.float32 and if a sparse matrix is provided to a sparse csc_matrix.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
The target values (class labels) as integers or strings.
Your input to fit should be an array of shape (n_samples, n_features). Do you have 1 sample with 3 features? I suppose that is OK, but fitting on 1 sample doesn't make much sense.
But your model can't interpret "BMW"; it expects a float. So if you have 3 types of cars (BMW, AUDI, MERCEDES), convert them to numbers, e.g. 1, 2, 3, to represent them, as in the sketch below.
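A minimal sketch of what that could look like; the AUDI and MERCEDES rows, their mileages, and the extra prices are invented for illustration:
from sklearn import tree

# Hypothetical integer codes for the car makes
make_codes = {'BMW': 1, 'AUDI': 2, 'MERCEDES': 3}

# Each row is one sample: [make_code, miles, year] -- all numeric
X = [[make_codes['BMW'], 20000, 2010],
     [make_codes['AUDI'], 35000, 2012],      # invented sample
     [make_codes['MERCEDES'], 50000, 2008]]  # invented sample
y = [12000, 18000, 9000]  # target prices; only 12000 comes from the question

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)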

Related

Expected 2D array, got 1D array instead. Where's the mistake?

I am beginning to learn SVM and PCA. I tried to apply SVM to the scikit-learn 'load_digits' dataset.
When I call the .fit method of SVC, I get this error:
"Expected 2D array, got 1D array instead:
array=[ 1.9142151 0.58897807 1.30203491 ... 1.02259477 1.07605691
-1.25769703].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample."
Here is the code I wrote:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
from sklearn.svm import SVC
clf = SVC(kernel='rbf', C=1E6)
X = reduced_data[:, 0]
y = reduced_data[:, 1]
clf.fit(X, y)
Can someone help me out? Thank you in advance.
Your error results from the fact that clf.fit() requires the array X to be 2-dimensional (currently it is 1-dimensional). Using X.reshape(-1, 1) turns X into a (N, 1) array (2D, as we would like) instead of (N,) (1D), where N is the number of samples in the dataset. However, I also believe that your interpretation of reduced_data may be incorrect (from my limited experience of sklearn):
The reduced_data array that you have contains the two principal components (the two most important features in the dataset, n_components=2), which you should be using as the new "data" (X).
Instead, you have taken the first column of reduced_data to be the samples X, and the second column to be the target values y. My understanding is that a better approach would be to make X = reduced_data, since the sample data should consist of both PCA features, and y = y_digits, since the labels (targets) are unchanged by PCA.
(I also noticed you defined pca = PCA(n_components=10).fit_transform(data) but did not go on to use it, so I have removed it from the code in my answer.)
As a result, you would have something like this:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.svm import SVC
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
# pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
clf = SVC(kernel='rbf', C=1e6)
clf.fit(reduced_data, y_digits)
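As an aside, the reshape(-1, 1) suggested by the error message is for the case where your data really does have a single feature; a minimal NumPy sketch of the shape change it performs:
import numpy as np

a = np.array([1.91, 0.59, 1.30])  # shape (3,): 1D, which fit() rejects
a2 = a.reshape(-1, 1)             # shape (3, 1): 2D, one feature per sample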
I hope this has helped!

Dimensions do not match in linear regression

I am trying a simple linear regression model but don't understand why I get the error below.
Here is my code:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, Y)
which produces the following error:
ValueError: Found input variables with inconsistent numbers of samples: [1518, 15]
The shapes of X and Y are:
X.shape, Y.shape
((1518, 1), (15, 1))
I am trying to predict these Y out of X but my dimensions are not the same; how can I overcome this problem?
It looks like you split your features and your outcome variable the wrong way around.
Given what you have written, you have N = 1518 samples and 15 variables, one of which is the outcome variable.
If this is the case, your input vector Y and matrix X should have the shapes:
X.shape = (1518, 14)
Y.shape = (1518, 1)
Assume you are given a pd.DataFrame with feature names F1...F15 and your dependent variable Y is F3; then you can split your variables as follows:
Y = df['F3']
X = df.drop('F3', axis=1)
Note: if you are currently using a NumPy array, you can easily wrap it in a DataFrame using:
import pandas as pd
df = pd.DataFrame(np_array)
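Putting it together, a minimal end-to-end sketch; the random data merely stands in for the asker's 1518 rows of 15 columns:
import numpy as np
import pandas as pd
from sklearn import linear_model

np_array = np.random.rand(1518, 15)  # placeholder for the real data
df = pd.DataFrame(np_array, columns=[f'F{i}' for i in range(1, 16)])

Y = df['F3']               # target, shape (1518,)
X = df.drop('F3', axis=1)  # features, shape (1518, 14)

regr = linear_model.LinearRegression()
regr.fit(X, Y)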

Does sparse matrix work with MultinomialNB?

I have BoW vectors of shape (100000, 56000) and I want to use MultinomialNB from scikit-learn for a classification task.
Does MultinomialNB take sparse matrix for fitting the data?
I can't convert it into a dense matrix with .toarray() due to a memory error. If the NB classifier doesn't take a sparse matrix, are there any alternatives I could use for fitting the data without converting it into a dense matrix?
From the documentation of MultinomialNB.fit (note the "sparse matrix" in the type of X):
fit(X, y, sample_weight=None)
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
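So yes, MultinomialNB.fit accepts a sparse matrix directly and there is no need to densify it. A minimal sketch, with a small synthetic CSR matrix standing in for the (100000, 56000) BoW vectors:
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import MultinomialNB

# Synthetic non-negative sparse data; the shape is illustrative only
X = sparse_random(1000, 5600, density=0.01, format='csr', random_state=0)
y = np.random.randint(0, 2, size=1000)

clf = MultinomialNB()
clf.fit(X, y)  # the sparse matrix is used as-is, never densified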

Is it possible to set a "threshold" for a scikit-learn ensemble classifier?

I have a VotingClassifier comprised of 200 individual SVM classifiers. By default, this classifier uses majority-rule voting. I want to set a custom threshold, where a classification is only made if 60% or more of the SVM classifiers agree.
If 59% of SVM classifiers have the same classification, I do not want the ensemble model to make a classification.
I don't see a parameter to do this for the VotingClassifier object, but I assume it must be possible somewhere in scikit-learn. Is there a different ensemble class I should be using?
Based on the methods listed at the end of the documentation page, the simplest solution is to use the transform method:
def transform(self, X):
"""Return class labels or probabilities for X for each estimator.
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and
n_features is the number of features.
Returns
-------
If `voting='soft'` and `flatten_transform=True`:
array-like = (n_classifiers, n_samples * n_classes)
otherwise array-like = (n_classifiers, n_samples, n_classes)
Class probabilities calculated by each classifier.
If `voting='hard'`:
array-like = [n_samples, n_classifiers]
Class labels predicted by each classifier.
"""
Then write a simple function that takes the sum of each row divided by the number of SVMs and applies your threshold:
def vote_with_threshold(ratio, threshold=0.6):
    # ratio = (votes for class 1 in a row) / (number of SVMs)
    if ratio > threshold:
        return 1
    elif ratio < (1 - threshold):
        return 0
    else:
        # we don't make the prediction
        return -1
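A minimal end-to-end sketch of that idea, assuming binary labels 0/1; the three small SVMs on synthetic data merely stand in for the asker's 200:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

# Synthetic binary data and a small hard-voting ensemble
X, y = make_classification(n_samples=200, random_state=0)
svms = [(f'svm{i}', SVC(C=10.0 ** i)) for i in range(3)]
ensemble = VotingClassifier(estimators=svms, voting='hard').fit(X, y)

labels = ensemble.transform(X)  # (n_samples, n_classifiers), per the docstring above
ratio = labels.mean(axis=1)     # fraction of SVMs voting for class 1
preds = np.array([vote_with_threshold(r) for r in ratio])  # -1 means no classification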

Sklearn, inconsistent numbers of samples using sparse matrices

I'm using the sklearn SVC implementation for multiclass SVM.
My model is supposed to have multiple outputs, so I'm using one-hot encoding on my labels (MultiLabelBinarizer).
mlb = MultiLabelBinarizer(classes=classes, sparse_output=True)
mlb.fit(y_train)
y_train = mlb.transform(y_train)
This gives me a vector of labels per sample, y_train is a csr_matrix of shape (n_samples, n_classes) specifically (18171, 17).
My training set is in the form of a scipy csc_matrix of shape (n_samples, n_feature) specifically (18171, 1001).
m_o_SVC = MultiOutputClassifier(SVC(C=0.1, kernel='linear', probability=True), n_jobs=-1)
m_o_SVC.fit(X_train, y_train)
This trains several classifiers each with a slice of the labels.
But I get this Warning:
"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()."
And this error:
"Found input variables with inconsistent numbers of samples: [18171, 1]"
If I don't use a sparse matrix for the labels, everything works, but I am not sure whether using a dense label representation will force the algorithm to work with dense matrices (with a loss of performance).
Also, I don't understand the problem since the shapes are consistent.
Is this a problem with sklearn?
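For reference, a minimal sketch of the workaround described above (keep X sparse, pass the labels densely); the small synthetic shapes stand in for (18171, 1001) and (18171, 17):
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Small synthetic stand-ins for the real training data
X_train = sparse_random(200, 100, density=0.05, format='csc', random_state=0)
y_train = np.random.randint(0, 2, size=(200, 5))  # dense multilabel targets

clf = MultiOutputClassifier(SVC(C=0.1, kernel='linear'))
clf.fit(X_train, y_train)  # sparse X is accepted as-is; dense y avoids the error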
