I'm performing a multi-class classification task using scikit-learn. In the setup I created, I want to compare different classification algorithms.
I use a pipeline where the text is inserted as X and Y is the class label (multi-class, N = 5). Textual features are extracted in the pipeline using TfidfVectorizer().
KNN does the job, but other classifiers give this: ValueError: bad input shape (670, 5)
Full traceback:
"/Users/Robbert/pipeline.py", line 62, in <module>
train_pipeline.fit(X_train, Y_train)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)
The code I use:
def read_data(f):
    data = []
    for row in csv.reader(open(f), delimiter=';'):
        if row:
            plottext = row[8]
            target = { 'Age': row[4] }
            data.append((plottext, target))
    (X, Ycat) = zip(*data)
    Y = DictVectorizer().fit_transform(Ycat)
    Y = preprocessing.LabelBinarizer().fit_transform(Y)
    return (X, Y)
X, Y = read_data('development2.csv')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
###KNN Pipeline
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', KNeighborsClassifier(n_neighbors=350, weights='uniform'))])
###Logistic regression Pipeline
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', LogisticRegression())])
##SVC
train_pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
    ('clf', SVC(C=1, kernel='rbf', gamma=0.001, probability=True))])
##Decision tree
#train_pipeline = Pipeline([
# ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
# ('clf', DecisionTreeClassifier(random_state=0))])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)
print accuracy_score(Y_test, predicted)
How is it possible that KNN accepts this array shape while the other classifiers don't? And how can I change the shape?
If you compare the documentation for the fit(X, y) method of KNeighborsClassifier and SVC, you will see that only the former accepts y in the form [n_samples, n_outputs].
Possible solution: why do you need the LabelBinarizer at all? Just don't use it.
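For example, a minimal rework of read_data that skips the encoding and keeps Y as a flat list of class values might look like this (a sketch, assuming the same CSV layout as above):

import csv

def read_data(f):
    data = []
    for row in csv.reader(open(f), delimiter=';'):
        if row:
            # keep the raw class value; no dict / DictVectorizer / LabelBinarizer needed
            data.append((row[8], row[4]))
    (X, Y) = zip(*data)
    return (list(X), list(Y))  # Y has shape (n_samples,), which SVC accepts

SVC, LogisticRegression and DecisionTreeClassifier all accept string or integer class labels directly for plain multi-class problems, so no encoding step is required.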
If your Y vector is of size (n_samples, n_classes) and contains at least a single row with more than one non-zero element, then you are solving a multi-label classification problem. If that is the case, the "Multiclass and multilabel algorithms" page in the scikit-learn docs lists KNN as one of the classifiers that supports multi-label classification. You might want to try other classifiers from that list (any of them drops into the same pipeline; see the sketch after the list):
* sklearn.tree.DecisionTreeClassifier
* sklearn.tree.ExtraTreeClassifier
* sklearn.ensemble.ExtraTreesClassifier
* sklearn.neural_network.MLPClassifier
* sklearn.neighbors.RadiusNeighborsClassifier
* sklearn.ensemble.RandomForestClassifier
* sklearn.linear_model.RidgeClassifierCV
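For example, a sketch of dropping one of them into the same pipeline (only appropriate if your targets really are multi-label):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

train_pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 3), min_df=1)),
    ('clf', RandomForestClassifier(n_estimators=100))])  # fit() accepts y of shape (n_samples, n_classes)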
Related
I'm trying to fit a Random Forest regression model. These are the steps I've followed (please see the code below, with comments):
Before fitting the model, I split into training and test sets.
I converted the results into arrays.
I reshaped them into 2D arrays using the reshape function, since the regressor likes them that way.
I'm getting the following error (it seems there is still a 1D array, even though I reshaped everything at the beginning):
ValueError: Expected 2D array, got 1D array instead:
array=[183. 27. 520. ... 23. 28. 34.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample.
Here's the code I've used:
#train & test split
X = order_final.loc[:, ~order_final.columns.isin(['lag','observed'])]
y = order_final['lag']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#convert X,y train and X test into arrays
X_train = X_train.to_numpy()
y_train = y_train.to_numpy()
X_test = X_test.to_numpy()
#make them 2D-arrays
X_train.reshape(-1,1)
y_train.reshape(-1,1)
X_test.reshape(-1,1)
# Fitting Random Forest Regression to the dataset
# import the regressor
from sklearn.ensemble import RandomForestRegressor
# create regressor object
RF = RandomForestRegressor(n_estimators = 100, random_state = 0)
# fit the regressor with x and y data
RF.fit(X_train, y_train)
#Prediction of test set
y_pred = RF.predict(X_test)
# View accuracy score
RF.score(y_test, y_pred)
And here is the shape of my arrays (which look good to me, but...):
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(y_pred.shape)
(7326, 10)
(7326,)
(1832, 10)
(1832,)
(1832,)
Can someone please help me out and point me to where the error is? Thanks in advance!
Stefano
RF.score() is not a (y_true, y_pred) metric like accuracy_score: its signature is score(X, y), and it calls predict(X) internally, so no amount of reshaping y_test and y_pred will make that call valid.
You need to pass the inputs used for making your predictions to the score() method instead, like so:
RF.score(X_test, y_test)
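For a regressor, score(X, y) returns the R² of self.predict(X) against y, so the call above is equivalent to computing the metric by hand (a short sketch reusing the fitted RF from the question):

from sklearn.metrics import r2_score

print(RF.score(X_test, y_test))              # R^2, predicting internally from X_test
print(r2_score(y_test, RF.predict(X_test)))  # the same value, computed explicitly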
This is the code below, from a Jupyter notebook with Python and the sklearn precision_score function. I want to get the score using x_test & y_test, and I got this error:
"Classification metrics can't handle a mix of continuous-multioutput and binary targets"
Code:
diabetes = pd.read_csv("datasets/diabetes.csv")
x = diabetes.drop(diabetes.columns[-1], axis=1)
y = diabetes.iloc[:,-1]
scaler = StandardScaler()
scaler.fit_transform(x)
svc = SVC()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
                                                    random_state=RANDOM_STATE)
svc.fit(x_train,y_train)
precision_score_svc = precision_score(x_test, y_test)
If you check the help page:
sklearn.metrics.precision_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')
[...]
Parameters
y_true 1d array-like, or label indicator array / sparse matrix
Ground truth (correct) target values.
y_pred 1d array-like, or label indicator array / sparse matrix
Estimated targets as returned by a classifier.
This means you need to provide the true labels first, followed by the predicted labels. It should be:
precision_score_svc = precision_score(y_test,svc.predict(x_test))
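Equivalently (a minimal sketch reusing the fitted svc from the question), you can compute the predictions explicitly first:

y_pred = svc.predict(x_test)                           # predicted class labels for the test set
precision_score_svc = precision_score(y_test, y_pred)  # true labels first, predictions second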
I'm trying to calculate the ROC and AUC for an SVM model I'm building. I'm following the code from this sklearn example. One of the requirements is that the output labels y need to be binarized. I do this by creating a MultiLabelBinarizer and encoding all the labels, which works fine. However, this creates an (n_samples, n_classes) ndarray, while the classifier's fit(X, y) function expects y.shape = (n_samples,). I essentially want to "smush" the columns of y together, so that y[0] would return the entire label vector V instead of just the first value of V.
Here's my code:
enc = MultiLabelBinarizer()
print("Encoding data...")
# Fit the encoder onto all possible data values
print(pandas.DataFrame(enc.fit_transform(df["present"] + df["member"].apply(str).apply(lambda x: [x])),
                       columns=enc.classes_, index=df.index))
X, y = enc.transform(df["present"]), list(df["member"].apply(str))
print("Training svm...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)
y_train = enc.transform([[x] for x in y_train]) # Strings to 1HotVectors
svc = svm.SVC(C=1.1, kernel="linear", probability=True, class_weight='balanced')
svc.fit(X_train, y_train) # y_train should be 1D but isn't
The exception I get is:
Traceback (most recent call last):
File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 129, in <module>
enc, clf, split_data = encode_and_train(df)
File "C:/Users/SawyerPC/PycharmProjects/DiscordSocialGraph/encode_and_train.py", line 57, in encode_and_train
svc.fit(X_train, y_train) # TODO y_train needs to be flattened to (n_samples,)
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\svm\base.py", line 149, in fit
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 547, in check_X_y
y = column_or_1d(y, warn=True)
File "C:\Users\SawyerPC\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (5000, 10)
I ended up fixing this by using a LabelEncoder. Thanks @G.Anderson. The flat_member_list is just a list of all the unique user ids encountered in both the labels y and the vectors X.
# Encode "present" users as OneHotVectors
mlb = MultiLabelBinarizer()
print("Encoding data...")
mlb.fit(df["present"] + df["member"].apply(str).apply(lambda x: [x]))
# Encode user labels as ints
enc = LabelEncoder()
flat_member_list = df["member"].apply(str).append(pandas.Series(np.concatenate(df["present"]).ravel()))
enc.fit(flat_member_list)
X, y = mlb.transform(df["present"]), enc.transform(df["member"].apply(str))
print("Training svm...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0, stratify=y)
svc = svm.SVC(C=0.317, kernel="linear", probability=True)
svc.fit(X_train, y_train)
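A usage note: because y is now integer-encoded, predictions can be mapped back to the original user-id strings with the same encoder:

predicted_ids = enc.inverse_transform(svc.predict(X_test))  # encoded ints back to user ids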
I am using scikit-learn to preprocess text data. My aim is to get a vector representation of my data (features and label). What I have done is vectorize the features using TF-IDF; after that, the dataset doubled in size because X.ravel() is used:
X before (30376, 2)
X after (60752, 41331)
My problem is that I have two features in the X vector, and I want to get the vector representation correctly. How can I do it?
df = pd.read_csv('Dataset.csv',encoding='latin1')
df = df.dropna()
X = np.array(df.drop(['Type'], 1))
y = np.array(df['Type'])
#print(X)
print("Extracting features from the training data using a sparse vectorizer")
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X = vectorizer.fit_transform(X.ravel().astype('U'))
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
X = imp.fit_transform(X)
X.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
And if that is the case: when I use train_test_split, I get this error, and I don't understand what it is referring to:
TypeError: Singleton array array(TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=0.5, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
stop_words='english', strip_accents=None, sublinear_tf=True,
token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
vocabulary=None), dtype=object) cannot be considered a valid collection.
Any suggestions? Thanks.
Either X or y has a wrong shape. Here is an excerpt from validation.py, which is used for validating the passed data sets:
if hasattr(x, 'shape'):
    if len(x.shape) == 0:  # <----- NOTE !!!
        raise TypeError("Singleton array %r cannot be considered"
                        " a valid collection." % x)
    return x.shape[0]
else:
    return len(x)
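As for getting one TF-IDF row per sample from two text columns: rather than raveling them into twice as many rows, one common approach is to concatenate the two strings row-wise before vectorizing. A sketch, where col_a and col_b are placeholder names for the two text columns in Dataset.csv:

from sklearn.feature_extraction.text import TfidfVectorizer

# col_a / col_b are hypothetical column names; substitute your own
df['combined'] = df['col_a'].astype(str) + ' ' + df['col_b'].astype(str)
vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vec.fit_transform(df['combined'])  # one row per sample: shape (30376, n_terms)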
For a machine learning project, I'm trying to predict a categorical outcome variable using features extracted from text.
Using cross-validation, I split my X and Y into a test set and a training set. The training set is trained using a pipeline. However, when I compute the performance using X from my test set, the performance is 0.0, even though no features have been extracted from X_test yet.
Is it possible to split the dataset within the pipeline?
My code:
X, Y = read_data('development2.csv')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
train_pipeline = Pipeline([
    ('vect', CountVectorizer()),  # ngram_range=(1,2), analyzer='word'
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', OneVsRestClassifier(SVC(kernel='linear', probability=True))),
])
train_pipeline.fit(X_train, Y_train)
predicted = train_pipeline.predict(X_test)
print accuracy_score(Y_test, predicted)
The traceback when using SVC:
File "/Users/Robbert/Documents/pipeline.py", line 62, in <module>
train_pipeline.fit(X_train, Y_train)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 138, in fit
y = self._validate_targets(y)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 441, in _validate_targets
y_ = column_or_1d(y, warn=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 319, in column_or_1d
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (670, 5)
I solved the problem.
The target variable (Y) did not have the appropriate format: it was stored as binarized rows like [[0 0 0 0 1], [0 0 1 0 0]]. I converted it to a flat array of class labels like [5, 3].
This did the trick for me.
Thanks for all the answers.
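For reference, the conversion itself can be done with LabelBinarizer, which can invert its own encoding (a minimal sketch, where Y_labels stands for the original class values):

from sklearn import preprocessing

# Y_labels is a placeholder for the original flat class values
lb = preprocessing.LabelBinarizer()
Y_bin = lb.fit_transform(Y_labels)    # class labels -> rows like [0 0 0 0 1]
Y_flat = lb.inverse_transform(Y_bin)  # rows back to a flat array of labels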