my simple prediction with linear regression won't execute - python

Below is my trial code:
from sklearn import linear_model
# plt.title("Time-independent variant student performance analysis")
x_train = [5, 9, 33, 25, 4]
y_train = [35, 2, 14, 9, 7]
x_test = [14, 2, 8, 1, 11]
# create linear regression object
linear = linear_model.LinearRegression()
#train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
# predict output
predicted = linear.predict(x_test)
When run, this is the output:
ValueError: Found arrays with inconsistent numbers of samples: [1 5]

Redefine your arrays so each sample is its own row:
x_train = [[5],[9],[33],[25],[4]]
y_train = [35,2,14,9,7]
x_test = [[14],[2],[8],[1],[11]]
From the documentation of fit(X, y): X : numpy array or sparse matrix of shape [n_samples, n_features]
In your case, every example has only one feature.
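Putting the fix together, a minimal runnable version of the corrected script (same numbers as above; wrapping each x value in its own list is the only real change):

from sklearn import linear_model

# each training sample is a row with a single feature
x_train = [[5], [9], [33], [25], [4]]
y_train = [35, 2, 14, 9, 7]
x_test = [[14], [2], [8], [1], [11]]

# create the linear regression object and train it
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
print(linear.score(x_train, y_train))  # R^2 on the training data

# predict output for the test samples
predicted = linear.predict(x_test)
print(predicted)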


Using GridSearchCV best_params_ gives poor results

I'm trying to tune hyperparameters for KNN on a quite small dataset (Kaggle Leaf, which has around 990 rows):
def knnTuning(self, x_train, t_train):
    params = {
        'n_neighbors': [1, 2, 3, 4, 5, 7, 9],
        'weights': ['uniform', 'distance'],
        'leaf_size': [5, 10, 15, 20]
    }
    grid = GridSearchCV(KNeighborsClassifier(), params)
    grid.fit(x_train, t_train)
    print(grid.best_params_)
    print(grid.best_score_)
    return knn.KNN(neighbors=grid.best_params_["n_neighbors"],
                   weight=grid.best_params_["weights"],
                   leafSize=grid.best_params_["leaf_size"])
Prints:
{'leaf_size': 5, 'n_neighbors': 1, 'weights': 'uniform'}
0.9119999999999999
And I return this classifier
class KNN:
    def __init__(self, neighbors=1, weight='uniform', leafSize=10):
        self.clf = KNeighborsClassifier(n_neighbors=neighbors,
                                        weights=weight, leaf_size=leafSize)

    def train(self, X, t):
        self.clf.fit(X, t)

    def predict(self, x):
        return self.clf.predict(x)

    def global_accuracy(self, X, t):
        predicted = self.predict(X)
        accuracy = (predicted == t).mean()
        return accuracy
I run this several times using 700 rows for training and 200 for validation, chosen with a random permutation.
I then get global accuracy results ranging from 0.01 (often) to 0.4 (rarely).
I know that I'm not comparing the same two metrics, but I still can't understand the huge gap between the results.
I'm not sure how you trained your model or how the preprocessing was done. The Leaf dataset has about 100 labels (species), so you have to take care when splitting into train and test sets to ensure an even distribution of samples per class. One reason for the odd accuracy could be that your samples are split unevenly.
Also you would need to scale your features:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("https://raw.githubusercontent.com/WenjinTao/Leaf-Classification--Kaggle/master/train.csv")
le = LabelEncoder()
scaler = StandardScaler()

X = df.drop(['id', 'species'], axis=1)
X = scaler.fit_transform(X)   # X is now a numpy array
y = le.fit_transform(df['species'])

# one stratified split so every species is represented in both sets
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0).split(X, y)
train_idx, test_idx = next(strat)
x_train, y_train = X[train_idx], y[train_idx]
x_test, y_test = X[test_idx], y[test_idx]
Then we do the training; I would be careful about including n_neighbors=1, since with ~100 classes a single nearest neighbor overfits easily:
params = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'leaf_size': [5, 10, 15, 20]
}
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), params, cv=sss)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
{'leaf_size': 5, 'n_neighbors': 2, 'weights': 'distance'}
0.9676258992805755
Then you can check on your test set:
pred = grid.predict(x_test)
(y_test == pred).mean()
0.9831649831649831
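Tying the two pieces together, a short sketch (my addition, reusing the KNN wrapper from the question and the scaled, stratified split from above): with these fixes, the wrapper's own global_accuracy should land close to the cross-validated score instead of near 0.01.

# plug the tuned parameters back into the question's KNN wrapper
model = KNN(neighbors=grid.best_params_["n_neighbors"],
            weight=grid.best_params_["weights"],
            leafSize=grid.best_params_["leaf_size"])
model.train(x_train, y_train)

# same computation as the question's global_accuracy: a plain held-out accuracy
print(model.global_accuracy(x_test, y_test))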

Python Scikit-Learn RandomizedSearchCV with custom scoring functions

I am using Scikit-Learn's Random Forest Regressor, Pipeline, and RandomizedSearchCV to predict the target variable using some features in my dataset. I need to use my own custom scoring functions that calculate weighted scores using weights (signifying the importance of observations) from the dataset. My code seems to work but I am getting a warning when the grid runs:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for examples using ravel(). self.__final_estimator.fit(Xt, y, **fit_params)
This is related to .fit(X_train, y_train). Based on this warning, if I change the code to .fit(X_train, y_train.values.ravel()) then I cannot get my weighted scores to work. I have tried editing the code in different/appropriate ways to get the weighted scores to work but to no avail.
I am including my code below that runs on a test data in test.csv. The file has four columns: two feature columns ('x1', 'x2'), target ('y') and weight ('weight') columns. The custom scoring functions below are simple functions that calculate weighted rmse_score and mean_abs_error_score. How can I use .fit(X_train, y_train.values.ravel()) and still compute the scores?
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def rmse_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    rmse = np.sqrt(np.mean(weight*(y_true.values-y_pred)**2))
    return rmse

def mean_abs_error_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    mae = np.mean(weight*np.absolute(y_true.values-y_pred))
    return mae
#---- reading data
heart_df = pd.read_csv('data\\test.csv')
#---- splitting into training & testing sets
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)
X_train_weights = heart_df['weight'].loc[X_train.index.values]
params = {"weight": X_train_weights}
my_scorer1 = make_scorer(rmse_score, greater_is_better=False, **params)
my_scorer2 = make_scorer(mean_abs_error_score, greater_is_better=False, **params)
#---- random forest training with hyperparameter tuning
pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
random_grid = {"rfr__n_estimators": [10, 100, 500, 1000],
               "rfr__max_depth": [10, 20, 30, 40, 50, None],
               "rfr__max_features": [0.25, 0.50, 0.75],
               "rfr__min_samples_split": [5, 10, 20],
               "rfr__min_samples_leaf": [3, 5, 10],
               "rfr__bootstrap": [True, False]}
rfr_cv = skms.RandomizedSearchCV(pipe,
                                 param_distributions=random_grid,
                                 n_iter=15,
                                 cv=3,
                                 verbose=3,
                                 scoring={'rmse': my_scorer1, 'mae': my_scorer2},
                                 refit='rmse',
                                 random_state=42,
                                 n_jobs=-1)
rfr_cv.fit(X_train, y_train)
best_params = rfr_cv.best_params_
best_score = rfr_cv.best_score_
print(f'best hyperparameters = {best_params}')
print(f'best score = {best_score}')
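As an aside (my addition, not part of the question): scikit-learn's built-in regression metrics accept a sample_weight argument, which makes a handy cross-check for the custom functions above. Note the normalization differs: sample_weight averages with the weight sum in the denominator, while the functions above divide by the number of samples.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# toy arrays, purely illustrative
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
w = np.array([0.1, 0.3, 0.4, 0.2])

# weighted RMSE: square root of the weighted average of squared errors
rmse = np.sqrt(mean_squared_error(y_true, y_pred, sample_weight=w))
# weighted MAE: weighted average of absolute errors
mae = mean_absolute_error(y_true, y_pred, sample_weight=w)
print(rmse, mae)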

When I try to fit a scikit-learn model with 1 more feature, I get this error: "ValueError: Found input variables with inconsistent numbers of samples"

I have this code working fine
df_amazon = pd.read_csv ("datasets/amazon_alexa.tsv", sep="\t")
X = df_amazon['variation'] # the features we want to analyze
ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)
# Create pipeline using Bag of Words
pipe = Pipeline([('cleaner', predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])
pipe.fit(X_train,y_train)
But if I try to add 1 more feature to the model, replacing
X = df_amazon['variation']
by
X = df_amazon[['variation','verified_reviews']]
I have this error message from Sklearn when I call fit:
ValueError: Found input variables with inconsistent numbers of samples: [2, 2205]
So fit works when X_train and y_train have the shapes
(2205,) and (2205,).
But not when the shapes are changed to
(2205, 2) and (2205,).
What's the best way to deal with that?
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(data=[['Heather Gray Fabric', 'I received the echo as a gift.', 1],
                        ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0]],
                  columns=['variation', 'review', 'feedback'])
vect = CountVectorizer()
vect.fit_transform(df[['variation', 'review']])

# now look at the vocabulary that has been created
print(vect.vocabulary_)
# output: a feature was generated for each column name only,
# not for the contents of the columns
{'variation': 1, 'review': 0}

# so you need to build one column that contains both variation and review,
# and pass that into your model
df['variation_review'] = df['variation'] + df['review']
vect.fit_transform(df['variation_review'])
print(vect.vocabulary_)
{'heather': 8,
'gray': 6,
'fabrici': 3,
'received': 9,
'the': 11,
'echo': 2,
'as': 0,
'gift': 5,
'sandstone': 10,
'fabricwithout': 4,
'having': 7,
'cellphone': 1}
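An alternative worth sketching (my addition, not from the original answers): keep the two text columns separate and give each its own vectorizer via ColumnTransformer, so the two vocabularies don't mix.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(data=[['Heather Gray Fabric', 'I received the echo as a gift.', 1],
                        ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0]],
                  columns=['variation', 'review', 'feedback'])

# each text column gets its own CountVectorizer; the column is given as a
# plain string (not a list) so the vectorizer receives a 1-D array of strings
ct = ColumnTransformer([('variation_bow', CountVectorizer(), 'variation'),
                        ('review_bow', CountVectorizer(), 'review')])
X = ct.fit_transform(df)
print(X.shape)  # (2, total number of terms across both vocabularies)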
The data must have a shape (n_samples, n_features). Try to transpose X (X.T).

How to apply standardization to train and test datasets

Let's say I have a 10-feature dataset X of shape [100, 10] and a target dataset y of shape [100, 1].
For example, after splitting the two with sklearn.model_selection.train_test_split I obtained:
X_train: [70, 10]
X_test: [30, 10]
y_train: [70, 1]
y_test: [30, 1]
What is the correct way of applying standardization?
I've tried with:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
but then if I predict with a model and try to invert the scaling to look at the MAE, I get an error:
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(X_train_std, y_train)
y_pred_std = lr.predict(X_test_std)
y_pred = scaler.inverse_transform(y_pred_std) # error here
I also have another question: since I have the target values, should I use
scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train, y_train)
X_test_std = scaler.transform(X_test)
instead of the first code block?
Do I have to apply the transformation to the y_train and y_test datasets as well? I am a bit confused.
StandardScaler is supposed to be used on the feature matrix X only.
So all the fit, transform and inverse_transform methods just need your X.
Note that after you fit the scaler, you can access the following attributes:
mean_: mean of each feature in X_train
scale_: standard deviation of each feature in X_train
The transform method computes (X[i, col] - mean_[col]) / scale_[col] for each sample i, whereas the inverse_transform method computes X[i, col] * scale_[col] + mean_[col] for each sample i.
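A minimal sketch of the usual pattern (my own toy data and variable names): fit the scaler on X_train only and leave y unscaled, so predictions already come out in the original units and no inverse_transform is needed.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(100, 10)                        # 100 samples, 10 features
y = X @ rng.rand(10) + 0.1 * rng.randn(100)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)             # fit on training features only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

lr = LinearRegression().fit(X_train_std, y_train)  # y stays in its original units
y_pred = lr.predict(X_test_std)                    # predictions are in y's units
print(np.mean(np.abs(y_test - y_pred)))            # MAE, no inverse_transform needed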

How to nest LabelKFold?

I have a dataset with ~300 points and 32 distinct labels and I want to evaluate a LinearSVR model by plotting its learning curve using grid search and LabelKFold validation.
The code I have looks like this:
import numpy as np
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import LabelKFold
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
...
#get data (x, y, labels)
...
C_space = np.logspace(-3, 3, 10)
epsilon_space = np.logspace(-3, 3, 10)
svr_estimator = Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("svr", LinearSVR()),
])
search_params = dict(
    svr__C=C_space,
    svr__epsilon=epsilon_space
)
kfold = LabelKFold(labels, 5)
svr_search = GridSearchCV(svr_estimator, param_grid = search_params, cv = ???)
train_space = np.linspace(.5, 1, 10)
train_sizes, train_scores, valid_scores = learning_curve(svr_search, x, y, train_sizes = train_space, cv = ???, n_jobs = 4)
...
#plot learning curve
My question is how to set up the cv attribute for the grid search and learning curve so that it breaks my original set into training and test sets that don't share any labels for computing the learning curve. And then, from those training sets, how do I further separate them into training and test sets without sharing labels for the grid search?
Essentially, how do I run a nested LabelKFold?
I, the user who created the bounty for this question, wrote the following reproducible example using data available from sklearn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, LabelKFold
digits = load_digits()
X = digits['data']
Y = digits['target']
Z = np.zeros_like(Y) ## this is just to make a 2-class problem, purely for the sake of an example
Z[np.where(Y>4)]=1
strata = [x % 13 for x in xrange(Y.size)] # define the strata for use in LabelKFold
## define stuff for nested cv...
mtry = [5, 10]
tuned_par = {'max_features': mtry}
toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
                                class_weight="balanced")
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)
## define outer k-fold label-aware cv
outer_cv = LabelKFold(labels=strata, n_folds=5)
#############################################################################
## this works: using regular randomly-allocated 10-fold CV in the inner folds
#############################################################################
vanilla_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                           cv=5, n_jobs=1)
vanilla_results = cross_val_score(vanilla_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)
##########################################################################
## this does not work: attempting to use label-aware CV in the inner loop
##########################################################################
inner_cv = LabelKFold(labels=strata, n_folds=5)
nested_kfold_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                                cv=inner_cv, n_jobs=1)
nested_kfold_results = cross_val_score(nested_kfold_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)
From your question, you are looking for the LabelKFold score on your data, while grid-searching the parameters of your pipeline in each of the iterations of this outer LabelKFold, again using a LabelKFold. Although I was not able to achieve that out of the box, it takes only one loop:
outer_cv = LabelKFold(labels=strata, n_folds=3)
strata = np.array(strata)
scores = []
for outer_train, outer_test in outer_cv:
    print "Outer set. Train:", set(strata[outer_train]), "\tTest:", set(strata[outer_test])
    inner_cv = LabelKFold(labels=strata[outer_train], n_folds=3)
    print "\tInner:"
    for inner_train, inner_test in inner_cv:
        print "\t\tTrain:", set(strata[outer_train][inner_train]), "\tTest:", set(strata[outer_train][inner_test])
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer, cv=inner_cv, n_jobs=1)
    clf.fit(X[outer_train], Z[outer_train])
    scores.append(clf.score(X[outer_test], Z[outer_test]))
Running the code, the first iteration yields:
Outer set. Train: set([0, 1, 4, 5, 7, 8, 10, 11]) Test: set([9, 2, 3, 12, 6])
Inner:
Train: set([0, 10, 11, 5, 7]) Test: set([8, 1, 4])
Train: set([1, 4, 5, 8, 10, 11]) Test: set([0, 7])
Train: set([0, 1, 4, 8, 7]) Test: set([10, 11, 5])
Hence, it is easy to verify that it executes as intended. Your cross-validation scores end up in the list scores, and you can process them however you like. I have used the variables (e.g., strata) that you defined in your last piece of code.
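For readers on current scikit-learn (my addition): sklearn.cross_validation and sklearn.grid_search were removed long ago, and LabelKFold became GroupKFold in sklearn.model_selection. A minimal sketch of the same one-loop nesting with the modern API, assuming the X, Z, strata, toy_rf and tuned_par objects defined above:

import numpy as np
from sklearn.model_selection import GroupKFold, GridSearchCV

strata = np.asarray(strata)
outer_cv = GroupKFold(n_splits=3)
scores = []
for outer_train, outer_test in outer_cv.split(X, Z, groups=strata):
    # inner splits are computed on the outer-training portion only,
    # again keeping samples that share a group label together
    inner_splits = GroupKFold(n_splits=3).split(
        X[outer_train], Z[outer_train], groups=strata[outer_train])
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par,
                       scoring='roc_auc', cv=list(inner_splits))
    clf.fit(X[outer_train], Z[outer_train])
    scores.append(clf.score(X[outer_test], Z[outer_test]))
print(np.mean(scores))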
