I'm trying to use sklearn.dummy's DummyRegressor to create a baseline for my model, which is a regression model with encoded categorical variables predicting a continuous target. The baseline strategy I want is 'min', and I'd like the min by group. Below is a reproducible example. My actual dataset is larger: it is a collection of runners ('a' ids) racing on courses ('c' ids), with the time recorded for each performance as the target 'T'. I'm trying to see whether the model performs better than the runner's best/fastest recorded time (the min).
import pandas as pd
from sklearn import metrics
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame([['a1','c1',10],
['a1','c2',15],
['a1','c3',20],
['a1','c1',15],
['a2','c2',26],
['a4','c3',15],
['a2','c1',23],
['a2','c2',15],
['a3','c3',20],
['a3','c3',13],
['a1','c3',19],
['a4','c3',19],
['a3','c3',12],
['a3','c3',20]], columns=['aid','cid','T'])
X = pd.get_dummies(df, columns=['aid','cid'],prefix_sep='',prefix='')
X.drop(['T'], axis=1, inplace=True)
y = df['T']
# train test split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = LinearRegression()
Lin_model = regr.fit(X_train, y_train)
y_pred = Lin_model.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
For comparison I'd like to use DummyRegressor. With 'mean' as the strategy it works, and as I understand it, it uses the mean of the entire target column.
dummy_mean = DummyRegressor(strategy='mean')
dummy_mean.fit(X_train, y_train)
y_pred2 = dummy_mean.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred2))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred2))
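As a quick sanity check (not from the original post), strategy='mean' should predict the mean of y_train for every test row; a minimal sketch, reusing the variables above:
import numpy as np
assert np.allclose(y_pred2, np.full(len(y_test), y_train.mean()))  # every prediction equals y_train's mean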
To compare against the lowest T (the fastest/best time) I tried the 'constant' strategy and defined the constant as the min value by group:
min_value = df.groupby('aid').agg({'T': ['min']})
dummy_min = DummyRegressor(strategy='constant',constant = min_value)
dummy_min.fit(X_train, y_train)
y_pred4 = dummy_min.predict(X_test)
which returns
ValueError: could not broadcast input array from shape (1,3) into shape (3,1)
What am I missing?
When you use min_value = df.groupby('aid').agg({'T': ['min']}), the shape of the resulting DataFrame is (3, 1). Try changing it to min_value = df.groupby('aid').agg({'T': ['min']}).values.reshape(1, -1). Hope it helps.
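As an aside (a rough sketch of my own, not part of the answer above): if the goal is each runner's minimum rather than a single constant, the group-wise baseline can be scored directly without DummyRegressor, reusing df, y_test and the metrics import from the question:
best_time = df.groupby('aid')['T'].transform('min')  # every row gets its runner's fastest recorded time
y_pred_baseline = best_time.loc[y_test.index]         # align with the test split via the shared index
print('R-squared:', metrics.r2_score(y_test, y_pred_baseline))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_baseline))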
What I want to know is whether there is any difference between using model.decision_function() and taking all outputs > 0 as the minority class, versus using model.predict() to get the predicted labels directly.
For example, what is the difference between "ypred_predfunc" and "ypred_pred_decisionf" in the code below?
from sklearn.svm import SVC

def generate_clf(self, input_C, input_gamma, X_train, y_train, X_test, y_test):
    classifier = SVC(kernel='rbf', gamma=input_gamma, C=input_C)
    classifier.fit(X_train, y_train)
    ypred_predfunc = classifier.predict(X_test)
    decision_values = classifier.decision_function(X_test)
    ypred_pred_decisionf = []
    for i in decision_values:
        if i > 0:
            ypred_pred_decisionf.append(1)   # decision value > 0 -> label 1
        else:
            ypred_pred_decisionf.append(0)   # decision value <= 0 -> label 0
    return classifier, ypred_predfunc, ypred_pred_decisionf
Is the predict method using the 0-th hyperplane (the decision boundary) for prediction, as in the diagram below?
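For reference, a minimal sketch (assuming a binary SVC, i.e. exactly two classes) of how the two relate: in the binary case decision_function returns a 1-D array, and predict should return classes_[1] wherever that value is positive and classes_[0] otherwise, so the two label vectors are expected to match:
import numpy as np
decision_values = classifier.decision_function(X_test)
labels_from_decision = classifier.classes_[(decision_values > 0).astype(int)]
print(np.array_equal(labels_from_decision, classifier.predict(X_test)))  # expected: True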
So I've made a model for vehicle price prediction using linear regression, and now I need it to predict prices 5 years into the future. How can I do that with clf.predict?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = df[['Year','Engine','FuelType','Age','Transmission','Mileage']]
y = df['Price']
These are my X and y values.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
clf = LinearRegression()
clf.fit(X_train, y_train)
clf.predict(X_test)
print(clf.score(X_test, y_test))
After this I used
clf.predict([[2022,1300,0,5,0,10000]])
and I got array([-5722871.63724422]).
To get only positive values you can take the logarithm of the y values:
y_log = np.log(y_train.values, where=(y_train.values > 0))
fit the model on y_log, and then apply the inverse transformation to the model's predictions:
y_pred = np.exp(clf.predict(X_test))
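A sketch of the same idea using scikit-learn's TransformedTargetRegressor (my addition, assuming all prices in y are strictly positive):
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

clf = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
clf.fit(X_train, y_train)     # the target is log-transformed internally before fitting
y_pred = clf.predict(X_test)  # predictions are exponentiated back to the price scale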
I'm trying to use a gradient-boosting model to predict future scores in fantasy football, for now only looking at the 2 previous rounds. Currently, if a player is expected to score more than 6 points the model returns '1', otherwise '0', indicating whether the player would be a good captain choice or not.
In my original table I have player-name and round information to give context, but I removed these when training the algorithm. My question is: once the model makes a prediction, how can I show it in combination with the player name, for example:
PlayerA - captain prediction = 1
etc.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

y = ds.isCaptain
GB_table = ds.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(GB_table, y, test_size=0.2)
baseline = GradientBoostingClassifier(learning_rate=0.01,n_estimators=1500,max_depth=4, min_samples_split=40, min_samples_leaf=7,max_features=4 , subsample=0.95, random_state=10)
baseline.fit(X_train,y_train)
predictors=list(X_train)
feat_imp = pd.Series(baseline.feature_importances_, predictors).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Importance of Features')
plt.ylabel('Feature Importance Score')
print('Accuracy of GBM on test set: {:.3f}'.format(baseline.score(X_test, y_test)))
pred=baseline.predict(X_test)
print(classification_report(y_test, pred))
The above shows me the predicted results, but unfortunately, since I removed the player name and round information from GB_table, I can no longer tell which player/round each prediction belongs to.
I'm assuming you are using pandas DataFrames, in which case it's quite straightforward.
The index numbers in your X_train and X_test DataFrames will correspond to the index in your original 'ds' DataFrame.
Try:
pred = baseline.predict(X_test)
pred_original_data = ds.loc[X_test.index].copy()  # train_test_split keeps the original index labels
pred_original_data['prediction'] = pred
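For example, assuming 'Player' and 'Round' are columns in ds as in the question, you can then view each prediction next to its context:
print(pred_original_data[['Player', 'Round', 'prediction']])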
You could also drop the player column and the other fields after train_test_split. Here is my suggestion:
y = ds.isCaptain
X_train, X_test, y_train, y_test = train_test_split(ds, y, test_size=0.2)
baseline = GradientBoostingClassifier(learning_rate=0.01, n_estimators=1500,max_depth=4, min_samples_split=40, min_samples_leaf=7,max_features=4 , subsample=0.95, random_state=10)
baseline.fit(X_train.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1),y_train)
X_test_input = X_test.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1)
score = baseline.score(X_test_input, y_test)
print('Accuracy of GBM on test set: {:.3f}'.format(score))
X_test['prediction'] = baseline.predict(X_test_input)
print(classification_report(y_test, X_test['prediction']))
I am trying to train a linear regression model using sklearn to predict the likes of given tweets. I have the following features/attributes:
['id', 'month', 'hour', 'text', 'hasMedia', 'hasHashtag', 'followers_count', 'retweet_count', 'favourite_count', 'sentiment', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust', ......keywords............]
I use TfidfVectorizer to extract keywords. The problem is that, depending on the size of the training data, the number of keywords differs, and therefore the number of independent attributes differs. Because of this there is a mismatch of attributes between the training and testing data, and I get ValueError: Shape of passed values is (1, 1678), indices imply (1, 1928).
It works fine when I split the same data into train and test and predict with test as below.
Program for training and prediction
import os
import joblib
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def train_favourite_prediction(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    # now you can save it to a file
    joblib.dump(regressor, os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    return None

def predict_favourites(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()
    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    # and later you can load it
    regressor = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
    coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
    print(coeff_df)
    y_pred = regressor.predict(X_test)
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(df)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print("the large training just finished")
    return None
Code for fitting the vectorizer
Have a look at "Applying Tfidfvectorizer on list of pos tags gives ValueError" to understand the format of my 'text' column.
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer

def ready_for_training(dataset):
    dataset = dataset.head(1000)
    dataset['text'] = dataset.text.apply(lambda x: literal_eval(x))
    dataset['text'] = dataset['text'].apply(
        lambda row: [item for sublist in row for item in sublist])
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
    keyword_response = tfidf.fit_transform(dataset['text'])
    keyword_matrix = pd.DataFrame(keyword_response.todense(), columns=tfidf.get_feature_names())
    keyword_matrix = keyword_matrix.loc[:, (keyword_matrix != 0).any(axis=0)]
    dataset['sentiments'] = dataset['sentiments'].map(eval)
    dataset = pd.concat([dataset.drop(['sentiments'], axis=1), dataset['sentiments'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['neg', 'neu', 'pos'], axis=1)
    dataset['emotions'] = dataset['emotions'].map(eval)
    dataset = pd.concat([dataset.drop(['emotions'], axis=1), dataset['emotions'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['id', 'month', 'text'], axis=1)
    result = pd.concat([dataset, keyword_matrix], axis=1, sort=False)
    return result
What I need is to predict 'favourite_count' when a single new tweet is given. When I extract the keywords for this tweet I only get a few, while during training I trained with 1000+ keywords, and I have stored the trained model in a .pkl file. How should I handle this mismatch of attributes? To fill the missing columns in the testing tweet, as in "Keep same dummy variable in training and testing data", I may need the training set as a DataFrame; but I have only stored the trained model as a .pkl and cannot access the columns it was trained on.
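Not from the original post, just a sketch of a common pattern: persist the fitted TfidfVectorizer together with the model and call transform() (never fit_transform()) on new text, so a new tweet is mapped onto exactly the keyword columns seen during training. The pickle filename and the new_tweet_texts variable below are hypothetical, and new_tweet_texts must have the same token-list format as the training 'text' column:
import joblib

# at training time, save the fitted vectorizer next to the model
joblib.dump(tfidf, os.path.join(dirname, '../../knowledge_base/knowledge_tfidf.pkl'))

# at prediction time, load it and only transform the new tweet(s)
tfidf = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_tfidf.pkl'))
keyword_response = tfidf.transform(new_tweet_texts)  # same columns, same order as in training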
I'm following one of the kernels on Kaggle, namely a kernel for Credit Card Fraud Detection.
I reached the step where I need to perform KFold in order to find the best parameters for Logistic Regression.
The following code is shown in the kernel itself, but for some reason (probably because it was written for an older version of scikit-learn) it gives me some errors.
def printing_Kfold_scores(x_train_data, y_train_data):
    fold = KFold(len(y_train_data), 5, shuffle=False)

    # Different C parameters
    c_param_range = [0.01, 0.1, 1, 10, 100]

    results_table = pd.DataFrame(index=range(len(c_param_range), 2), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    # the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1]
    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('-------------------------------------------')
        print('')

        recall_accs = []
        for iteration, indices in enumerate(fold, start=1):
            # Call the logistic regression model with a certain C parameter
            lr = LogisticRegression(C=c_param, penalty='l1')

            # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
            # with indices[0]. We then predict on the portion assigned as the 'test cross validation' with indices[1]
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())

            # Predict values using the test indices in the training data
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)

            # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        # The mean value of those recall scores is the metric we want to save and get hold of.
        results_table.ix[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall score ', np.mean(recall_accs))
        print('')

    best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

    # Finally, we can check which C parameter is the best amongst the chosen.
    print('*********************************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c)
    print('*********************************************************************************')

    return best_c
The errors I'm getting are as follows:
for this line: fold = KFold(len(y_train_data),5,shuffle=False)
Error:
TypeError: __init__() got multiple values for argument 'shuffle'
If I remove the shuffle=False from this line, I get the following error:
TypeError: shuffle must be True or False; got 5
If I remove the 5 and keep the shuffle=False, I get the following error:
TypeError: 'KFold' object is not iterable
which is from this line: for iteration, indices in enumerate(fold,start=1):
If someone could help me solve this issue and suggest how it can be done with the latest version of scikit-learn, it would be very much appreciated.
Thanks.
That depends on how you imported KFold.
If you did this:
from sklearn.cross_validation import KFold
then your code should work, because that version takes 3 positional params: the length of the array, the number of splits, and shuffle.
But if you are doing this:
from sklearn.model_selection import KFold
then it will not work: you only need to pass the number of splits and shuffle (not the length of the array), and you also need to change the enumerate() loop.
By the way, model_selection is the new module and the recommended one to use. Try it like this:
fold = KFold(5, shuffle=False)
for train_index, test_index in fold.split(x_train_data):
    # Call the logistic regression model with a certain C parameter
    lr = LogisticRegression(C=c_param, penalty='l1')
    # Use the training data to fit the model. In this case, we use the portion of the fold to train the model
    lr.fit(x_train_data.iloc[train_index, :], y_train_data.iloc[train_index, :].values.ravel())
    # Predict values using the test indices in the training data
    y_pred_undersample = lr.predict(x_train_data.iloc[test_index, :].values)
    # Calculate the recall score and append it to a list for recall scores representing the current c_parameter
    recall_acc = recall_score(y_train_data.iloc[test_index, :].values, y_pred_undersample)
    recall_accs.append(recall_acc)
KFold is a splitter, so you have to give it something to split.
Example code:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1,1,1,1], [2,2,2,2], [3,3,3,3], [4,4,4,4]])
y = np.array([1, 2, 3, 4])
# Now you create your KFold; you just have to pass the number of splits and whether to shuffle.
fold = KFold(2, shuffle=False)
# To iterate over the folds, just use split
for train_index, test_index in fold.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Then fit the classifier as usual
If you want the index of each train/test iteration, just add enumerate (note the parentheses around the index pair):
for i, (train_index, test_index) in enumerate(fold.split(X)):
    print('Iteration:', i)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
I hope this works