My issue is similar to this question. But I didn't get the answer. I need further clarification.
I am using sklearn linear regression prediction -for the first time- to add more data points to my dataset. Adding more data points will help me identify outliers more accurately. I have built my model and got the predictions but I want the model to return predicted points with a certain range. Is it possible to achieve this?
I would like to predict values in a column called 'delivery_fee'.
The values in the column starts from 3 and increases steadily until it reaches 27.
The last value in the column and it comes right after 27 is 47.
I would like the model to predict values between 27 and 47.
my code:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
#create a copy of the dataframe
delivery_linreg = outlierFileNew.copy()
le = preprocessing.LabelEncoder()
delivery_linreg['branch_code'] = le.fit_transform(delivery_linreg['branch_code'])
#select all columns in the datframe except for delivery_fee
x = delivery_linreg[[x for x in delivery_linreg.columns if x != 'delivery_fee']]
#selecting delivery_fee as the column to be predicted
y = delivery_linreg.delivery_fee
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
#fitting simple linear regression to training set
linreg = LinearRegression()
linreg.fit(x_train,y_train)
delivery_predict = linreg.predict(x_test)
My model returns values that range from 4 to 17. Which is not the range I want. Any suggestions on how to change the predicted range?
Thank you,
Related
I'm trying to create a force_plot for my Random Forest model that has two classes (1 and 2), but I am a bit confused about the parameters for the force_plot.
I have two different force_plot parameters I can provide the following:
shap.force_plot(explainer.expected_value[0], shap_values[0], choosen_instance, show=True, matplotlib=True)
expected and shap values: 0
shap.force_plot(explainer.expected_value[1], shap_values[1], choosen_instance, show=True, matplotlib=True)
expected and shap values: 1
So my questions are:
When creating the force_plot, I must supply expected_value. For my model I have two expected values: [0.20826239 0.79173761], how do I know which to use? My understanding of expected value is that it is the average prediction of my model on train data. Are there two values because I have both class_1 and class_2? So for class_1, the average prediction is 0.20826239 and class_2, it is 0.79173761?
The next parameter is the shap_values, for my chosen instance:
index B G R Prediction
113833 107 119 237 2
I get the following SHAP_values:
[array([[ 0.01705462, -0.01812987, 0.23416978]]),
array([[-0.01705462, 0.01812987, -0.23416978]])]
I don't quite understand why I get two sets of SHAP values? Is one for class_1 and one for class_2? I have been trying to compare the images I attached, given both sets of SHAP values and expected value but I can't really explain what is going on in terms of the prediction.
Let's try reproducible:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from shap.maskers import Independent
from scipy.special import expit, logit
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
Then, your SHAP expected values are:
masker = Independent(data = X_train)
explainer = TreeExplainer(model, data=masker)
ev = explainer.expected_value
ev
array([0.35468973, 0.64531027])
This is what your model would predict on average given background dataset (fed to explainer above):
model.predict_proba(masker.data).mean(0)
array([0.35468973, 0.64531027])
Then, if you have a datapoint of interest:
data_to_explain = X_train[[0]]
model.predict_proba(data_to_explain)
array([[0.00470234, 0.99529766]])
You can achieve exactly the same with SHAP values:
sv = explainer.shap_values(data_to_explain)
np.array(sv).sum(2).ravel()
array([-0.34998739, 0.34998739])
Note, they are symmetrical, because what increase chances towards class 1 decreases chances for 0 by the same amount.
With base values and SHAP values, the probabilities (or chances for a data point to end up in leaf 0 or 1) are:
ev + np.array(sv).sum(2).ravel()
array([0.00470234, 0.99529766])
Note, this is same as model predictions.
I am currently using the scikit learn module in order to help with a crime prediction problem. I am having an issue batch coding the entire Dataframe that I have with the knn.predict method.
How can I batch code the entire two columns of my Dataframe with the knn.predict() method in order to store in another Dataframe the output?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
knn_df = pd.read_csv("/Users/helenapunset/Desktop/knn_dataframe.csv")
# x is the set of features
x = knn_df[['latitude', 'longitude']]
# y is the target variable
y = knn_df['Class']
# train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
# training the data
knn.fit(x_train,y_train)
# test score was approximately 69%
knn.score(x_test,y_test)
# this is predicted to be a safe zone
crime_prediction = knn.predict([[25.787882, -80.358427]])
print(crime_prediction)
In the last line of the code I was able to add the two features I am using which are latitude and longitude from my Dataframe labeled knn_df. But, this is a single point I have been searching through the documentation on a process for streamlining this knn prediction for the entire Dataframe and cannot seem to find a way to do this. Is there somehow a possibility of using a for loop for this?
Let the new set to be predicted is 'knn_df_predict'. Assuming same column names,try the following lines of code :
x_new = knn_df_predict[['latitude', 'longitude']] #formating features
crime_prediction = knn.predict(x_new) #predicting for the new set
knn_df_predict['prediction'] = crime_prediction #Adding the prediction to dataframe
I have a dataset that shows whether a person has diabetes based on indicators, it looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (that need to be predicted), that is to say, should I upload the file again? Let's say I want to predict the folowing:
Rows to predict:
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means that your prediction variable will come from both data that have already been used for model fitting (i.e. the X_train part) as well as from data unseen by the model during training (the X_test part). Not quite sure what your final objective is (and neither this is what the question is about), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new to predict the outcome, you do it in a similar way; always assuming that X_new has the same features with X (i.e. again removing the Outcome column as you have done with X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
I am trying to perform linear regression to predict an outcome y based on about ten independent variables X. I have 125 data points with data for y and each of X.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
X_array = X.values.reshape(-1,X.shape[1]) #converting dataframe into array
#125 x 9
y_array = y.values.reshape(-1,1)
#125 x 1
#Split data
X_train, X_test, y_train, y_test = train_test_split(X_array, y_array, test_size=0.2, random_state=100)
#define linear regression and fit to the data
regressor = LinearRegression()
regressor.fit(X_train, y_train)
r_sqr = regressor.score(X_test,y_test,sample_weight=None)
The problem is that when I run this code for just one variable within X (0.52), I get a better R-squared score than when I run it for all of X (0.33). How can this be? Shouldn't I get better performance of the model with more independent variables?
I've tried r2_score function and tried running train_test_split on the data in data frame for instead of array, but no changes.
I'm trying to learn scikit-learn and Machine Learning by using the Boston Housing Data Set.
# I splitted the initial dataset ('housing_X' and 'housing_y')
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(housing_X, housing_y, test_size=0.25, random_state=33)
# I scaled those two datasets
from sklearn.preprocessing import StandardScaler
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train)
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test)
# I created the model
from sklearn import linear_model
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=42)
train_and_evaluate(clf_sgd,X_train,y_train)
Based on this new model clf_sgd, I am trying to predict the y based on the first instance of X_train.
X_new_scaled = X_train[0]
print (X_new_scaled)
y_new = clf_sgd.predict(X_new_scaled)
print (y_new)
However, the result is quite odd for me (1.34032174, instead of 20-30, the range of the price of the houses)
[-0.32076092 0.35553428 -1.00966618 -0.28784917 0.87716097 1.28834383
0.4759489 -0.83034371 -0.47659648 -0.81061061 -2.49222645 0.35062335
-0.39859013]
[ 1.34032174]
I guess that this 1.34032174 value should be scaled back, but I am trying to figure out how to do it with no success. Any tip is welcome. Thank you very much.
You can use inverse_transform using your scalery object:
y_new_inverse = scalery.inverse_transform(y_new)
Bit late to the game:
Just don't scale your y. With scaling y you actually loose your units. The regression or loss optimization is actually determined by the relative differences between the features. BTW for house prices (or any other monetary value) it is common practice to take the logarithm. Then you obviously need to do an numpy.exp() to get back to the actual dollars/euros/yens...