How can I graph using matplotlib/scikit-learn? - python

I am trying some code to make a learning curve:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
estimator = LinearRegression()
estimator.fit(X_train, y_train)
y_predicted = estimator.predict(X_test)
fig = plt.figure()
plt.xlabel("Data")
plt.ylabel("MSE")
plt.ylim(-4, 14)
plt.scatter(X_train.ravel(), y_train, color='green')  # <<<<<<< ERROR HERE
plt.plot(X_test.ravel(), y_predicted, color = 'blue')
plt.show()
This results in:
ValueError: x and y must be the same size
Printing the shapes of X_train and y_train outputs:
(1317, 11)
(1317,)
How can I fix this?

The problem is that you are trying to plot an 11-dimensional variable (x) against a 1-dimensional variable (y). You say you are trying to plot a learning curve, which implies you are training a model iteratively and showing the error after each iteration (or every few iterations). But that is not what you are plotting: you train the model fully, then try to plot the flattened inputs against the predictions. This won't work. You need to rethink what you are trying to achieve here.
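If what you actually want is a learning curve in the scikit-learn sense (error as a function of training-set size), a minimal sketch could look like the following, assuming X and y are the same arrays you passed to train_test_split:
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

# Score the model on increasing fractions of the training data
train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y, cv=5,
    scoring='neg_mean_squared_error',
    train_sizes=np.linspace(0.1, 1.0, 10))

# Negate because scikit-learn reports negative MSE for "higher is better" scoring
plt.plot(train_sizes, -train_scores.mean(axis=1), label='train MSE')
plt.plot(train_sizes, -test_scores.mean(axis=1), label='validation MSE')
plt.xlabel('Training set size')
plt.ylabel('MSE')
plt.legend()
plt.show()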

As already mentioned, you are trying to plot the response variable against 11 features on a 2D grid, which clearly isn't going to work. None of the following suggestions will achieve what you are attempting, since your model isn't learning iteratively; instead you split, trained, and tested. However, if you merely want to plot each successive feature against your response, you could do something like the following (I used pandas to organize my data):
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Fake data: 1317 daily observations of 11 features
data = pd.DataFrame(np.random.normal(0, 1, (1317, 11)),
                    index=pd.date_range(end=dt.datetime.utcnow(),
                                        periods=1317, freq='D'))
features = ['feature_{}'.format(i) for i in range(len(data.columns))]
data.columns = features
data['result'] = data.mean(1) + np.random.randn()

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111)
for feature in features:
    ax.scatter(data[feature], data['result'], color=np.random.rand(3))
Although I would probably just scatter your model's predictions (y_predicted) against y_test to visually validate the model.
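A minimal sketch of that predicted-vs-actual plot, assuming the y_test and y_predicted variables from the question's code:
import matplotlib.pyplot as plt

plt.scatter(y_test, y_predicted, alpha=0.5)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], color='red')  # perfect-prediction line
plt.xlabel('Actual y')
plt.ylabel('Predicted y')
plt.show()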

Related

XGBoost can't predict a simple sinusoidal function

I created a very simple function to test XGBoost.
X is an array of 1000 values evenly spaced between 0 and 7*np.pi (one value per row).
Y is simply 1 + 0.5*np.sin(x).
I split the dataset into 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, the model generalizes well if shuffle=True
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# XGB metrics
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I trained the model on the first 800 rows and then predicted the next 200 rows.
I was expecting the test data to have a good (low) RMSE, but it did not happen.
I was surprised to see that XGBoost simply repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate" - making predictions for x values that are greater than x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach:
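A minimal sketch of that approach, assuming a LightGBM version recent enough to support the linear_tree option (roughly 3.x and later); linear_tree fits a linear model in each leaf instead of a constant, which lets the ensemble extrapolate:
from lightgbm import LGBMRegressor

# linear_tree=True fits a linear model in each leaf instead of a constant,
# so predictions can keep rising or falling outside the training range
lgbm_reg = LGBMRegressor(linear_tree=True, random_state=42)
lgbm_reg.fit(X_train, y_train)
y_lgbm = lgbm_reg.predict(X_test)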
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (by default it has n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, then it basically functions as a giant look-up table. When extrapolating towards +Inf, then the "closest match" is at x = 17.5; when extrapolating towards -Inf, then the "closest match" is at x = 0.0.
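A quick way to see that effect, assuming the same X_train, y_train and X from the question, is to refit with deliberately small capacity:
from xgboost import XGBRegressor

# A deliberately small ensemble makes the piecewise-constant, step-like
# shape of the tree predictions visible inside the training range
small_reg = XGBRegressor(n_estimators=10, max_depth=2, random_state=42)
small_reg.fit(X_train, y_train)
y_small = small_reg.predict(X)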

My train/test model is returning an error; are a train/test model and a normal linear regression model two separate models?

I recently attended a class where the instructor was teaching us how to create a linear regression model using Python. Here is my linear regression model:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.metrics import r2_score
#Define the path for the file
path=r"C:\Users\H\Desktop\Files\Data.xlsx"
#Read the file into a dataframe ensuring to group by weeks
df=pd.read_excel(path, sheet_name = 0)
df=df.groupby(['Week']).sum()
df = df.reset_index()
#Define x and y
x=df['Week']
y=df['Payment Amount Total']
#Draw the scatter plot
plt.scatter(x, y)
plt.show()
#Now we draw the line of linear regression
#First we want to look for these values
slope, intercept, r, p, std_err = stats.linregress(x, y)
#We then create a function
def myfunc(x):
    # Below is y = mx + c
    return slope * x + intercept
#Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
#We plot the scatter plot and line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
#We print the value of r
print(r)
#We predict what the cost will be in week 23
print(myfunc(23))
The instructor said we now must use the train/test model to determine how accurate the model above is. This confused me a little as I understood it to mean we will further refine the model above. Or, does it simply mean we will use:
a normal linear regression model
a train/test model
and compare the r values and the predicted values the two different models yield? Is the train/test model considered a regression model?
I tried to create the train/test model but I'm not sure if it's correct (the packages were imported from the above example). When I run the train/test code I get the following error:
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
Here is the full code:
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
#I display the training set:
plt.scatter(train_x, train_y)
plt.show()
#I display the testing set:
plt.scatter(test_x, test_y)
plt.show()
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
myline = np.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
#Let's look at how well my training data fit in a polynomial regression?
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
#Now we want to test the model with the testing data as well
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
#Now we can use this model to predict new values:
#We predict what the total amount would be on the 23rd week:
print(mymodel(23))
You had better split into train and test sets using sklearn's train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Where X is your features dataframe and y is the column of your labels. The 0.2 stands for an 80% train / 20% test split.
BTW, the error you are describing could be because your dataframe has 80 rows or fewer, leaving x[80:] empty.
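A minimal sketch of that split applied to the question's weekly data, keeping the question's variable names and its degree-4 polynomial model; x and y are assumed to be the Series built above:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Random 80/20 split instead of slicing the first/last rows by position
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=0)

mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))

print(r2_score(train_y, mymodel(train_x)))  # fit on data the model has seen
print(r2_score(test_y, mymodel(test_x)))    # fit on held-out data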

multiple linear regression, training dataset graphs, ValueError: x and y must be the same size

I am running the following code; the graph for the training dataset is giving an error:
import pandas as pd
import numpy as np
df = pd.read_csv('11.csv')
df.head()
AT V AP RH PE
0 8.34 40.77 1010.84 90.01 480.48
1 23.64 58.49 1011.40 74.20 445.75
2 29.74 56.90 1007.15 41.91 438.76
3 19.07 49.69 1007.22 76.79 453.09
4 11.80 40.66 1017.13 97.20 464.43
x = df.drop(['PE'], axis = 1).values
y = df['PE'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
ml = LinearRegression()
ml.fit(x_train, y_train)
y_pred = ml.predict(x_test)
print(y_pred)
import matplotlib.pyplot as plt
plt.scatter(x_train, y_train, color = 'red')
plt.plot(x_train, ml.predict(x_test), color = 'green')
plt.show()
Please help me reshape the 2D array to 1D for plotting the graphs.
ValueError: x and y must be the same size
EDIT: Now that your question has its format fixed, I'm spotting a few errors, with a theme of using 1D linear regression code to plot your multiple regression problem.
plt.scatter(x_train, y_train, color = 'red'): You're trying to plot multiple variables (AT, V, AP, RH) on one axis using x_train. You cannot do this since this is multiple linear regression. (For example, one can't put pressure and volume on the x-axis against temperature on the y-axis; what would the x-axis represent? It doesn't make sense.) I cannot give you concrete suggestions since I don't know what you're trying to plot. You can try one variable at a time, e.g. plt.scatter(x_train[:, 0], y_train, color='red') for AT (x_train is a NumPy array here, so columns are selected by position). Or you could use a different color to plot each variable on the same graph, though I don't recommend this since your x-axis would mix different units.
plt.plot(x_train, ml.predict(x_test)): You should be using y_test for your x-input, e.g. plt.plot(y_test, ml.predict(x_test)). This is a problem with the length of your data, not with the width/columns like the error above. Though if my suggestion isn't what you wanted (it is a little strange to plot y_test against your y predictions), you might be (incorrectly) carrying over assumptions/code from 1D linear regression while working with multiple linear regression, a recurring theme in these errors.
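A minimal sketch of that actual-vs-predicted comparison, assuming the x_test, y_test and ml variables defined in the question:
import matplotlib.pyplot as plt

y_pred = ml.predict(x_test)

# Compare actual vs predicted PE values on the test set
plt.scatter(y_test, y_pred, color='green', alpha=0.5)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()], color='red')  # perfect-prediction line
plt.xlabel('Actual PE')
plt.ylabel('Predicted PE')
plt.show()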

scikit-learn - can a LinearRegression() learn something different to a straight line using one feature?

I am using scikit-learn's LinearRegression() with time-series data like
time_in_s value
1539015300000000000 2.061695
1539016200000000000 40.178125
1539017100000000000 12.276094
...
As it is a univariate case, I expect my model to be a straight line like y = m*x + c. When I do
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(df.time_in_s,
                                                     df.value,
                                                     test_size=0.3,
                                                     random_state=0,
                                                     shuffle=False)
regressor = LinearRegression().fit(X_train, y_train)
y_pred_train = regressor.predict(X_train)
y_pred_test = regressor.predict(X_test)
[...]
I get a straight line, as expected. If I use shuffle=True, though, I get a curve.
I am struggling to understand what shuffle does here and why I can learn something other than a straight line with one feature. I'd appreciate a hint.
EDIT: Here are the model's attributes
>>> #shuffle=False
>>> print(f"{regressor.coef_}")
[-1.6e-16]
>>> print(f"{regressor.intercept_}")
272.0575589244862
>>> #shuffle=True
>>> print(f"{regressor.coef_}")
[-7.76e-17]
>>> print(f"{regressor.intercept_}")
143.9711420915541
And for plotting:
start = 61000
stop = 61500
fig, ax1 = plt.subplots(figsize=(15, 5))

color = 'tab:red'
plt.plot(df.index[start:train_length].values.reshape(-1, 1),
         df.value[start:train_length].values.reshape(-1, 1),
         color=color)

color = 'tab:blue'
plt.plot(df.index[train_length:stop].values.reshape(-1, 1),
         df.value[train_length:stop].values.reshape(-1, 1),
         color=color)

color = 'tab:green'
plt.plot(df.index[start:train_length].values.reshape(-1, 1),
         y_pred_train[start:],
         color=color,
         linestyle='dashed')
plt.plot(df.index[train_length:stop].values.reshape(-1, 1),
         y_pred_test[:stop - train_length],
         color=color,
         linestyle='dashed')

ax1.tick_params(axis='y')
ax1.tick_params(axis='x')
You are not getting a curve. If you check the help page for train_test_split, it says:
shuffle : bool, default=True. Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
I assume your data is ordered according to df.time_in_s, so with shuffle=False you are fitting the regression on the first 70% of your data and predicting on the last 30%.
With shuffle=True, the order of the rows is shuffled, so you are training on a random 70% of the data and predicting on the remaining 30%, without considering time. You did not show the code for plotting, but my guess is you plotted the original data frame, with ordered time, and just placed the predictions on top, hence you get this fuzzy line.
Can you try printing out your regressor.coef_ and regressor.intercept_ for both cases?
Also how are you plotting the data?
Linear regression can only give you one weight and one bias if your input is 1D. The shuffle parameter only shuffles the data you pass it; it cannot make your model higher dimensional.
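One way to see this, assuming the shuffled X_test and y_pred_test from the question: sort the test points by their x value before plotting, and the apparent curve collapses back onto a single straight line.
import numpy as np
import matplotlib.pyplot as plt

# With shuffle=True the test rows are in random time order; sorting them
# by x shows the predictions still lie on one straight line
x_test = np.asarray(X_test).ravel()
order = np.argsort(x_test)
plt.plot(x_test[order], y_pred_test[order], color='tab:green', linestyle='dashed')
plt.xlabel('time_in_s')
plt.ylabel('predicted value')
plt.show()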

Plotting prediction from logistic regression

I would like to plot y_test and prediction in a scatter plot.
I am using the logistic regression as model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['Spam'])
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=27)
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
pred_log = lr.predict(X_test)
I have tried as follows
## Plot the model
plt.scatter(y_test, pred_log)
plt.xlabel("True Values")
plt.ylabel("Predictions")
and I got this:
which I do not think is what I should expect.
y_test has shape (250,); similarly, pred_log has shape (250,).
Am I considering the wrong variables to plot, or are they right?
I have no idea what the plot with those four points means. I would have expected more dots in the plot, but maybe I am wrong.
Please let me know if you need more info. Thanks
I think you know LogisticRegression is a classification algorithm. If you do binary classification, it will predict whether the predicted class is 0 or 1. If you want a visualization of how the model performs, you should consider a confusion matrix. You can't use a scatter plot to visualize classification results.
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred_log)
sns.heatmap(cm, annot=True)
The confusion matrix shows how many labels have correct predictions and how many are wrong. Looking at the confusion matrix, you can calculate how accurate the model is. We can also use different metrics like precision, recall and F1 score.
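A minimal sketch for those metrics, assuming the y_test and pred_log variables from above:
from sklearn.metrics import classification_report

# Precision, recall and F1 score per class, plus overall accuracy
print(classification_report(y_test, pred_log))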
