I'm currently working on a time-series model. It's very simple: I feed the last row's OHLC (open, high, low, close) values and try to predict the next close. Simple and useless. But what I want to do is give the model the last 10 days to predict tomorrow's price. I know it's not going to be accurate, but this is what I am trying to do.
Here is how I build the NextClose target and fit a linear regression model:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df = pd.read_csv("./EURUSD.csv")
days = 1
df['NextClose'] = df['Close'].shift(-days)
df = df.dropna()
total = len(df)
test_ratio = 0.30
test_size = int(total * test_ratio)
X = df[['Open', 'High', 'Low', 'Close']]
y = df[['NextClose']]
#build test and train data
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
# build model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
plt.scatter(y_pred, y_test)
plt.show()
In this case, I am feeding only the final row. What I want to do is feed the last 10-20 rows.
I believe this is a similar data transformation to the one described in the function create_dataset on this MachineLearningMastery page (see the section LSTM for Regression Using the Window Method).
The goal is to use the data from rows t through t+days-1 (the Python slice t:(t+days)) to predict the closing price at row t+days.
Each row of the X_train matrix will have days * X.shape[1] columns, which in the example below is the flattened data from 10 days' worth of rows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# generate random data to test
df = pd.DataFrame(np.random.normal(size=(2000, 4)))
df.columns = ['Open', 'High', 'Low', 'Close']
days = 10
df = df.dropna()
total = len(df)
test_ratio = 0.30
test_size = int(total * test_ratio)
X = df[['Open', 'High', 'Low', 'Close']]
y = df['Close'].shift(-days)
# this function is based on the MachineLearningMastery page mentioned above
def create_dataset(X, y, look_back=1):
    dataX, dataY = [], []
    for i in range(X.shape[0] - look_back):
        a = X.iloc[i:(i + look_back), :].values.flatten()
        dataX.append(a)
        dataY.append(y.iloc[i])  # y is already shifted, so this is the close right after the window
    return np.array(dataX), np.array(dataY)
#build test and train data
X_train, y_train = create_dataset(X[:-test_size], y[:-test_size], look_back=days)
X_test, y_test = create_dataset(X[-test_size:], y[-test_size:], look_back=days)
# build model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
plt.scatter(y_pred, y_test)
plt.show()
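As a quick sanity check on the shapes (using the random frame above: 2000 rows, test_size = 600, days = 10):
print(X_train.shape)  # (1390, 40): 1400 training rows minus the 10-row look-back; 10 days * 4 features
print(X_test.shape)   # (590, 40)
print(y_train.shape)  # (1390,)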
Hi, I'm working on a learning model and I seem to be stuck on the forecast_out logic in my scikit-learn script.
I need help creating forecasts not just daily, but hourly and even by the minute.
Thanks!
import pandas as pd
import datetime as dt
import pandas_datareader as reader
import fbprophet
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
end = dt.datetime.now()
start = dt.datetime(end.year-1,end.month,end.day,end.hour,end.minute)
df = reader.get_data_yahoo('BTC-USD',start,end)
print(df.head())
df = df[['Adj Close']]
print(df.head())
forecast_out = 30  # (I want to be able to predict hourly and even by the minute as well)
df['Prediction'] = df[['Adj Close']].shift(-forecast_out)
print(df.tail())
X = np.array(df.drop(['Prediction'], axis=1))
X = X[:-forecast_out]
print(X)
y = np.array(df['Prediction'])
y = y[:-forecast_out]
print(y)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
svr_rbf.fit(x_train, y_train)
svm_confidence = svr_rbf.score(x_test, y_test)
print("svm confidence: ", svm_confidence)
lr = LinearRegression()
lr.fit(x_train, y_train)
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)
x_forecast = np.array(df.drop(['Prediction'], axis=1))[-forecast_out:]
print(x_forecast)
lr_prediction = lr.predict(x_forecast)
print(lr_prediction)
svm_prediction = svr_rbf.predict(x_forecast)
print(svm_prediction)
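One possible direction for the hourly/minute part, sketched under the assumption that an intraday price source is available (the Yahoo reader above only returns daily bars, so its data cannot simply be upsampled): resample the intraday series to the desired frequency and reuse the same shift trick.
# Sketch: `intraday` is assumed to be a DataFrame of prices with a DatetimeIndex
hourly = intraday[['Adj Close']].resample('H').last().dropna()
forecast_out = 30  # now means 30 hours ahead instead of 30 days
hourly['Prediction'] = hourly[['Adj Close']].shift(-forecast_out)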
I am learning time-series forecasting. Here is the code, and it works, but the predictions are only ever compared against the test data. Given today's stock prices in X, how can I predict tomorrow's stock Y (MSFT) price? That is the main goal of the prediction. Thanks.
import numpy as np
import pandas as pd
import pandas_datareader.data as web
# Error Metrics
from sklearn.metrics import mean_squared_error
# Time series Models
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot
from datetime import datetime, timedelta
# Disable the warnings
import warnings
warnings.filterwarnings('ignore')
stk_tickers = ['MSFT', 'IBM', 'GOOGL']
ccy_tickers = ['DEXJPUS', 'DEXUSUK']
idx_tickers = ['SP500', 'DJIA', 'VIXCLS']
# start_date and end_date are not defined in the original post; example values:
start_date = datetime(2015, 1, 1)
end_date = datetime(2020, 1, 1)
stk_data = web.DataReader(stk_tickers, 'yahoo', start_date, end_date)
ccy_data = web.DataReader(ccy_tickers, 'fred')
idx_data = web.DataReader(idx_tickers, 'fred')
return_period = 5
Y = np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(return_period).shift(-return_period)
Y.name = Y.name[-1]+'_pred'
X1 = np.log(stk_data.loc[:, ('Adj Close', ('GOOGL', 'IBM'))]).diff(return_period)
X1.columns = X1.columns.droplevel()
X2 = np.log(ccy_data).diff(return_period)
X3 = np.log(idx_data).diff(return_period)
X4 = pd.concat([np.log(stk_data.loc[:, ('Adj Close', 'MSFT')]).diff(i) for i in [return_period, return_period*3, return_period*6, return_period*12]], axis=1).dropna()
X4.columns = ['MSFT_DT', 'MSFT_3DT', 'MSFT_6DT', 'MSFT_12DT']
X = pd.concat([X1, X2, X3, X4], axis=1)
dataset = pd.concat([Y, X], axis=1).dropna().iloc[::return_period, :]
Y = dataset.loc[:, Y.name]
X = dataset.loc[:, X.columns]
validation_size = 0.2
# If the data did not depend on time, the train and test split could be done randomly:
# seed = 7
# X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Since the data is a time series, the train and test split must be sequential.
# This is done by selecting a split point in the ordered observations and creating two new datasets.
train_size = int(len(X) * (1-validation_size))
X_train, X_test = X[0:train_size], X[train_size:len(X)]
Y_train, Y_test = Y[0:train_size], Y[train_size:len(X)]
X_train_ARIMA = X_train.loc[:, ['GOOGL', 'IBM', 'DEXJPUS', 'SP500', 'DJIA', 'VIXCLS']]
X_test_ARIMA = X_test.loc[:, ['GOOGL', 'IBM', 'DEXJPUS', 'SP500', 'DJIA', 'VIXCLS']]
tr_len = len(X_train_ARIMA)
te_len = len(X_test_ARIMA)
to_len = len(X)
modelARIMA = ARIMA(endog=Y_train, exog=X_train_ARIMA, order=(2, 0, 1))
model_fit = modelARIMA.fit()
error_Training_ARIMA = mean_squared_error(Y_train, model_fit.fittedvalues)
predicted = model_fit.predict(start=tr_len - 1, end=to_len - 1, exog=X_test_ARIMA)[1:]
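For the 'tomorrow' part: with the old statsmodels ARIMA API used above, one option (a sketch based on that API) is to ask the fitted model for an out-of-sample step, supplying the matching exogenous row.
# forecast() continues from the end of the training sample, so the exog row
# passed here must correspond to the period being forecast; in production you
# would refit on all available data and pass today's predictor row instead.
next_val, stderr, conf_int = model_fit.forecast(steps=1, exog=X_test_ARIMA.iloc[[0]])
print(next_val)  # one-step-ahead forecast of the MSFT return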
Not a lot of wisdom here... but I have a script that compiles and tests the algorithm twice, via the for i in range loop, to see whether there is any variation in the root mean squared error.
Is it possible to modify the code so the loop tests two different datasets? I.e., df would run first and compute its RMSE, then df2 would run and compute its RMSE, and then I could compare/print the RMSE of the two. Both datasets have the same ['Demand'] response variable.
#Test random forest
import numpy as np
from sklearn import preprocessing, neighbors
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error  # used below; missing from the original imports
import math
rmses = []
for i in range(2):
    X = np.array(df2.drop(['Demand'], axis=1))
    y = np.array(df2['Demand'])
    offset = int(X.shape[0] * 0.7)
    X_train, y_train = X[:offset], y[:offset]
    X_test, y_test = X[offset:], y[offset:]
    clf = RandomForestRegressor(n_estimators=60, min_samples_split=6)
    clf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, clf.predict(X_test))
    rmse = math.sqrt(mse)
    print("rmse: %.4f" % rmse)
    rmses.append(rmse)
print(sum(rmses) / len(rmses))
You can create a list of dfs and iterate over that:
rmses = []
df_lst = [df1, df2]
for df in df_lst:
    X = np.array(df.drop(['Demand'], axis=1))
    y = np.array(df['Demand'])
    offset = int(X.shape[0] * 0.7)
    X_train, y_train = X[:offset], y[:offset]
    X_test, y_test = X[offset:], y[offset:]
    clf = RandomForestRegressor(n_estimators=60, min_samples_split=6)
    clf.fit(X_train, y_train)
    mse = mean_squared_error(y_test, clf.predict(X_test))
    rmse = math.sqrt(mse)
    print("rmse: %.4f" % rmse)
    rmses.append(rmse)
print(sum(rmses) / len(rmses))
You could use an auxiliary df and assign the dataframe you want to run on each iteration using a condition:
for i in range(2):
    if i == 0:
        aux_df = df
    else:
        aux_df = df2
.
.
.
That way the first iteration uses df and the second iteration uses df2.
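Put together, a minimal version of that pattern might look like this (assuming both frames share the ['Demand'] layout):
for i in range(2):
    aux_df = df if i == 0 else df2  # pick the dataset for this pass
    X = np.array(aux_df.drop(['Demand'], axis=1))
    y = np.array(aux_df['Demand'])
    # ... then the same split / fit / RMSE code as above, run against aux_df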
I tried to develop a model that forecasts two time steps ahead.
To that end I modified a GitHub example for single-step forecasting, coding a load_data function that takes n steps backward in the X_train/test series and sets them against a 2-element y_train/test array.
I set the neurons list so that the final Dense layer outputs a 2-vector.
And last I wrote a predict function and a plot function for the 2-step forecast.
I have not yet normalized the features, labels, and forecasts consistently; I will do that in the future.
After a bit of hyperparameter tuning it returns a good score for the MSE and RMSE:
Train Score: 0.00000 MSE (0.00 RMSE)
Test Score: 0.00153 MSE (0.04 RMSE)
It captures the trend quite well, but every forecast it returns has a negative direction.
Does anyone have a suggestion?
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt2
import pandas as pd
import math, time
import itertools
from sklearn import preprocessing
import datetime
from sklearn.metrics import mean_squared_error
from math import sqrt
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.recurrent import LSTM
from keras.models import load_model
import keras
from numpy import newaxis
pd.core.common.is_list_like = pd.api.types.is_list_like
import pandas_datareader.data as web
import h5py
from keras import backend as K
import quandl
quandl.ApiConfig.api_key = 'myQuandlKey'
seq_len = 2
shape = [seq_len, 9, 2]
neurons = [256, 256, 64, 2]
dropout = 0.2
decay = 0.5
epochs = 100
stock_name = 'AAPL'
global_start_time = time.time()
def get_stock_data(stock_name, normalize=True, ma=[]):
    """
    Return a dataframe of that stock and normalize all the values.
    (Optional: create moving averages)
    """
    df = quandl.get_table('WIKI/PRICES', ticker=stock_name)
    df.drop(['ticker', 'open', 'high', 'low', 'close', 'ex-dividend', 'volume', 'split_ratio'], axis=1, inplace=True)
    df.set_index('date', inplace=True)
    # Renaming all the columns so that we can use the old version of the code
    df.rename(columns={'adj_open': 'Open', 'adj_high': 'High', 'adj_low': 'Low', 'adj_volume': 'Volume', 'adj_close': 'Adj Close'}, inplace=True)
    # Percentage change
    df['Pct'] = df['Adj Close'].pct_change()
    df.dropna(inplace=True)
    # Moving averages
    if ma != []:
        for moving in ma:
            df['{}ma'.format(moving)] = df['Adj Close'].rolling(window=moving).mean()
    df.dropna(inplace=True)
    if normalize:
        min_max_scaler = preprocessing.MinMaxScaler()
        df['Open'] = min_max_scaler.fit_transform(df.Open.values.reshape(-1, 1))
        df['High'] = min_max_scaler.fit_transform(df.High.values.reshape(-1, 1))
        df['Low'] = min_max_scaler.fit_transform(df.Low.values.reshape(-1, 1))
        df['Volume'] = min_max_scaler.fit_transform(df.Volume.values.reshape(-1, 1))
        df['Adj Close'] = min_max_scaler.fit_transform(df['Adj Close'].values.reshape(-1, 1))
        df['Pct'] = min_max_scaler.fit_transform(df['Pct'].values.reshape(-1, 1))
        if ma != []:
            for moving in ma:
                df['{}ma'.format(moving)] = min_max_scaler.fit_transform(df['{}ma'.format(moving)].values.reshape(-1, 1))
    # Move Adj Close to the rightmost column for ease of training
    adj_close = df['Adj Close']
    df.drop(labels=['Adj Close'], axis=1, inplace=True)
    df = pd.concat([df, adj_close], axis=1)
    #df.to_csv('aap.csv')
    return df
df = get_stock_data(stock_name, ma=[50, 100, 200])
def plot_stock(df):
    print(df.head())
    plt.subplot(211)
    plt.plot(df['Adj Close'], color='red', label='Adj Close')
    plt.legend(loc='best')
    plt.subplot(212)
    plt.plot(df['Pct'], color='blue', label='Percentage change')
    plt.legend(loc='best')
    plt.show()
#plot_stock(df)
def load_data(stock, seq_len):
    amount_of_features = len(stock.columns)
    data = stock.values
    sequence_length = seq_len + 2  # look-back window plus the 2 forecast steps
    result = []
    for index in range(len(data) - sequence_length):  # last usable start = latest date - sequence length
        result.append(data[index: index + sequence_length])
    result = np.array(result)
    row = round(0.8 * result.shape[0])  # 80% split
    train = result[:int(row), :, :]
    X_train = train[:, :-2, :]            # all features up to day m
    y_train = train[:, -2:, :][:, :, -1]  # adjusted close for days m+1 and m+2
    X_test = result[int(row):, :-2, :]
    y_test = result[int(row):, -2:, :][:, :, -1]
    X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], amount_of_features))
    X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], amount_of_features))
    print("________________________________________________________________")
    print("X_train shape = {}".format(X_train.shape))
    print("y_train shape = {}".format(y_train.shape))
    print("")
    print("X_test shape = {}".format(X_test.shape))
    print("y_test shape = {}".format(y_test.shape))
    print("________________________________________________________________")
    return [X_train, y_train, X_test, y_test]
X_train, y_train, X_test, y_test = load_data(df, seq_len)
def build_model(shape, neurons, dropout, decay):
    model = Sequential()
    model.add(LSTM(neurons[0], input_shape=(shape[0], shape[1]), return_sequences=True))
    model.add(Dropout(dropout))
    model.add(LSTM(neurons[1], input_shape=(shape[0], shape[1]), return_sequences=False))
    model.add(Dropout(dropout))
    model.add(Dense(neurons[2], kernel_initializer="uniform", activation='relu'))
    model.add(Dense(neurons[3], kernel_initializer="uniform", activation='linear'))
    # model = load_model('my_LSTM_stock_model1000.h5')
    adam = keras.optimizers.Adam(decay=decay)
    # pass the configured optimizer object; the original passed the string
    # 'adam', which silently ignored the decay setting
    model.compile(loss='mse', optimizer=adam, metrics=['accuracy'])
    model.summary()
    return model
model = build_model(shape, neurons, dropout, decay)
model.fit(
    X_train,
    y_train,
    batch_size=512,
    epochs=epochs,
    validation_split=0.01,
    verbose=1)
def model_score(model, X_train, y_train, X_test, y_test):
    trainScore = model.evaluate(X_train, y_train, verbose=0)
    print('Train Score: %.5f MSE (%.2f RMSE)' % (trainScore[0], math.sqrt(trainScore[0])))
    testScore = model.evaluate(X_test, y_test, verbose=0)
    print('Test Score: %.5f MSE (%.2f RMSE)' % (testScore[0], math.sqrt(testScore[0])))
    return trainScore[0], testScore[0]
model_score(model, X_train, y_train, X_test, y_test)
def percentage_difference(model, X_test, y_test):
    percentage_diff = []
    p = model.predict(X_test)
    for u in range(len(y_test)):  # for each index in the test data
        pr = p[u][0]  # pr = prediction on day u
        # note the parentheses: the original computed pr - (y_test[u] / pr)
        percentage_diff.append(((pr - y_test[u]) / pr) * 100)
    print('Prediction duration: ', str(datetime.timedelta(seconds=(time.time() - global_start_time))))
    return p
def denormalize(stock_name, normalized_value):
    """
    Re-fetch the raw Adj Close prices, fit the same MinMax scaler on them,
    and map the normalized values back to price space.
    """
    df = quandl.get_table('WIKI/PRICES', ticker=stock_name)
    df.drop(['ticker', 'open', 'high', 'low', 'close', 'ex-dividend', 'volume', 'split_ratio'], axis=1, inplace=True)
    df.set_index('date', inplace=True)
    # Renaming all the columns so that we can use the old version of the code
    df.rename(columns={'adj_open': 'Open', 'adj_high': 'High', 'adj_low': 'Low', 'adj_volume': 'Volume', 'adj_close': 'Adj Close'}, inplace=True)
    df.dropna(inplace=True)
    df = df['Adj Close'].values.reshape(-1, 1)
    normalized_value = normalized_value.reshape(-1, 1)
    #return df.shape, p.shape
    min_max_scaler = preprocessing.MinMaxScaler()
    a = min_max_scaler.fit_transform(df)
    new = min_max_scaler.inverse_transform(normalized_value)
    return new
def plot_result(stock_name, normalized_value_p, normalized_value_y_test):
    newp = denormalize(stock_name, normalized_value_p)
    newy_test = denormalize(stock_name, normalized_value_y_test)
    #newy_test = np.roll(newy_test, 1, 0)
    plt2.plot(newp, color='red', label='Prediction')
    plt2.plot(newy_test, color='blue', label='Actual')
    plt2.legend(loc='best')
    plt2.title('Global run time {}'.format(str(datetime.timedelta(seconds=(time.time() - global_start_time)))))
    plt2.xlabel('Days')
    plt2.ylabel('Adjusted Close')
    plt2.show()
def predict_sequences_multiple(model, data, window_size, prediction_len):
    # Predict prediction_len steps, then shift the prediction run forward by prediction_len steps
    prediction_seqs = []
    for i in range(int(len(data) / prediction_len)):
        curr_frame = data[i * prediction_len]
        predicted = []
        for j in range(prediction_len):
            predicted.append(model.predict(curr_frame[newaxis, :, :])[0, 0])
            curr_frame = curr_frame[1:]
            curr_frame = np.insert(curr_frame, [window_size - 1], predicted[-1], axis=0)
        prediction_seqs.append(predicted)
    print('Prediction duration: ', str(datetime.timedelta(seconds=(time.time() - global_start_time))))
    return prediction_seqs
def plot_results_multiple(predicted_data, true_data, prediction_len):
    fig = plt.figure(facecolor='white')
    ax = fig.add_subplot(111)
    ax.plot(true_data, label='True Data')
    # Pad the list of predictions to shift each one to its correct start in the graph
    for i, data in enumerate(predicted_data):
        padding = [None for p in range(i * prediction_len)]
        plt.plot(padding + data, label='Prediction')
        plt.legend()
    plt.title('Global run time {}'.format(str(datetime.timedelta(seconds=(time.time() - global_start_time)))))
    plt.show()
#Single step prediction
#p = percentage_difference(model, X_test, y_test)
#plot_result(stock_name, p, y_test)
#Multiple step prediction
predictions = predict_sequences_multiple(model, X_test, seq_len, 2)
plot_results_multiple(predictions, y_test, 2)
Below is what I have done so far.
#importing the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LassoCV
from sklearn.linear_model import ElasticNetCV
from sklearn.ensemble import RandomForestRegressor
filepath = r"C:\Users...Kaggle data\house prediction iowa\house_predtrain (3).csv"
train = pd.read_csv(filepath)
print(train.shape)
filepath2 = r"C:\Users...Kaggle data\house prediction iowa\house_predtest (1).csv"
test = pd.read_csv(filepath2)
print(test.shape)
#first we replace all the NaNs with 0 in both the train and test data
train = train.fillna(0)
test = test.fillna(0) #error one
train.dtypes.value_counts()
#isolating all the object/categorical features and converting them to numeric features
encode_cols = train.dtypes[train.dtypes == object]
encode_cols2 = test.dtypes[test.dtypes == object]
#print(encode_cols)
encode_cols = encode_cols.index.tolist()
encode_cols2 = encode_cols2.index.tolist()
print(encode_cols2)
# Do the one hot encoding
train_dummies = pd.get_dummies(train, columns=encode_cols)
test_dummies = pd.get_dummies(test, columns=encode_cols2)
#align your test and train data (error2)
train, test = train_dummies.align(test_dummies, join = 'left', axis = 1)
print(train.shape)
print(test.shape)
#Now working with Floats features
numericals_floats = train.dtypes == np.float64
numericals = train.columns[numericals_floats]
print(numericals)
#we check for skewness in the float data
skew_limit = 0.35
skew_vals = train[numericals].skew()
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0: 'Skewness'})
             .query('abs(Skewness) > {}'.format(skew_limit)))  # apply the threshold; skew_limit was defined but unused in the original
skew_cols
#Visualising the above data before and after log transforming
%matplotlib inline
field = 'GarageYrBlt'
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10,5))
train[field].hist(ax=ax_before)
train[field].apply(np.log1p).hist(ax=ax_after)
ax_before.set (title = 'Before np.log1p', ylabel = 'frequency', xlabel = 'Value')
ax_after.set (title = 'After np.log1p', ylabel = 'frequency', xlabel = 'Value')
fig.suptitle('Field: "{}"'.format (field));
#note how applying log transformation on GarageYrBuilt does not do much
print(skew_cols.index.tolist())  # returns a list of the skewed columns
for i in skew_cols.index.tolist():
    if i == "SalePrice":  # we do not want to transform the feature to be predicted
        continue
    train[i] = train[i].apply(np.log1p)
    test[i] = test[i].apply(np.log1p)
feature_cols = [x for x in train.columns if x != ('SalePrice')]
X_train = train[feature_cols]
y_train = train['SalePrice']
X_test = test[feature_cols]
y_test = train['SalePrice']  # note: the Kaggle test set has no SalePrice, so the train target is reused here
print(X_test.shape)
print(y_train.shape)
print(X_train.shape)
#now to the most fun part. Feature engineering is over!!!
#i am going to use linear regression, L1 regularization, L2 regularization and ElasticNet(blend of L1 and L2)
#first up, Linear Regression
alphas = [0.00005, 0.0005, 0.005, 0.05, 0.5, 0.1, 0.3, 1, 3, 5, 10, 25, 50, 100]  # values I chose
l1_ratios = np.linspace(0.1, 0.9, 9)
#LinearRegression
linearRegression = LinearRegression().fit(X_train, y_train)
prediction1 = linearRegression.predict(X_test)
LR_score = linearRegression.score(X_train, y_train)
print(LR_score)
#ridge
ridgeCV = RidgeCV(alphas=alphas).fit(X_train, y_train)
prediction2 = ridgeCV.predict(X_test)
R_score = ridgeCV.score(X_train, y_train)
print(R_score)
#lasso
lassoCV = LassoCV(alphas=alphas, max_iter=100).fit(X_train, y_train)
prediction3 = lassoCV.predict(X_test)
L_score = lassoCV.score(X_train, y_train)
print(L_score)
#elasticNetCV
elasticnetCV = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, max_iter=100).fit(X_train, y_train)
prediction4 = elasticnetCV.predict(X_test)
EN_score = elasticnetCV.score(X_train, y_train)
print(EN_score)
from sklearn.ensemble import RandomForestRegressor
randfr = RandomForestRegressor()
randfr = randfr.fit(X_train, y_train)
prediction5 = randfr.predict(X_test)
print(prediction5.shape)
RF_score = randfr.score(X_train, y_train)
print(RF_score)
#putting it all together (note: these are R^2 values from .score(), not RMSE)
rmse_vals = [LR_score, R_score, L_score, EN_score, RF_score]
labels = ['Linear', 'Ridge', 'Lasso', 'ElasticNet', 'RandomForest']
rmse_df = pd.Series(rmse_vals, index=labels).to_frame()
rmse_df.rename(columns={0: 'SCORES'}, inplace=True)
rmse_df
KaggleHouse_submission_1 = pd.DataFrame({'Id': test.Id, 'SalePrice': prediction5})
print(KaggleHouse_submission_1.shape)
In the Kaggle house prediction there is a train dataset and a test dataset; here is the link to the actual data. The output dataframe size should be 1459 x 2, but mine is 1460 x 2 for some reason. I am not sure why this is happening. Any feedback is highly appreciated.
In the following line:
test = train.fillna(0)
you are assigning (overwriting) the test variable with the train data ...
Scikit-learn is very sensitive to the ordering of columns, so if your train dataset and test dataset are misaligned, you may have a problem similar to the one above. You therefore first need to ensure that the test data is encoded the same as the train data, using the following align command:
train, test = train_dummies.align(test_dummies, join='left', axis = 1)
See the changes in my code above.
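To see what align does here, a tiny standalone sketch (made-up frames, not the Kaggle data):
import pandas as pd
a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
b = pd.DataFrame({'y': [5, 6], 'z': [7, 8]})
a2, b2 = a.align(b, join='left', axis=1)
print(list(b2.columns))  # ['x', 'y'] -- b is reindexed to a's columns: 'x' becomes NaN, 'z' is dropped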