I'm trying to predict what a 15-minute delay in flight departure does to the flight's arrival time. I have thousands of rows as well as several columns in a DF. Two of these columns are dep_delay and arr_delay for departure delay and arrival delay. I have built a simple LinearRegression model:
y = nyc['dep_delay'].values.reshape((-1, 1))
arr_dep_model = LinearRegression().fit(y, nyc['arr_delay'])
And now I'm trying to find out the predicted arrival delay if the flights departure was delayed 15 minutes. How would I use the model above to predict what the arrival delay would be?
My first thought was to use a for loop / if statement, but then I came across .predict() and now I'm even more confused. Does .predict work like a boolean, where I would use "if departure delay is equal to 15, then arrival delay equals y"? Or is it something like:
arr_dep_model.predict(y)?
When working with LinearRegression models in sklearn you need to perform inference with the predict() function. But you also have to ensure the input you pass to the function has the correct shape (the same as the training data). You can learn more about the proper use of predict function in the official documentation.
arr_dep_model.predict(youtInput)
This line of code would output a value that the model predicted for a corresponding input. You can insert this into a for loop and traverse a set of values to serve as the model's input, it depends on the needs for your project and the data you are working with.
Hi Check below code for an example:`
import pandas as pd
import random
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({'x1':random.choices(range(0, 100), k=10), 'x2':random.choices(range(0, 100), k=10)})
df['y'] = df['x2'] * .5
X_train = df[['x1','x2']][:-3].values #Training on top 7 rows
y_train = df['y'][:-3].values #Training on top 7 rows
X_test = df[['x1','x2']][-3:].values # Values on which prediction will happen - bottom 3 rows
regr = LinearRegression()
regr.fit(X_train, y_train)
regr.predict(X_test)
If you will notice X_test the data on which prediction is happening is of same shape as (number of columns) as X_train both have two columns ['X1','X2']. Same has been converted in array when .values is used. You can create your own data (2 column dataframe for current example) & can use that for prediction (because 3rd column is need to be predicted).
Output will be three values as predicted on three rows:
Related
I have a Fan Speed (RPM) dataset of 192.405 Values (train+test values). I am training the ARIMA model and trying to predict the rest of the future values of our dataset and comparing the results.
While fitting the model in test data I am getting straight line for predictions
from sklearn.model_selection import train_test_split
from statsmodels.tsa.arima_model import ARIMA
dfx = df[(df['Tarih']>'2020-07-23') & (df['Tarih']<'2020-10-23')]
X_train = dfx[:int(dfx.shape[0]*0.8)] #2 months
X_test = dfx[int(dfx.shape[0]*0.8):] # rest, 1 months
model = ARIMA(X_train.Value, order=(4,1,4))
model_fit = model.fit(disp=0)
print(model_fit.summary())
test = X_test
train = X_train
What could i do now ?
Your ARIMA model uses the last 4 observations to make a prediction. The first prediction will be based on the four last known data points. The second prediction will be based on the first prediction and the last three known data points. The third prediction will be based on the first and second prediction and the last two known data points and so on. Your fifth prediction will be based entirely on predicted values. The hundredth prediction will be based on predicted values based on predicted values based on predicted values … Each prediction will have a slight deviation from the actual values. These prediction errors accumulate over time. This often leads to ARIMA simply prediction a straight line when you try to predict such large horizons.
If your model uses the MA component, represented by the q parameter, then you can only predict q steps into the future. That means your model is only able to predict the next four data points, after that the prediction will converge into a straight line.
I am currently trying to use RFE in python to select features for a stock prediction project. Since this is a regression task (predicting close price of next 10 days) I am using DecisionTreeRegressor as my model in RFE. Below is my code:
def performRFE(x, y, minFeatures):
rfecv = RFECV(
estimator=DecisionTreeRegressor(),
step=1,
cv=6,
min_features_to_select=minFeatures
)
rfecv.fit(x, y)
x = rfecv.transform(x)
// printing which features got selected
rfeOutput = list(zip(cols, list(rfecv.support_)))
for feature in rfeOutput:
print("{} : {}".format(feature[0], feature[1]))
I want to know if I am doing it right because for the same stock and the same set of original features, RFE gives a different result (picks a different subset of features) every time I run it. I plan to use the output of RFE in an LSTM model to predict the close price. Here x contains data such open,close,high,low,volume,EMAs, SMAs, RSI, etc. Is it normal to get different results from RFE each time?
I have a large dataset of 23k rows. That data looks like something below:
import pandas as pd
d = {'Date': ["1-1-2020", '1-1-2020', "1-2-2020", "1-2-2020"], 'Stock': ["FB", "F", "FB", "F"],
"last_price": [230,8,241,9], "price":[241,9,240,8.5]}
df = pd.DataFrame(data=d)
Date Stock_id last_price price
0 1-1-2020 5 230 241.0
1 1-1-2020 41 8 9.0
2 1-2-2020 5 241 240.0
3 1-2-2020 41 9 8.5
Note that data includes many stocks on many different dates. How can I create a model that uses the feature for example last_price and stock id to predict next-day price? And that uses the old data to re-train the data.
Now, this was the best thing I could do. I used LinearRegression but any other model advice can work.
X = df[['Stock_id', 'last_price']]
y = df[['price']]
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import linear_model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
result = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
Index Actual Predicted
487 45 32
4154 420 512
Is there a way where the model is trained on the first 3000 rows? Then the model makes a prediction for say date 12-11-2020 and then adds 12-11-2020 info to make the prediction for 12-12-2020 and so on?
I was hoping to get something like this.
Date Actual Predicted
12-11-2020 45 32
12-11-2020 420 512
12-12-2020 43 34
12-12-2020 423 513
I don't think having the id in your training dataset is appropriate since ids and comparing them does not give any useable information and may result in a bad calculated linear function for your model. ID just signifies that you are talking about a specific stock and is constant for a specific stock in the whole dataset. Also the value of the Stock_id cannot does not have any meaning that can be used for comparing stocks together, for example having a Stock_id = 1 and Stock_id = 2 doesn't mean these 2 are closer together than Stock_id = 1 and Stock_id = 100, they are just names. So I think you should split your original dataset based on the Stock_id and only include last_price in each of these new training datasets (X). You can do that in several ways, one them being the groupby function of pandas:
grouped = df.groupby(df.Stock_id)
stock_1= grouped.get_group(1)
After that, you can use a for loop on the unique value of your Stock_id column to get all the ids and their dataframes. Then you define a regression model for each of these new datasets and use the fit method to train it.
To retrain or update your regression model, LinearRegression does not support partial fit and I think you need to use the fit method again each time you want to update your model. You can use the first N rows of each user to fit the model, then predict the value for the next last_price and add the predicted value to the N rows and re-fit the model on the extended dataset. However, if your model actually calculates a good line to predict the data, I don't think you will see that much of a difference by adding new predictions to the training dataset.
Another option is to use SGDRegressor instead of LinearRegression, since it has a partial_fit() method allows for incremental training which lets you train your model on new data without re-training the model on the whole dataset. You can find the documentation for this model here. Also this answer here explains the difference between SGDRegressor and Linear Regression.
If you still want to use LinearRegression and retrain the model, I suggest you use batches of data for updating your model, instead of retraining it on each new predicted value. You can wait for your predicted values to get to a certain number, for example 10, and then add these 10 new values to your training dataset and retrain the model just once. This answer here explains 3 approaches in retraining the model which might be useful for you.
I have a pandas dataframe that I want to make predictions on and get the root mean squared error for each feature. I'm following an online guide that splits the dataset manually, but I thought it would be more convenient to use train_test_split from sklearn.model_selection. Unfortunately, I'm getting different results when looking at the rmse values after splitting the data manually vs using train_test_split.
A (hopefully) reproducible example:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['feature_1','feature_2','feature_3','feature_4'])
df['target'] = np.random.randint(2,size=100)
df2 = df.copy()
Here is a function, knn_train_test, that splits the data manually, fits the model, makes predictions, etc:
def knn_train_test(train_col, target_col, df):
knn = KNeighborsRegressor()
np.random.seed(0)
# Randomize order of rows in data frame.
shuffled_index = np.random.permutation(df.index)
rand_df = df.reindex(shuffled_index)
# Divide number of rows in half and round.
last_train_row = int(len(rand_df) / 2)
# Select the first half and set as training set.
# Select the second half and set as test set.
train_df = rand_df.iloc[0:last_train_row]
test_df = rand_df.iloc[last_train_row:]
# Fit a KNN model using default k value.
knn.fit(train_df[[train_col]], train_df[target_col])
# Make predictions using model.
predicted_labels = knn.predict(test_df[[train_col]])
# Calculate and return RMSE.
mse = mean_squared_error(test_df[target_col], predicted_labels)
rmse = np.sqrt(mse)
return rmse
rmse_results = {}
train_cols = df.columns.drop('target')
# For each column (minus `target`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
rmse_val = knn_train_test(col, 'target', df)
rmse_results[col] = rmse_val
# Create a Series object from the dictionary so
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
#Output
feature_3 0.541110
feature_2 0.548452
feature_4 0.559285
feature_1 0.569912
dtype: float64
Now, here is a function, knn_train_test2, that splits the data using train_test_split:
def knn_train_test2(train_col, target_col, df2):
knn = KNeighborsRegressor()
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(df2[[train_col]],df2[[target_col]], test_size=0.5)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)
mse = mean_squared_error(y_test,predictions)
rmse = np.sqrt(mse)
return rmse
rmse_results = {}
train_cols = df2.columns.drop('target')
for col in train_cols:
rmse_val = knn_train_test2(col, 'target', df2)
rmse_results[col] = rmse_val
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
# Output
feature_4 0.522303
feature_3 0.556417
feature_1 0.569210
feature_2 0.572713
dtype: float64
Why am I getting different results? I think I'm misunderstanding the split > train > test process in general, or maybe misunderstanding/mis-specifying train_test_split. Thank you in advance
Your custom train_test_split implementation differs from scikit-learn's implementation, that's why you get different results for the same seed.
Here you can find the official implementation. The first thing which is notable is, that scikit-learn is doing by default 10 iterations of re-shuffeling & splitting. (check the n_splits parameter)
Only if your approach is doing exactly the same as the scitkit-learn approach, then you can expect to have the same result for the same seed.
This is basic machine learning nature. When you manually split the data, you have a different version of training and testing set. When you use the sklearn function, you get different training and testing set. Your model will make prediction based on what training data it recieves and thus your final results are different for both.
If you want to reproduce result, then use the train_test_split to create multiple training set by setting a seed value. A seed value is used to reproduce the same result in the train_test_split function. Then when running your ml function, set a seed in there too as even ML functions start training with random weights. Try your model on these datasets with same seed and you will get the results.
Splitting data manually is just slicing but train_test_split will also randomize the sliced data. Try fix the random number seed and see if you can get same results each time when using train_test_split.
you can access the data set at this link https://drive.google.com/file/d/0B9Hd-26lI95ZeVU5cDY0ZU5MTWs/view?usp=sharing
My Task is to predict the price movement of a sector fund. How much it goes up or down doesn't really matter, I only want to know whether it's going up or down. So I define it as a classification problem.
Since this data set is a time-series data, I met many problems. I have read articles about these problems like I can't use k-fold cross validation since this is time series data. You can't ignore the order of the data.
my code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.svm import LinearSVC
from sklearn.svm import SVCenter code here
lag1 = pd.read_csv(#local file path, parse_dates=['Date'])
#Trend : if price going up: ture, otherwise false
lag1['Trend'] = lag1.XLF > lag1.XLF.shift()
train_size = round(len(lag1)*0.50)
train = lag1[0:train_size]
test = lag1[train_size:]
variable_to_use= ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
y_train = train['Trend']
X_train = train[variable_to_use]
y_test = test['Trend']
X_test = test[variable_to_use]
#SVM Lag1
this_C = 1.0
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
print('XLF Lag1 dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
#Check prediction results
clf.predict(X_test)
First of all, is my method right here : first generating a column of true and false? I am afraid the machine can't understand this column if I simply feed this column to it. Should I first perform a regression then compare the numeric result to generate a list of going up or down?
The accuracy on training set is very low at : 0.58 I am getting an array with all trues with clf.predict(X_test) which I don't know why I would get all trues.
And I don't know whether the resulting accuracy is calculated in which way: for example, I think my current accuracy only counts the number of true and false but ignoring the order of them? Since this is time-series data, ignoring the order is not right and gives me no information about predicting price movement. Let's say I have 40 examples in test set, and I got 20 Tures I would get 50% accuracy. But I guess the trues are no in the right position as it appears in the ground truth set. (Tell me if I am wrong)
I am also considering using Gradient Boosted Tree to do the classification, would it be better?
Some preprocessing of this data would probably be helpful. Step one might go something like:
df = pd.read_csv('YOURLOCALFILEPATH',header=0)
#more code than your method but labels rows as 0 or 1 and easy to output to new file for later reference
df['Date'] = pd.to_datetime(df['date'], unit='d')
df = df.set_index('Date')
df['compare'] = df['XLF'].shift(-1)
df['Label'] np.where(df['XLF']>df['compare'), 1, 0)
df.drop('compare', axis=1, inplace=True)
Step two can use one of sklearn's built in scalers, such as the MinMax scaler, to preprocess the data by scaling your feature inputs before feeding it into your model.