Unable to use Datetime data in SVM model

Unable to use Datetime data in SVM model - python

I have a dataframe to predict the energy consumption. The columns. are Timestamp and Daily KWH system.
When used in the SVM model, I'm getting Value error as below:
ValueError: Unknown label type: array([ 0. , 127.2264855 , 80.74373643, ..., 7.67569729,
3.32998307, 2.08538807])
Dataset consists of energy consumption for every half hour from Sept to Dec.
Here's a sample dataset
Timestamp Daily_KWH_System
0 2016-09-07 19:47:07 148.978580
1 2016-09-07 19:47:07 104.084760
2 2016-09-07 19:47:07 111.850947
3 2016-09-07 19:47:07 8.421390
4 2016-12-15 02:48:07 13.778317
5 2016-12-15 02:48:07 0.637790
So far I have done :
Read the CSV
data = pd.read_csv('C:/Users/anagha/Documents/Python Scripts/Half_Ho.csv')
Indexing
data['Timestamp'] = pd.to_datetime(data['Timestamp'])
data.index = data['Timestamp']
del data['Timestamp']
data
Plot the graph
data.resample('D', how='mean').plot()
Splitting into Train and Test
from sklearn.utils import shuffle
test = shuffle(test)
train = shuffle(train)
trainData = train.drop('Daily_KWH_System' , axis=1).values
trainLabel = train.Daily_KWH_System.values
testData = test.drop('Daily_KWH_System' , axis=1).values
testLabel = test.Daily_KWH_System.values
SVM Model
from sklearn import svm
model = svm.SVC(kernel='linear', gamma=1)
model.fit(trainData,trainLabel)
model.score(trainData,trainLabel)
Predict Output
predicted= model.predict(testData)
print(predicted)

SVC is Support Vector Classification. Using it will treat your labels categorically. It looks like you're actually trying to do regression. (Note your error, "unknown label type"). A good first step would be to check out SVR. Or you could bin your values to classes, e.g. 0-10, 10-20, etc.:
sklearn SVR

Related

How to create a model on time series data and update it?

I have a large dataset of 23k rows. That data looks like something below:
import pandas as pd
d = {'Date': ["1-1-2020", '1-1-2020', "1-2-2020", "1-2-2020"], 'Stock': ["FB", "F", "FB", "F"],
"last_price": [230,8,241,9], "price":[241,9,240,8.5]}
df = pd.DataFrame(data=d)
Date Stock_id last_price price
0 1-1-2020 5 230 241.0
1 1-1-2020 41 8 9.0
2 1-2-2020 5 241 240.0
3 1-2-2020 41 9 8.5
Note that data includes many stocks on many different dates. How can I create a model that uses the feature for example last_price and stock id to predict next-day price? And that uses the old data to re-train the data.
Now, this was the best thing I could do. I used LinearRegression but any other model advice can work.
X = df[['Stock_id', 'last_price']]
y = df[['price']]
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import linear_model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
lm = linear_model.LinearRegression()
lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
result = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
Index Actual Predicted
487 45 32
4154 420 512
Is there a way where the model is trained on the first 3000 rows? Then the model makes a prediction for say date 12-11-2020 and then adds 12-11-2020 info to make the prediction for 12-12-2020 and so on?
I was hoping to get something like this.
Date Actual Predicted
12-11-2020 45 32
12-11-2020 420 512
12-12-2020 43 34
12-12-2020 423 513

I don't think having the id in your training dataset is appropriate since ids and comparing them does not give any useable information and may result in a bad calculated linear function for your model. ID just signifies that you are talking about a specific stock and is constant for a specific stock in the whole dataset. Also the value of the Stock_id cannot does not have any meaning that can be used for comparing stocks together, for example having a Stock_id = 1 and Stock_id = 2 doesn't mean these 2 are closer together than Stock_id = 1 and Stock_id = 100, they are just names. So I think you should split your original dataset based on the Stock_id and only include last_price in each of these new training datasets (X). You can do that in several ways, one them being the groupby function of pandas:
grouped = df.groupby(df.Stock_id)
stock_1= grouped.get_group(1)
After that, you can use a for loop on the unique value of your Stock_id column to get all the ids and their dataframes. Then you define a regression model for each of these new datasets and use the fit method to train it.
To retrain or update your regression model, LinearRegression does not support partial fit and I think you need to use the fit method again each time you want to update your model. You can use the first N rows of each user to fit the model, then predict the value for the next last_price and add the predicted value to the N rows and re-fit the model on the extended dataset. However, if your model actually calculates a good line to predict the data, I don't think you will see that much of a difference by adding new predictions to the training dataset.
Another option is to use SGDRegressor instead of LinearRegression, since it has a partial_fit() method allows for incremental training which lets you train your model on new data without re-training the model on the whole dataset. You can find the documentation for this model here. Also this answer here explains the difference between SGDRegressor and Linear Regression.
If you still want to use LinearRegression and retrain the model, I suggest you use batches of data for updating your model, instead of retraining it on each new predicted value. You can wait for your predicted values to get to a certain number, for example 10, and then add these 10 new values to your training dataset and retrain the model just once. This answer here explains 3 approaches in retraining the model which might be useful for you.

Python/Pandas - confusion around ARIMA forecasting to get simple predictions

Trying to wrap my head around how to implement an ARIMA model to produce (arguably) simple forecasts. Essentially what I'm looking to do is forecast this year's bookings up until the end of the year and export as a csv. Looking something like this:
date bookings
2017-01-01 438
2017-01-02 167
...
2017-12-31 45
2018-01-01 748
...
2018-11-29 223
2018-11-30 98
...
2018-12-30 73
2018-12-31 100
Where anything greater than today (28/11/18) is forecasted.
What I've tried to do:
This gives me my dataset, which is basically two columns, data on a daily basis for whole of 2017 and bookings:
import pandas as pd
import statsmodels.api as sm
# from statsmodels.tsa.arima_model import ARIMA
# from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
df = pd.read_csv('data.csv',names = ["date","bookings"],index_col=0)
df.index = pd.to_datetime(df.index)
This is the 'modelling' bit:
X = df.values
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
for t in range(len(test)):
model = ARIMA(history, order=(1,1,0))
model_fit = model.fit(disp=0)
output = model_fit.forecast()
yhat = output[0]
predictions.append(yhat)
obs = test[t]
history.append(obs)
# print('predicted=%f, expected=%f' % (yhat, obs))
#error = mean_squared_error(test, predictions)
#print(error)
#print('Test MSE: %.3f' % error)
# plot
plt.figure(num=None, figsize=(15, 8))
plt.plot(test)
plt.plot(predictions, color='red')
plt.show()
Exporting results to a csv:
df_forecast = pd.DataFrame(predictions)
df_test = pd.DataFrame(test)
result = pd.merge(df_test, df_forecast, left_index=True, right_index=True)
result.rename(columns = {'0_x': 'Test', '0_y': 'Forecast'}, inplace=True)
The trouble I'm having is:
Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?
2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?
What I think I need to do:
Grab my bookings dataset of 2017 and 2018 data from my database
Split it by 2017 and 2018
Produce some forecasts on 2018
Append this 2018+forecast data to 2017 and export as csv
The how-to and why is the problem I'm having.
Any help would be much appreciated

Here are some thoughts:
Understanding the train/test subsets. Correct me if I'm wrong but the Train set is used to train the model and produce the 'predictions' data and then the Test is there to compare the predictions against the test?
Yes that is correct. The idea is the same as any Machine Learning model, the data is split in train/test, a model is fit using the train data, and the test is used to compare using some error metrics the actual model predictions with the real data. However as you are dealing with time series data, the train/test split must be performed respecting the time sequence, as you already do.
2017 data looked good, but how do I implement it on 2018 data? How do I get the Train/Test sets? Do I even need it?
Do you actually have a csv with the 2018 data? All you need to do to split in train/test is the same as you do for the 2017 data, i.e keep up until some size as train, and leave the end to test your predictions train, test = X[0:size], X[size:len(X)]. However, if what you want is a prediction of today's date onwards, why not use all historical data as input to the model and use that to forecast?
What I think I need to do
Split it by 2017 and 2018
Why would you want to split it? Simply feed your ARIMA model all your data as a single time series sequence, thus appending both of your yearly data, and use the last size samples as test. Take into account that the estimate gets better the larger the sample size. Once you've validated the performance of the model, use it to predict from today onwards.

Any difference between H2O and Scikit-Learn metrics scoring?

I tried to use H2O to create some machine learning models for binary classification problem, and the test results are pretty good. But then I checked and found something weird. I tried to print the prediction of the model for the test set out of curiosity. And I found out that my model actually predicts 0 (negative) all the time, but the AUC is around 0.65, and precision is not 0.0. Then I tried to use Scikit-learn just to compare the metrics scores, and (as expected) they’re different. The Scikit learn yielded 0.0 precision and 0.5 AUC score, which I think is correct. Here's the code that I used:
model = h2o.load_model(model_path)
predictions = model.predict(Test_data).as_data_frame()
# H2O version to print the AUC score
auc = model.model_performance(Test_data).auc()
# Python version to print the AUC score
auc_sklearn = sklearn.metrics.roc_auc_score(y_true, predictions['predict'].tolist())
Any thought? Thanks in advance!

There is no difference between H2O and scikit-learn scoring, you just need to understand how to make sense of the output so you can compare them accurately.
If you'll look at the data in predictions['predict'] you'll see that it's a predicted class, not a raw predicted value. AUC uses the latter, so you'll need to use the correct column. See below:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# Generate predictions on a test set
pred = model.predict(test)
Examine the output:
In [4]: pred.head()
Out[4]:
predict p0 p1
--------- -------- --------
0 0.715077 0.284923
0 0.778536 0.221464
0 0.580118 0.419882
1 0.316875 0.683125
0 0.71118 0.28882
1 0.342766 0.657234
1 0.297636 0.702364
0 0.594192 0.405808
1 0.513834 0.486166
0 0.70859 0.29141
[10 rows x 3 columns]
Now compare to sklearn:
from sklearn.metrics import roc_auc_score
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
# 0.78170751032654806
Here you see that they are approximately the same. AUC is an approximate method, so you'll see differences after a few decimal places when you compare different implementations.

Isolation Forest Implementation

I would like to use Isolation Forest for identifying the Outlier's in my dataset.
Training set contains 4000 records with 40 feature columns with value 1 or 0.
I know how to use the Isolation Forest for 2 features using the sample example given in scikit learn.
How do I use all the 40 Features and see the outliers ?

I simplified the scikit example a bit. X is your Dataset with 40 features and 4000 rows. In this example it is 3 features and 100 rows. You fit the classifier with clf.fit(X) to your numerical data X, to learn the classifier the "boundaries" of your data. In the next step you classify the same data X with respect to your learned model and get an array y with 100 entries, one for each row in your dataset. Each entry in y is -1 (Outlier) or 1 (Inliner).
import numpy as np
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
# Generate train data
s = rng.randn(100, 5)
X = np.r_[s + 2, s - 2, s - 5]
# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X)
y = clf.predict(X)

Time Series Classification

you can access the data set at this link https://drive.google.com/file/d/0B9Hd-26lI95ZeVU5cDY0ZU5MTWs/view?usp=sharing
My Task is to predict the price movement of a sector fund. How much it goes up or down doesn't really matter, I only want to know whether it's going up or down. So I define it as a classification problem.
Since this data set is a time-series data, I met many problems. I have read articles about these problems like I can't use k-fold cross validation since this is time series data. You can't ignore the order of the data.
my code is as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.linear_model import LinearRegression
from math import sqrt
from sklearn.svm import LinearSVC
from sklearn.svm import SVCenter code here
lag1 = pd.read_csv(#local file path, parse_dates=['Date'])
#Trend : if price going up: ture, otherwise false
lag1['Trend'] = lag1.XLF > lag1.XLF.shift()
train_size = round(len(lag1)*0.50)
train = lag1[0:train_size]
test = lag1[train_size:]
variable_to_use= ['rGDP','interest_rate','private_auto_insurance','M2_money_supply','VXX']
y_train = train['Trend']
X_train = train[variable_to_use]
y_test = test['Trend']
X_test = test[variable_to_use]
#SVM Lag1
this_C = 1.0
clf = SVC(kernel = 'linear', C=this_C).fit(X_train, y_train)
print('XLF Lag1 dataset')
print('Accuracy of Linear SVC classifier on training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of Linear SVC classifier on test set: {:.2f}'
.format(clf.score(X_test, y_test)))
#Check prediction results
clf.predict(X_test)
First of all, is my method right here : first generating a column of true and false? I am afraid the machine can't understand this column if I simply feed this column to it. Should I first perform a regression then compare the numeric result to generate a list of going up or down?
The accuracy on training set is very low at : 0.58 I am getting an array with all trues with clf.predict(X_test) which I don't know why I would get all trues.
And I don't know whether the resulting accuracy is calculated in which way: for example, I think my current accuracy only counts the number of true and false but ignoring the order of them? Since this is time-series data, ignoring the order is not right and gives me no information about predicting price movement. Let's say I have 40 examples in test set, and I got 20 Tures I would get 50% accuracy. But I guess the trues are no in the right position as it appears in the ground truth set. (Tell me if I am wrong)
I am also considering using Gradient Boosted Tree to do the classification, would it be better?

Some preprocessing of this data would probably be helpful. Step one might go something like:
df = pd.read_csv('YOURLOCALFILEPATH',header=0)
#more code than your method but labels rows as 0 or 1 and easy to output to new file for later reference
df['Date'] = pd.to_datetime(df['date'], unit='d')
df = df.set_index('Date')
df['compare'] = df['XLF'].shift(-1)
df['Label'] np.where(df['XLF']>df['compare'), 1, 0)
df.drop('compare', axis=1, inplace=True)
Step two can use one of sklearn's built in scalers, such as the MinMax scaler, to preprocess the data by scaling your feature inputs before feeding it into your model.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.