Supervised Time Series efficiency improvement - python

My data has been recorded hourly over the past 4 months. I am building a time series model and have tried several methods so far (ARIMA, LSTMs, Prophet), but they are quite slow for my task, since I have to run the model on thousands of time series in different locations. So I thought it might be interesting to transform it into a supervised problem and use regression.
I extracted 4 features from the univariate time series and its time index, namely: dayofweek, hour, daily average and hourly average. So at the moment I am using these 4 predictors, but I could possibly extract more (like beginning of the day, noon, etc.; if you have any other suggestions here, they are very welcome :) ).
I've used XGBoost for the regression and here are parts of the code:
# XGB
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Functions needed
def convert_dates(x):
    # Replace the 'date' column with day-of-week and hour features
    x['date'] = pd.to_datetime(x['date'])
    #x['month'] = x['date'].dt.month
    #x['year'] = x['date'].dt.year
    x['dayofweek'] = x['date'].dt.dayofweek
    x['hour'] = x['date'].dt.hour
    #x['week_no'] = pd.to_numeric(x['date'].index.strftime("%V"))
    x.pop('date')
    return x

def add_avg(x):
    # Average of y per weekday, and per (weekday, hour) pair
    x['daily_avg'] = x.groupby(['dayofweek'])['y'].transform('mean')
    x['hourly_avg'] = x.groupby(['dayofweek','hour'])['y'].transform('mean')
    #x['monthly_avg']=x.groupby(['month'])['y'].transform('mean')
    #x['weekly_avg']=x.groupby(['week_no'])['y'].transform('mean')
    return x

xgb_mape_r2_dict = {}
I then run a for loop in which I select a location and build the model for it. Here I split the data into a train and test part. I knew there might be problems due to the Easter holidays in my country last week because those are rare events so that is why I split the training and test data in that manner. So I actually consider the data from the beginning of the year up to two weeks ago as training data and the very next week after that as test data.
for j in range(10,20):
    # .copy() avoids writing into a view of df_all (SettingWithCopyWarning)
    data = df_all.loc[df_all['Cell_Id']==top_cells[j]].copy()
    data.drop(['Cell_Id', 'WDay'], axis = 1, inplace = True)
    data['date'] = data.index
    period = 168  # hours in one week
    data_train = data.iloc[:-2*period,:].copy()
    data_test = data.iloc[-2*period:-period,:].copy()
    data_train = convert_dates(data_train)
    data_test = convert_dates(data_test)
    data_train.columns = ['y', 'dayofweek', 'hour']
    data_test.columns = ['y', 'dayofweek', 'hour']
    data_train = add_avg(data_train)
    daily_avg = data_train.groupby(['dayofweek'])['y'].mean().reset_index()
    hourly_avg = data_train.groupby(['dayofweek', 'hour'])['y'].mean().reset_index()
Now, for the test data I add the past averages, namely the 7 daily averages from the past and the 168 hourly averages from the past as well. This is actually the part that takes the longest amount of time to run and I would like to improve its efficiency.
    # (still inside the loop over locations)
    value_dict = {}
    for k in range(168):
        value_dict[tuple(hourly_avg.iloc[k])[:2]] = tuple(hourly_avg.iloc[k])[2]
    data_test['daily_avg'] = 0
    data_test['hourly_avg'] = 0
    for i in range(len(data_test)):
        data_test['daily_avg'][i] = daily_avg['y'][data_test['dayofweek'][i]]
        data_test['hourly_avg'][i] = value_dict[(data_test['dayofweek'][i], data_test['hour'][i])]
My current run time is 30 seconds per iteration of the for loop, which is far too slow; the cause is the inefficient way I add the averages to the test data. I would really appreciate it if anyone could point out how I could implement this part faster.
I will also add the rest of my code and make some other observations as well:
    # (still inside the loop over locations)
    x_train = data_train.drop('y',axis=1)
    x_test = data_test.drop('y',axis=1)
    y_train = data_train['y']
    y_test = data_test['y']
    def XGBmodel(x_train,x_test,y_train,y_test):
        matrix_train = xgb.DMatrix(x_train,label=y_train)
        matrix_test = xgb.DMatrix(x_test,label=y_test)
        model=xgb.train(params={'objective':'reg:linear','eval_metric':'mae'}
                        ,dtrain=matrix_train,num_boost_round=500,
                        early_stopping_rounds=20,evals=[(matrix_test,'test')],)
        return model
    model=XGBmodel(x_train,x_test,y_train,y_test)
    #submission = pd.DataFrame(x_pred.pop('id'))
    y_pred = model.predict(xgb.DMatrix(x_test), ntree_limit = model.best_ntree_limit)
    #submission['sales']= y_pred
    y_pred = pd.DataFrame(y_pred)
    y_test = pd.DataFrame(y_test)
    y_test.reset_index(inplace = True, drop = True)
    compare_df = pd.concat([y_test, y_pred], axis = 1)
    compare_df.columns = ['Real', 'Predicted']
    compare_df.plot()
    mape = (np.abs((y_test['y'] - y_pred[0])/y_test['y']).mean())*100
    r2 = r2_score(y_test['y'], y_pred[0])
    xgb_mape_r2_dict[top_cells[j]] = [mape,r2]
I've used both R-squared and MAPE as accuracy measures, although I'm not sure MAPE is still appropriate now that I've transformed the time series problem into a regression problem. Any thoughts on this?
Thank you very much for your time and consideration. Any help is very much appreciated.
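On the MAPE question: the metric is computed the same way whether the predictions come from a time series model or from a regressor. Its practical weakness is that it is undefined when an actual value is 0 and gets very large when actuals are close to 0, which can happen with hourly data. Below is a minimal sketch, assuming hypothetical helper names `mape` and `mae` (they are not from the code above), that skips zero actuals and reports MAE alongside as a scale-dependent cross-check.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, skipping zero actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # MAPE is undefined where the actual value is 0
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)

def mae(y_true, y_pred):
    """Mean absolute error, in the units of y."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values, made up for illustration only
actual = [100.0, 200.0, 0.0, 50.0]
predicted = [110.0, 180.0, 5.0, 50.0]

print("MAPE (%):", mape(actual, predicted))
print("MAE:", mae(actual, predicted))
```

Note that skipping zero actuals hides the errors made on those rows from MAPE, which is one reason to look at MAE as well.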
Update: I have managed to fix the issue using pandas' merge. I first created two dataframes containing the daily averages and hourly averages from the training data and then merged these dataframes with the test data:
data_test = merge(data_test, daily_avg,['dayofweek'],'daily_avg')
data_test = merge(data_test, hourly_avg,['dayofweek','hour'],'hourly_avg')
data_test.columns = ['y', 'dayofweek', 'hour', 'daily_avg', 'hourly_avg']
where we used the merge function defined as:
def merge(x,y,col,col_name):
    x = pd.merge(x, y, how='left', on=None, left_on=col, right_on=col,
                 left_index=False, right_index=False, sort=True,
                 copy=True, indicator=False,validate=None)
    x = x.rename(columns={'sales':col_name})
    return x
I can now run the model for 2000 locations per hour on a laptop with decent results but I will try to improve it while keeping it fast. Thank you very much once again.
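For reference, the merge-based lookup described in the update can be written a bit more compactly by naming the average columns before merging, so no manual column renaming is needed afterwards. This is only a sketch on made-up toy data; the frames `train` and `test` stand in for `data_train` and `data_test` of one location and their values are assumptions.

```python
import pandas as pd

# Toy stand-ins for one location's data (hypothetical values)
train = pd.DataFrame({
    "dayofweek": [0, 0, 1, 1],
    "hour":      [8, 9, 8, 9],
    "y":         [10.0, 20.0, 30.0, 50.0],
})
test = pd.DataFrame({
    "dayofweek": [1, 0, 1],
    "hour":      [9, 8, 8],
    "y":         [55.0, 12.0, 28.0],
})

# Averages computed on the training data only, already carrying their final names
daily_avg = train.groupby("dayofweek")["y"].mean().rename("daily_avg").reset_index()
hourly_avg = (train.groupby(["dayofweek", "hour"])["y"].mean()
                   .rename("hourly_avg").reset_index())

# Left joins keep every test row, in the original row order
test = test.merge(daily_avg, on="dayofweek", how="left")
test = test.merge(hourly_avg, on=["dayofweek", "hour"], how="left")
print(test)
```

One thing to keep in mind: `merge` returns a frame with a fresh integer index, so if the datetime index is still needed it has to be kept as a column before merging.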

Related

Execution time of multiple forecasts in AutoARIMA in StatsForecast, Python. Index not read correctly

I want to run more than 10,000 forecasts on time series using AutoARIMA in StatsForecast. Execution time is very slow when I try to make more than one forecast. I have labelled my time series through the index. It seems StatsForecast doesn't separate the two time series.
filter_train = df_71["pyth_date"]<='202012'
filter_test = df_71["pyth_date"]>'202012'
train = df_71[filter_train].drop('pyth_dato', axis=1)
test = df_71[filter_test].drop('pyth_dato', axis=1)
display_side_by_side(train.filter(like = '2_2137', axis=0).tail(),
                     test.filter(like = '2_2137', axis=0).head(),
                     train.filter(like = '1_2137', axis=0).tail(),
                     test.filter(like = '1_2137', axis=0).head(),
                     titles=['Train','Test','Train','Test'])
I run the following StatsForecast code:
# the forecast model
fcst = StatsForecast(
    train,
    models = [(auto_arima,12)],
    freq = 'M',
    n_jobs = 4
)
Y_hat_df_intervals = fcst.forecast(h=12, level=(85, 95))
print(fcst.uids)
Index(['1_2137', '2_2137'], dtype='object')
Doing a single forecast takes 0.3 seconds. Doing both takes 16 seconds.
I get the forecast (here as a graph):
How do I get StatsForecast to separate the index correctly? Looking at the output, it seems to treat the data as one series.
Thank you very much!
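One way to rule out a data problem before looking at the library is to check in plain pandas that each id really has its own run of timestamps. The sketch below uses made-up toy data and assumes a long layout with the column names `unique_id`, `ds` and `y` (the names StatsForecast uses in its documentation, as far as I know); it does not call StatsForecast and does not explain the timing difference by itself.

```python
import pandas as pd

# Hypothetical toy data: two monthly series in long format
dates = pd.date_range("2020-01-01", periods=6, freq="MS")
train = pd.DataFrame({
    "unique_id": ["1_2137"] * 6 + ["2_2137"] * 6,
    "ds": list(dates) * 2,
    "y": [float(v) for v in range(12)],
})

# Row count, first and last timestamp per series
summary = train.groupby("unique_id")["ds"].agg(["count", "min", "max"])
print(summary)

# Repeated (id, timestamp) pairs would suggest two series are mixed together
n_dupes = int(train.duplicated(subset=["unique_id", "ds"]).sum())
print("duplicated (unique_id, ds) rows:", n_dupes)
```

If the counts and date ranges per id look right and there are no duplicates, the input is separated correctly and the slowdown has to come from somewhere else.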

How to predict a time series set with statsmodels Holt-Winters

I have a set of data from January 2012 to December 2014 that shows some trend and seasonality. I want to make a prediction for the next 3 years (from January 2015 to December 2017) by using the Holt-Winters method from statsmodels.
The data set is the following one:
date,Data
Jan-12,153046
Feb-12,161874
Mar-12,226134
Apr-12,171871
May-12,191416
Jun-12,230926
Jul-12,147518
Aug-12,107449
Sep-12,170645
Oct-12,176492
Nov-12,180005
Dec-12,193372
Jan-13,156846
Feb-13,168893
Mar-13,231103
Apr-13,187390
May-13,191702
Jun-13,252216
Jul-13,175392
Aug-13,150390
Sep-13,148750
Oct-13,173798
Nov-13,171611
Dec-13,165390
Jan-14,155079
Feb-14,172438
Mar-14,225818
Apr-14,188195
May-14,193948
Jun-14,230964
Jul-14,172225
Aug-14,129257
Sep-14,173443
Oct-14,188987
Nov-14,172731
Dec-14,211194
Which looks as follows:
I'm trying to build the Holt-Winters model so that I can first check how well it fits the past data (that is, a new graph where I can see whether my parameters predict the past well) and later forecast the next years. I made the in-sample prediction with the following code, but I'm not able to do the forecast.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Data loading
data = pd.read_csv('setpoints.csv', parse_dates=['date'], index_col=['date'])
# the original used an undefined name (datos_matric) here; `data` is assumed
df_data = pd.DataFrame(data, columns=['Data'])
df_data['Data'].index.freq = 'MS'
train, test = df_data['Data'], df_data['Data']
model = ExponentialSmoothing(train, trend='add', seasonal='add', seasonal_periods=12).fit()
period = ['Jan-12', 'Dec-14']
pred = model.predict(start=period[0], end=period[1])
df_data['Data'].plot(label='Train')
test.plot(label='Test')
pred.plot(label='Holt-Winters')
plt.legend(loc='best')
plt.show()
Which looks like:
Does anyone know how to forecast it?
I think there is a misconception here. You shouldn't use the same data for training and testing. The test data are data points which your model "has not seen yet"; this way you can test how well your model is performing. So I used the last three months of your data as the test set.
As for the prediction, we can use different start and end points.
Also notice I used mul as seasonal component, which performs better on your data:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# read in data and convert date column to MS frequency
# (`data` is the path or buffer holding the CSV from the question)
df = pd.read_csv(data)
df['date'] = pd.to_datetime(df['date'], format='%b-%y')
df = df.set_index('date').asfreq('MS')
# split data in train, test
train = df.loc[:'2014-09-01']
test = df.loc['2014-10-01':]
# train model and predict
model = ExponentialSmoothing(train, seasonal='mul', seasonal_periods=12).fit()
#model = ExponentialSmoothing(train, trend='add', seasonal='add', seasonal_periods=12).fit()
pred_test = model.predict(start='2014-10-01', end='2014-12-01')
pred_forecast = model.predict(start='2015-01-01', end='2017-12-01')
# plot data and prediction
df.plot(figsize=(15,9), label='Train')
pred_test.plot(label='Test')
pred_forecast.plot(label='Forecast')
plt.legend()
# save before show(); after show() the figure may already be cleared
plt.savefig('figure.png')
plt.show()

Hidden Markov Model (HMM) in python (hmmlearn) always predicting same value for time series

I have been attempting to use the hmmlearn package in python to build a model predicting values of a time series. I have based my code on this article, detailing how to use the package for a stock price time series.
After fitting the model on a large segment of the time series data and attempting to build a predictive model for the remainder, I run into an issue. The model always predicts the same outcome as being most probable - hmm.score returns the highest log-likelihood for the same outcome for every instance in the test series. Moreover, the outcome it predicts is the one closest to the mean value of the time series it was fitted on. It never deviates. I'm really not sure what to do. Is the model deficient, or am I doing something wrong?
The code that does the prediction is below. It appends all of the possible_outcomes (defined immediately below) to a sequence of test points in the time series (the last 100 in the test dataset) and evaluates the likelihood (using hmm.score):
possible_outcomes = np.linspace(-0.1, 0.1, 10)
latency_days = 10

def predict_close_price(time_index):
    open_price = actuals_test[time_index]
    predicted_frac_change = get_most_probable_outcome(time_index)
    return open_price * (1 + predicted_frac_change)

def get_most_probable_outcome(time_index):
    previous_data_start_index = max(0, time_index - latency_days)
    previous_data_end_index = max(0, time_index - 1)
    prev_start = int(previous_data_start_index)
    prev_end = int(previous_data_end_index)
    previous_data = test_data[prev_start: prev_end]
    outcome_score = []
    for possible_outcome in possible_outcomes:
        total_data = np.row_stack((previous_data, possible_outcome))
        outcome_score.append(hmm.score(total_data))
    most_probable_outcome = possible_outcomes[np.argmax(outcome_score)]
    print(most_probable_outcome)
    return most_probable_outcome

predicted_close_prices = []
actuals_vector = []
for time_index in range(len(actuals_test)-100,len(actuals_test)-1):
    predicted_close_prices.append(predict_close_price(time_index))
    actuals_vector.append(actuals_test[(time_index)])
I don't know if the issue is with the above, or with the actual creation of data and fitting of the model itself. That is done simplistically as follows:
import numpy as np
from hmmlearn.hmm import GaussianHMM

timeSeries.reverse()
# fractional change between consecutive points
difference_fracs = []
for i in range(0, len(timeSeries)-1):
    difference_frac = ((timeSeries[i+1] - timeSeries[i])/(timeSeries[i]))
    difference_fracs.append(difference_frac)
differences_array = np.array(difference_fracs)
differences_array = np.reshape(differences_array, (-1,1))
train_data_length = 2000
train_data = differences_array[:train_data_length,:]
test_data = differences_array[train_data_length:len(timeSeries),:]
actuals_test = timeSeries[train_data_length:]
n_hidden_states = 4
hmm = GaussianHMM(n_components = n_hidden_states)
hmm.fit(train_data)
I realize most of this is meaningless without the actual time series, which I am not allowed to share - though if someone has had similar issues in the past, I would love to hear your thoughts.

How to manage a large dataset for regression?

My question has to do with a very large dataset I'm running a regression on in Python. I have categorical data (gender, industry, region, salary groupings, etc.) that I would like to run a regression on with statsmodels. The whole dataframe comes out to about 83 columns wide after using pd.get_dummies() on roughly 5 million rows.
Code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from datetime import datetime as dt
#Start time
print('Start Time: ', dt.now())
#Variables
groups = ['sex', 'central_age', 'group_size', 'industry', 'region', 'salary']
base_cases = ['sex_Male', 'central_age_47.0', 'group_size_F. 100-249', 'salary_A. < 25',
              'industry_H. Manufacturing - heavy, steel etc.', 'region_C. Division 3: East North Central']
aggregates = ['death_amount_exposed', 'death_claim_amount']
#Read/ format data to transform data into categorical variables
df = pd.read_pickle(r'./Life_Mortality_Data.pkl')
df = df[df['death_amount_exposed']!=0]
df['central_age'] = df['central_age'].apply(str)
final = pd.get_dummies(df[groups]).join(df[aggregates]).astype(float)
final.drop(base_cases, axis=1, inplace=True)
#Prepare string of variables to regress on in next step
var_columns = list(final.columns)
for i in aggregates:
    var_columns.remove(i)
variables = '+'.join('Q("' + i + '")' for i in var_columns)
#Training and testing with Poisson model
print('Regression Time: ', dt.now(), '\n')
res1 = smf.glm(formula='death_claim_amount ~'+variables, data=final, offset=np.log(final['death_amount_exposed']), family=sm.families.Poisson(sm.families.links.log())).fit()
#Print stats summary, base cases, and multiplicative factors
print(res1.summary())
print('Base Cases:')
for case in base_cases:
    print(case)
print('\nParameters:\n', np.exp(res1.params))
#This takes the result of a statsmodels results table and transforms it into a dataframe
def results_summary_to_dataframe(results):
    pvals = results.pvalues
    coeff = results.params
    std_err = results.bse
    conf_lower = results.conf_int()[0]
    conf_higher = results.conf_int()[1]
    results_df = pd.DataFrame({"pvals":pvals,
                               "coeff":coeff,
                               "std_error":std_err,
                               "conf_lower":conf_lower,
                               "conf_higher":conf_higher
                               })
    #Reordering columns
    results_df = results_df[["coeff","std_error","pvals","conf_lower","conf_higher"]]
    return results_df
#Write data to excel
results_summary_to_dataframe(res1).to_excel(r'./All_Regression_Amounts_v1.xlsx')
#End time
print('\nEnd Time: ', dt.now())
The problem I'm having is that I run out of memory at the point where the statsmodels regression is run. I am using the 64-bit version of Python on Windows with 32 GB of memory, which I thought would be more than enough for this kind of computation, but I am not sure whether I'm failing to use all available memory or whether something is wrong with my code. I'm very new to this kind of analysis and to handling this much data. I'd really appreciate any help on what I can do to resolve this error.
When building linear models on datasets which are too large to hold in memory your best bet is to train the model with Stochastic Gradient Descent. This fits the model iteratively by gradient descent using repeated small samples of the data rather than all the data at once.
Scikit-learn has the SGDRegressor class (and SGDClassifier for classification problems), which fit linear models like this. You could take a look at those and see if they might work for you.
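To make the idea concrete, here is a NumPy-only sketch of mini-batch stochastic gradient descent for a plain least-squares linear model. It is not the scikit-learn implementation and it uses squared error rather than the Poisson likelihood from the question; the function name `sgd_linear_regression`, the synthetic data and the hyperparameters are all assumptions chosen for illustration. The data is held in memory here for brevity; in a real out-of-memory setting each batch would be read from disk instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: y = 2*x0 - 3*x1 + 1 + small noise
n_rows = 20_000
X = rng.normal(size=(n_rows, 2))
y = X @ np.array([2.0, -3.0]) + 1.0 + rng.normal(scale=0.1, size=n_rows)

def sgd_linear_regression(X, y, batch_size=256, epochs=5, lr=0.05, seed=0):
    """Fit y ~ X @ w + b by mini-batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(y))         # reshuffle the rows every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w + b - y[idx]  # only this batch is touched
            w -= lr * (X[idx].T @ residual) / len(idx)
            b -= lr * residual.mean()
    return w, b

w, b = sgd_linear_regression(X, y)
print("weights:", w, "intercept:", b)
```

Each update only needs one batch of rows, which is why the memory footprint stays small no matter how many rows the full dataset has.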

How to save predicted regression values inside a for loop?

I'm trying to use statsmodels to run separate logistic regressions for each "group" in a pandas dataframe and save the predicted probabilities for each observation (row). Each "group" represents about 2500 respondents or observations; I would like to get the predicted probability for each respondent, similar to how with SPSS you can "save" predicted probabilities when running a logistic regression.
I've read what others have attempted, but nothing seems to work. I'm using SPSS to check that the looping operation in Python is working correctly - the predicted probabilities should be the same (SPSS has a split function which makes this really easy).
import pandas as pd
import numpy as np
from statsmodels.formula.api import logit
df = pd.read_csv('test_data.csv')
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df)
    print(est_result.summary())
    df['pred'] = pred
The model summaries are correct (est_result.summary()) and match SPSS exactly. However, the saved predicted values do not match at all. I cannot seem to understand how to get it to work correctly.
Any advice is appreciated.
I solved it in a really un-pythonic kind of way. I hope someone can improve this code. The probabilities now match exactly what SPSS produces when you split the file by group, and run individual regressions by group.
results = []
for cat in df['Brand'].unique():
    df_slice = df[df.Brand == cat]
    est = logit('binary ~ var_1', df_slice)
    est_result = est.fit()
    pred = est_result.predict(df_slice)
    results.append(pred)
    # print(est_result.summary())
n = len(df['Brand'].unique())
r = pd.DataFrame(results) #put the results into a dataframe
rt = r.T #transpose the dataframe
r_small = rt[rt.columns[-n:]] #remove all but the last n columns, n = number of categories
r_new = r_small.bfill(axis=1).iloc[:, 0] #merge the n columns and remove the NaNs
r_new #show us
df['predicted'] = r_new # combine the r_new array with the original dataframe
df #show us.
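A possibly simpler way to put the per-group predictions back is to assign them through `.loc` using each slice's own index, so no transposing or back-filling is needed. The sketch below runs on made-up toy data and uses a hypothetical stand-in function `fit_and_predict` (the group mean of `binary`) in place of the statsmodels fit, so that it runs without statsmodels; in the real code its body would be `logit('binary ~ var_1', df_slice).fit().predict(df_slice)`, assuming that call returns a Series indexed like `df_slice`.

```python
import pandas as pd

# Hypothetical toy data with the column names used in the question
df = pd.DataFrame({
    "Brand":  ["A", "B", "A", "B", "A", "B"],
    "var_1":  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "binary": [1, 0, 0, 0, 1, 1],
})

def fit_and_predict(df_slice):
    """Stand-in for fitting a model on df_slice and predicting on it.

    Returns the group's mean outcome for every row, indexed like df_slice.
    """
    return pd.Series(df_slice["binary"].mean(), index=df_slice.index)

df["predicted"] = float("nan")
for cat, df_slice in df.groupby("Brand"):
    # Assigning via .loc with the slice's own index puts each prediction
    # back on the row it was computed for
    df.loc[df_slice.index, "predicted"] = fit_and_predict(df_slice)

print(df)
```

Because the assignment aligns on the index, the order in which the groups are processed does not matter.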
