I have a dataset with 913000 rows:
data image
First, let me explain the data.
It is sales data for 10 stores and 50 items from 2013-01-01 to 2017-12-31.
I understand why it has 913000 rows: the period includes a leap year, so it covers 1826 days, and 10 * 50 * 1826 = 913000.
Anyway, I made my training set:
training = TimeSeriesDataSet(
    train_df[train_df["time_idx"] <= training_cutoff],  # keep only rows up to the cutoff
    time_idx="time_idx",
    target="sales",
    group_ids=["store", "item"],  # list of column names identifying a time series
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    # Categorical variables that do not change over time (e.g. product length)
    static_categoricals=["store", "item"],
    time_varying_unknown_reals=["sales"],
)
Now, my first question: as I understand it, the length of a TimeSeriesDataSet reflects the data passed in, minus the prediction horizon (removed through training_cutoff) and minus max_encoder_length (needed to make a prediction). Is this right? If not, please tell me how it actually works.
Second question: similarly, this is the output of the code above:
output image
Why is the length 863500?
I calculated the length from what I know:
prediction horizon removed by training_cutoff: 20 * 50 * 10 = 10000
max_encoder_length needed for prediction: 60 * 50 * 10 = 30000
Thus 913000 - 40000 = 873000.
Where did the other 9500 go?
I did my best googling. Please tell me how this really works.
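One way to account for the 863500 is to count complete sliding windows per series instead of subtracting rows. The sketch below assumes that the dataset length is the number of full encoder-plus-prediction windows in each (store, item) series; this reproduces the reported number, but it is only the arithmetic, not the library's indexing code. The values 20 and 60 are inferred from the calculation in the question.

```python
# Assumed values, inferred from the arithmetic in the question.
n_series = 10 * 50            # one series per (store, item) pair
n_days = 1826                 # 2013-01-01 .. 2017-12-31, including 2016-02-29
max_prediction_length = 20
max_encoder_length = 60


def window_count(series_length, encoder_length, prediction_length):
    """Number of complete encoder+prediction windows that fit in one series."""
    return max(series_length - (encoder_length + prediction_length) + 1, 0)


rows_total = n_series * n_days
# Rows per series that survive the `time_idx <= training_cutoff` filter.
days_after_cutoff = n_days - max_prediction_length
samples = n_series * window_count(
    days_after_cutoff, max_encoder_length, max_prediction_length
)

print(rows_total)  # 913000
print(samples)     # 863500
```

Under this assumption each series loses 60 + 20 - 1 = 79 positions on top of the 20 rows cut by training_cutoff, so the gap to the expected 873000 is 19 * 500 = 9500.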
(And plot them all in the same figure).
I've been following the "Timeseries forecasting for weather prediction" code found here:
https://keras.io/examples/timeseries/timeseries_weather_forecasting/
The article says:
"The trained model above is now able to make predictions for 5 sets of values from validation set."
And it uses this code to get predictions and plot them:
def show_plot(plot_data, delta, title):
    labels = ["History", "True Future", "Model Prediction"]
    marker = [".-", "rx", "go"]
    time_steps = list(range(-(plot_data[0].shape[0]), 0))
    if delta:
        future = delta
    else:
        future = 0
    plt.title(title)
    for i, val in enumerate(plot_data):
        if i:
            plt.plot(future, plot_data[i], marker[i], markersize=10, label=labels[i])
        else:
            plt.plot(time_steps, plot_data[i].flatten(), marker[i], label=labels[i])
    plt.legend()
    plt.xlim([time_steps[0], (future + 5) * 2])
    plt.xlabel("Time-Step")
    plt.show()
    return

for x, y in dataset_val.take(5):
    show_plot(
        [x[0][:, 1].numpy(), y[0].numpy(), model.predict(x)[0]],
        12,
        "Single Step Prediction",
    )
On my computer, to downsample the series to 1 hour, I changed the frequency of the input data directly and use "sampling_rate=1" instead of "sampling_rate=6".
Now, assuming the model was fitted properly, what do I need to modify if I want to get predictions for the next 500 intervals instead of just 5?
dataset_val.take(500)
Or something else?
The configuration at the beginning also says:
split_fraction = 0.715
train_split = int(split_fraction * int(df.shape[0]))
step = 6
past = 720
future = 72
learning_rate = 0.001
batch_size = 256
epochs = 10
What values do I need to use now for past and future (if my data has a frequency of 1 hour and I want to predict 500 points forward)?
future = 500
past = ? (it seems to be the number of timestamps taken backwards for training)
What about delta? It's fixed to 12, but it seems to be the value for future.
According to the source (https://github.com/keras-team/keras-io/blob/master/examples/timeseries/timeseries_weather_forecasting.py), here is the model:
inputs = keras.layers.Input(shape=(inputs.shape[1], inputs.shape[2]))
lstm_out = keras.layers.LSTM(32)(inputs)
outputs = keras.layers.Dense(1)(lstm_out)
model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), loss="mse")
model.summary()
As you can see, it uses a 1-unit Dense as the last layer. If you want, for example, 2 predictions, you should use 2 units in the last Dense layer and be careful about the shapes of (X_train, Y_train) and (X_Validation, Y_Validation): by default the expected Y has 1 unit, so you will probably need to convert it.
Simple example:
Default X:1,Y:1 changes to X:1,Y:1,2
The Y data should probably also be shifted by N, where N is exactly the number of units in the last (Dense) layer.
If you just want to predict over a bigger time frame, you can convert your whole dataset to the bigger one.
E.g. our default time frame and data (weather) is hourly. We can convert the data to daily (x24) and then predict daily, or do the same thing again (x30) and predict monthly, and so on.
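To make the point about Y shapes concrete, here is a minimal NumPy sketch of how inputs and multi-step targets could be cut from a single series so that Y has one column per unit of the last Dense layer. The helper name make_windows and the toy series are hypothetical and are not part of the Keras example.

```python
import numpy as np


def make_windows(series, past, future):
    """Return (X, Y): X[i] holds `past` consecutive values and
    Y[i] holds the `future` values that come right after them."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - past - future + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested past/future")
    X = np.stack([series[i : i + past] for i in range(n_samples)])
    Y = np.stack([series[i + past : i + past + future] for i in range(n_samples)])
    # Add a trailing feature axis so X matches an LSTM input of (samples, past, 1).
    return X[..., np.newaxis], Y


hourly = np.arange(10.0)  # toy series: 0, 1, ..., 9
X, Y = make_windows(hourly, past=4, future=2)
print(X.shape, Y.shape)  # (5, 4, 1) (5, 2)
```

With future=2 the targets have shape (samples, 2), which is what a final Dense(2) layer would be trained against. For 500 steps ahead the same layout would give Y of shape (samples, 500).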
import datetime

import matplotlib.pyplot as plt
import pandas as pd
from pvlib.forecast import GFS
from pvlib.modelchain import ModelChain
from pvlib.pvsystem import PVSystem

# Initialize the model here
model = GFS(resolution='half', set_type='latest')
# The location where I want to forecast the irradiance, and its time zone
latitude, longitude, tz = 15.134677754177943, 120.63806622424912, 'Asia/Manila'
start = pd.Timestamp(datetime.date.today(), tz=tz)
end = start + pd.Timedelta(days=7)
# Pull the data from the GFS
raw_data = model.get_processed_data(latitude, longitude, start, end)
raw_data = pd.DataFrame(raw_data)
data = raw_data
# Description of the PV system we are using
# (module, inverter and temperature_model_parameters are not shown here)
system = PVSystem(surface_tilt=10, surface_azimuth=180, albedo=0.2,
                  module_type='glass_polymer',
                  module=module, module_parameters=module,
                  temperature_model_parameters=temperature_model_parameters,
                  modules_per_string=24, strings_per_inverter=32,
                  inverter=inverter, inverter_parameters=inverter,
                  racking_model='insulated_back')
# Using the ModelChain
mc = ModelChain(system, model.location, orientation_strategy=None,
                aoi_model='no_loss', spectral_model='no_loss',
                temp_model='sapm', losses_model='no_loss')
mc.run_model(data)
mc.total_irrad.plot()
plt.ylabel('Plane of array irradiance ($W/m^2$)')
plt.legend(loc='best')
Here is the picture of it
I have been getting the same irradiance values for days now, so I believe something is wrong. I would expect the values to differ at least somewhat from day to day.
Forecasting Irradiance
I think the reason the days all look the same is that the forecast data predicts those days to be consistently overcast, so there's not necessarily anything "wrong" with the values being very similar across days -- it's just several cloudy days in a row. Take a look at raw_data['total_clouds'] and see how little variation there is for this forecast (nearly always 100% cloud cover). Also note that if you print the actual values of mc.total_irrad, you'll see that there is some minor variation day-to-day that is too small to appear on the plot.
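The check suggested above can be sketched with pandas. The data below is synthetic and only stands in for the forecast: the column names total_clouds and poa_global, the 80% attenuation factor and the sine-shaped clear-sky curve are all assumptions made for illustration, not pvlib output.

```python
import numpy as np
import pandas as pd

# Synthetic week of hourly data standing in for raw_data and mc.total_irrad.
hours = np.arange(7 * 24)
index = pd.Timestamp("2021-01-01") + pd.to_timedelta(hours, unit="h")
rng = np.random.default_rng(0)

# Toy clear-sky curve: zero at night, peaking at noon.
clear_sky = np.clip(np.sin((hours % 24 - 6) / 12 * np.pi), 0, None) * 1000.0
# Nearly constant overcast conditions, between 98 % and 100 % cloud cover.
total_clouds = 100 - 2 * rng.random(hours.size)
# Arbitrary toy attenuation, only to make irradiance depend on cloud cover.
poa_global = clear_sky * (1 - 0.8 * total_clouds / 100)

df = pd.DataFrame(
    {"total_clouds": total_clouds, "poa_global": poa_global}, index=index
)

# How much does cloud cover actually vary over the forecast?
print(df["total_clouds"].describe())

# Daily totals expose small day-to-day differences that a plot hides.
daily_total = df["poa_global"].groupby(df.index.date).sum()
spread = (daily_total.max() - daily_total.min()) / daily_total.mean()
print(daily_total)
print("relative spread between days:", spread)
```

If the same two checks on the real forecast show near-constant cloud cover and a small but nonzero spread, the similar-looking days are a property of the forecast, not a bug in the model chain.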
I have been attempting to use the hmmlearn package in python to build a model predicting values of a time series. I have based my code on this article, detailing how to use the package for a stock price time series.
After fitting the model on a large segment of the time series data and attempting to build a predictive model for the remainder, I run into an issue. The model always predicts the same outcome as being most probable - hmm.score returns the highest log-likelihood for the same outcome for every instance in the test series. Moreover, the outcome it predicts is the one closest to the mean value of the time series it was fitted on. It never deviates. I'm really not sure what to do. Is the model deficient, or am I doing something wrong?
The code that does the prediction is below. It appends all of the possible_outcomes (defined immediately below) to a sequence of test points in the time series (the last 100 in the test dataset) and evaluates the likelihood (using hmm.score):
possible_outcomes = np.linspace(-0.1, 0.1, 10)
latency_days = 10

def predict_close_price(time_index):
    open_price = actuals_test[time_index]
    predicted_frac_change = get_most_probable_outcome(time_index)
    return open_price * (1 + predicted_frac_change)

def get_most_probable_outcome(time_index):
    previous_data_start_index = max(0, time_index - latency_days)
    previous_data_end_index = max(0, time_index - 1)
    prev_start = int(previous_data_start_index)
    prev_end = int(previous_data_end_index)
    previous_data = test_data[prev_start:prev_end]
    outcome_score = []
    for possible_outcome in possible_outcomes:
        total_data = np.row_stack((previous_data, possible_outcome))
        outcome_score.append(hmm.score(total_data))
    most_probable_outcome = possible_outcomes[np.argmax(outcome_score)]
    print(most_probable_outcome)
    return most_probable_outcome

predicted_close_prices = []
actuals_vector = []
for time_index in range(len(actuals_test) - 100, len(actuals_test) - 1):
    predicted_close_prices.append(predict_close_price(time_index))
    actuals_vector.append(actuals_test[time_index])
I don't know if the issue is with the above, or with the actual creation of data and fitting of the model itself. That is done simplistically as follows:
import numpy as np
from hmmlearn.hmm import GaussianHMM

timeSeries.reverse()
# Fractional change between consecutive points of the series
difference_fracs = []
for i in range(0, len(timeSeries) - 1):
    difference_frac = (timeSeries[i + 1] - timeSeries[i]) / timeSeries[i]
    difference_fracs.append(difference_frac)
differences_array = np.array(difference_fracs)
differences_array = np.reshape(differences_array, (-1, 1))
train_data_length = 2000
train_data = differences_array[:train_data_length, :]
test_data = differences_array[train_data_length:len(timeSeries), :]
actuals_test = timeSeries[train_data_length:]
n_hidden_states = 4
hmm = GaussianHMM(n_components=n_hidden_states)
hmm.fit(train_data)
I realize most of this is meaningless without the actual time series, which I am not allowed to share - though if someone has had similar issues in the past, I would love to hear your thoughts.
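Since the real series cannot be shared, the data-preparation step can at least be checked on a toy input. The sketch below computes the same fractional changes as the loop above, but with NumPy's diff. The helper name fractional_changes and the toy prices are made up for illustration.

```python
import numpy as np


def fractional_changes(prices):
    """Fractional change between consecutive points, as a column vector
    of shape (n - 1, 1), the 2-D layout used for train_data and test_data."""
    prices = np.asarray(prices, dtype=float)
    return (np.diff(prices) / prices[:-1]).reshape(-1, 1)


prices = [100.0, 110.0, 99.0, 99.0]
changes = fractional_changes(prices)
print(changes.ravel())  # roughly 0.1, -0.1 and 0.0
print(changes.shape)    # (3, 1)
```

Running the same function on the real series and comparing the range of the result with possible_outcomes is a quick sanity check that the candidate grid covers the values the model was fitted on.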
The data that I have is recorded hourly over the past 4 months. I am building a time series model and have tried several methods so far (ARIMA, LSTMs, Prophet), but they can be quite slow for my task, since I have to run the model on thousands of time series in different locations. So I thought it might be interesting to transform it into a supervised problem and use regression.
I extracted 4 features from the univariate time series and its time index, namely: dayofweek, hour, daily average and hourly average. So at the moment I am using these 4 predictors, but I could possibly extract more (like beginning of the day, noon, etc.; any other suggestions here are very welcome :) ).
I've used XGBoost for the regression and here are parts of the code:
# XGB
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Functions needed
def convert_dates(x):
    x['date'] = pd.to_datetime(x['date'])
    #x['month'] = x['date'].dt.month
    #x['year'] = x['date'].dt.year
    x['dayofweek'] = x['date'].dt.dayofweek
    x['hour'] = x['date'].dt.hour
    #x['week_no'] = pd.to_numeric(x['date'].index.strftime("%V"))
    x.pop('date')
    return x

def add_avg(x):
    x['daily_avg'] = x.groupby(['dayofweek'])['y'].transform('mean')
    x['hourly_avg'] = x.groupby(['dayofweek', 'hour'])['y'].transform('mean')
    #x['monthly_avg']=x.groupby(['month'])['y'].transform('mean')
    #x['weekly_avg']=x.groupby(['week_no'])['y'].transform('mean')
    return x

xgb_mape_r2_dict = {}
I then run a for loop in which I select a location and build the model for it. Here I split the data into a train and test part. I knew there might be problems due to the Easter holidays in my country last week because those are rare events so that is why I split the training and test data in that manner. So I actually consider the data from the beginning of the year up to two weeks ago as training data and the very next week after that as test data.
for j in range(10, 20):
    # Work on copies so the in-place edits below do not touch df_all
    data = df_all.loc[df_all['Cell_Id'] == top_cells[j]].copy()
    data.drop(['Cell_Id', 'WDay'], axis=1, inplace=True)
    data['date'] = data.index
    period = 168  # hours in one week
    data_train = data.iloc[:-2*period, :].copy()
    data_test = data.iloc[-2*period:-period, :].copy()
    data_train = convert_dates(data_train)
    data_test = convert_dates(data_test)
    data_train.columns = ['y', 'dayofweek', 'hour']
    data_test.columns = ['y', 'dayofweek', 'hour']
    data_train = add_avg(data_train)
    daily_avg = data_train.groupby(['dayofweek'])['y'].mean().reset_index()
    hourly_avg = data_train.groupby(['dayofweek', 'hour'])['y'].mean().reset_index()
Now, for the test data I add the past averages, namely the 7 daily averages from the past and the 168 hourly averages from the past as well. This is actually the part that takes the longest amount of time to run and I would like to improve its efficiency.
# (still inside the loop over j)
value_dict = {}
for k in range(168):
    value_dict[tuple(hourly_avg.iloc[k])[:2]] = tuple(hourly_avg.iloc[k])[2]
data_test['daily_avg'] = 0
data_test['hourly_avg'] = 0
for i in range(len(data_test)):
    data_test['daily_avg'][i] = daily_avg['y'][data_test['dayofweek'][i]]
    data_test['hourly_avg'][i] = value_dict[(data_test['dayofweek'][i], data_test['hour'][i])]
My current run time is about 30 seconds for every iteration of the for loop, which is way too slow, and the cause is the poor way I add the averages to the test data. I would really appreciate it if anyone could point out how I could implement this bit faster.
I will also add the rest of my code and make some other observations as well:
x_train = data_train.drop('y',axis=1)
x_test = data_test.drop('y',axis=1)
y_train = data_train['y']
y_test = data_test['y']
def XGBmodel(x_train, x_test, y_train, y_test):
    matrix_train = xgb.DMatrix(x_train, label=y_train)
    matrix_test = xgb.DMatrix(x_test, label=y_test)
    # 'reg:linear' is the older name of 'reg:squarederror' in recent XGBoost releases
    model = xgb.train(params={'objective': 'reg:linear', 'eval_metric': 'mae'},
                      dtrain=matrix_train, num_boost_round=500,
                      early_stopping_rounds=20, evals=[(matrix_test, 'test')])
    return model
model=XGBmodel(x_train,x_test,y_train,y_test)
#submission = pd.DataFrame(x_pred.pop('id'))
y_pred = model.predict(xgb.DMatrix(x_test), ntree_limit = model.best_ntree_limit)
#submission['sales']= y_pred
y_pred = pd.DataFrame(y_pred)
y_test = pd.DataFrame(y_test)
y_test.reset_index(inplace = True, drop = True)
compare_df = pd.concat([y_test, y_pred], axis = 1)
compare_df.columns = ['Real', 'Predicted']
compare_df.plot()
mape = (np.abs((y_test['y'] - y_pred[0])/y_test['y']).mean())*100
r2 = r2_score(y_test['y'], y_pred[0])
xgb_mape_r2_dict[top_cells[j]] = [mape,r2]
I've used both R-squared and MAPE as accuracy measures although I don't think MAPE is indicated anymore since I've transformed the time series problem into a regression problem. Any thoughts on your part on this subject?
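For reference, both measures depend only on the actual and predicted values, not on how the model produced them. The sketch below spells out the two formulas in NumPy. Skipping zero actual values in MAPE is a choice made for this example, and the toy numbers are invented.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.
    Points where the actual value is 0 are skipped, since the ratio is undefined there."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    return np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
print(mape(y_true, y_pred))       # about 6.67
print(r_squared(y_true, y_pred))  # about 0.989
```

One practical caveat with MAPE on hourly data is that hours with very small actual values dominate the average, so it is worth checking how many such hours each series contains.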
Thank you very much for your time and consideration. Any help is very much appreciated.
Update: I have managed to fix the issue using pandas' merge. I first created two dataframes containing the daily averages and hourly averages from the training data, and then merged these dataframes with the test data:
data_test = merge(data_test, daily_avg, ['dayofweek'], 'daily_avg')
data_test = merge(data_test, hourly_avg, ['dayofweek', 'hour'], 'hourly_avg')
data_test.columns = ['y', 'dayofweek', 'hour', 'daily_avg', 'hourly_avg']
where the merge function is defined as:
def merge(x, y, col, col_name):
    x = pd.merge(x, y, how='left', on=None, left_on=col, right_on=col,
                 left_index=False, right_index=False, sort=True,
                 copy=True, indicator=False, validate=None)
    x = x.rename(columns={'sales': col_name})
    return x
I can now run the model for 2000 locations per hour on a laptop with decent results but I will try to improve it while keeping it fast. Thank you very much once again.
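A compact, self-contained version of the same idea is sketched below on invented toy data: the averages are computed on the training frame, renamed up front, and then left-joined onto the test frame. The column names follow the post; the numbers are made up.

```python
import pandas as pd

# Toy stand-ins for data_train and data_test after convert_dates().
train = pd.DataFrame({
    "dayofweek": [0, 0, 1, 1],
    "hour":      [0, 1, 0, 1],
    "y":         [10.0, 20.0, 30.0, 50.0],
})
test = pd.DataFrame({
    "dayofweek": [1, 0, 1],
    "hour":      [1, 0, 0],
    "y":         [48.0, 11.0, 33.0],
})

# Averages learned from the training data only, with their final column names.
daily_avg = (
    train.groupby("dayofweek", as_index=False)["y"].mean()
         .rename(columns={"y": "daily_avg"})
)
hourly_avg = (
    train.groupby(["dayofweek", "hour"], as_index=False)["y"].mean()
         .rename(columns={"y": "hourly_avg"})
)

# Left joins keep every test row, in its original order.
test = test.merge(daily_avg, on="dayofweek", how="left")
test = test.merge(hourly_avg, on=["dayofweek", "hour"], how="left")
print(test)
```

Renaming the average columns before merging avoids the y_x / y_y suffixes and the positional column assignment afterwards. Leaving sort at its default keeps the test rows in time order.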
I want to use Keras LSTM (or similar) to forecast energy consumption of businesses based on:
1. historical consumption data
2. some numerical features (e.g. total yearly consumption)
3. some categorical features (e.g. business type)
This is a cold-start problem because, while 2. and 3. are present both for the training and the test set, 1. is not, i.e. I am trying to predict consumption of new businesses for which there is no historical data.
My question is: how should I structure the dataframe and the RNN to accommodate both 2. (numerical features) and 3. (categorical data) as my predictors?
Here is a made-up example of the data:
# generate x (predictors dataframe)
import pandas as pd
x = pd.DataFrame({'ID':[0,1,2,3],'business_type':[0,2,2,1], 'contract_type':[0,0,2,1], 'yearly_consumption':[1000,200,300,900], 'n_sites':[9,1,2,5]})
print(x)
# note: the first 2 are categorical and the second 2 are numerical
   ID  business_type  contract_type  yearly_consumption  n_sites
0   0              0              0                1000        9
1   1              2              0                 200        1
2   2              2              2                 300        2
3   3              1              1                 900        5
# generate y (timeseries data)
import numpy as np
time_series = []
data_length = 6
period = 1
for k in range(4):
    level = 10 * np.random.rand()
    seas_amplitude = (0.1 + 0.3*np.random.rand()) * level
    sig = 0.05 * level  # noise parameter (constant in time)
    time_ticks = np.array(range(data_length))
    source = level + seas_amplitude*np.sin(time_ticks*(2*np.pi)/period)
    noise = sig*np.random.randn(data_length)
    data = source + noise
    time_series.append(pd.Series(data=data, index=['t0','t1','t2','t3','t4','t5']))
y = pd.DataFrame(time_series)
print(y)
         t0        t1        t2        t3        t4        t5
0  9.611984  8.453227  8.153665  8.801166  8.208920  8.399184
1  2.139507  2.118636  2.160479  2.216049  1.943978  2.008407
2  0.131757  0.133401  0.135168  0.141212  0.136568  0.123730
3  5.990021  6.219840  6.637837  6.745850  6.648507  5.968953
# note: the real data has thousands of data points (one year with half hourly frequency)
# note: the first row belongs to ID = 0 in x, the second row to ID = 1 etc.
I have looked extensively online, and there seems to be no example where categorical, numerical and time-series data are all used together. For a simple forecasting problem, this post explains that in order to learn from the previous time period, the LSTM must be fed something like this:
# process df for a classical forecasting problem for first ID
y_lstm = pd.DataFrame(y.iloc[0,:])
y_lstm.columns = ['t']
y_lstm['t-1'] = y_lstm['t'].shift()
print(y_lstm)
           t       t-1
t0  9.611984       NaN
t1  8.453227  9.611984
t2  8.153665  8.453227
t3  8.801166  8.153665
t4  8.208920  8.801166
t5  8.399184  8.208920
# note: t-1 represents the previous time point
However, while this works for a single timeseries, it is unclear how to structure the dataset when there are multiple timeseries, and how to include the rest of the predictors in this structure.
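One possible layout for several series, shown only as a sketch and not as the established way to do it, is a long table with one row per (ID, time step). The lag is computed within each ID and the static predictors are joined on ID. The toy values are shortened from the example above and the name long_df is made up.

```python
import pandas as pd

# Shortened versions of x (static predictors) and y (one row per ID).
x = pd.DataFrame({"ID": [0, 1], "business_type": [0, 2], "yearly_consumption": [1000, 200]})
y = pd.DataFrame([[9.6, 8.5, 8.2], [2.1, 2.2, 2.0]], columns=["t0", "t1", "t2"])

# Wide -> long: one row per (ID, time step), keeping time order within each ID.
long_df = (
    y.rename_axis("ID")
     .reset_index()
     .melt(id_vars="ID", var_name="time", value_name="t")
     .sort_values("ID", kind="stable")
     .reset_index(drop=True)
)

# Lag computed per ID, so values never leak from one business to the next.
long_df["t-1"] = long_df.groupby("ID")["t"].shift()

# Attach the categorical and numerical predictors to every time step.
long_df = long_df.merge(x, on="ID", how="left")
print(long_df)
```

This covers the training side only. In the cold-start case the t-1 column does not exist for new businesses, so a model that has to predict from x alone cannot rely on it at prediction time.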
This post talks about how to include both categorical and numerical variables through embedding, but it does not fit my problem, where time-series data also has to be included. This post discusses the choice between one-hot encoding and embedding without any example code and does not answer my question.
Could anyone please provide example code showing how to structure the data appropriately for the RNN and/or what a simple LSTM structure in Keras would look like? Note that this structure should be able to use the time-series data for training, but not for predictions (i.e. only x and not y is available for the test set).
Thank you very much in advance.