Get best parameters for time series without losing data - python

I'm trying to do demand sensing for a dataset. Presently I have 157 weeks of data(~3years) and I have to predict next month(8 weeks).In the training dataset, I'm using 149weeks as a train and the last 8 weeks as Val to get the best hyperparameters. But I have observed that in the pred result, there's a huge gap in wmapes between Val and pred. I'm not sure if im overfitting because Val wmape is good.
the aim is to get best parameters such that the result pred will good for last month(last 4 weeks/8weeks).
note: there is a gap in train and pred i.e. if the train is till 31st jan22, pred will start from 1st mar22.
How can I overcome this problem?
Details: dataset: timeseries , algo: TCNmodel(darts lib),lang:python.

How you should split the data depends on some factors:
If you have a seasonal influence over a year, you can take a complete year for validation and two years for training.
If your data can be predicted from the last n-weeks, you can take some random n-week splits from the full dataset.
Whats more important here is that I think there's an error in your training pipeline. You should have a sliding window over n-weeks over the full training data and always predict the next 8 weeks from every n-week sequence.

Related

ARIMA Model Predicting a straight line for my temperature data

I have a temperature dataset of 427 days(daily temperature data) I am training the ARIMA model for 360 days and trying to predict the rest of the 67 days data and comparing the results. While fitting the model in test data I am just getting a straight line as predictions, Am i doing something wrong? `
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(train['max'],order=(1,1,2),)
results = model.fit()
results.summary()
start = len(train)
end = len(train) + len(test) -1
predictions= pd.DataFrame()
predictions['pred'] = results.predict(start=start, end=end, typ='levels').rename('ARIMA(1,1,1) Predictions')
Your ARIMA model uses the last two observations to make a prediction, that means the prediction for t(361) is based on true values of t(360) and t(359). The prediction of t(362) is based on the already predicted t(361) and the true t(360). The prediction for t(363) is based on two predicted values t(361) and t(360). The prediction is based on previous predictions, and that means that forecasting errors will negatively impact new predictions. The prediction for t(400) is based on predictions that are based on predictions that are based on predictions etc. Imagine your prediction deviates only 1% for each time step, the forecasting error will become bigger and bigger the more time steps you try to predict. In such cases the predictions often form a straight line at some point.
If you use and ARIMA(p, d, q) model, then you can forecast a maximum of q steps into the future. Predicting 67 steps into the future is a very far horizon and ARIMA is most likely not able to do that. Instead, try to predict only the next single or few time steps.

Pseudo Labelling on Text Classification Python

I'm not good at machine learning. Can someone tell me how to doing text classification with pseudo labeling in python? I never know the right implementation, I have searched everywhere in internet, but I give up as found anything :'( I just found the implementation for numeric datasets, but I found no implementation for text classification (vectorized text).. So I wrote this syntax, but I don't know whether my code is correct or not. Am I doing wrong? Please help me guys, I really need your help.. :'(
This is my datasets if you wanna try. I want to classify 'Label' from 'Content'
My steps are:
Split data 0.75 unlabeled, 0.25 labeled
From 0.25 labeld I split: 0.75 train labeled, and 0.25 test labeled
Make vectorizer for train, test and unlabeled datasets
Build first model from train labeled, then labelling the unlabeled datasets
Concatting train labeled data with prediction of unlabeled that have >0.99 (pseudolabeled), and make the second model
Remove pseudolabeled from unabeled datasets
Predict the remaining unlabeled from second model, then iterate step 3 until the probability of predicted pseudolabeled <0.99.
This is my code:
Performing pseudo labelling on text classification
from sklearn.naive_bayes import MultinomialNB
# Initiate iteration counter
iterations = 0
# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []
# Assign value to initiate while loop
high_prob = [1]
# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:
# Set the vector transformer (from data train)
columnTransformer = ColumnTransformer([
('tfidf',TfidfVectorizer(stop_words=None, max_features=100000),
'Content')
],remainder='drop')
def transforms(series):
before_vect = pd.DataFrame({'Content':series})
vector_transformer = columnTransformer.fit(pd.DataFrame({'Content':X_train}))
return vector_transformer.transform(before_vect)
X_train_df = transforms(X_train);
X_test_df = transforms(X_test);
X_unlabeled_df = transforms(X_unlabeled)
# Fit classifier and make train/test predictions
nb = MultinomialNB()
nb.fit(X_train_df, y_train)
y_hat_train = nb.predict(X_train_df)
y_hat_test = nb.predict(X_test_df)
# Calculate and print iteration # and f1 scores, and store f1 scores
train_f1 = f1_score(y_train, y_hat_train)
test_f1 = f1_score(y_test, y_hat_test)
print(f"Iteration {iterations}")
print(f"Train f1: {train_f1}")
print(f"Test f1: {test_f1}")
train_f1s.append(train_f1)
test_f1s.append(test_f1)
# Generate predictions and probabilities for unlabeled data
print(f"Now predicting labels for unlabeled data...")
pred_probs = nb.predict_proba(X_unlabeled_df)
preds = nb.predict(X_unlabeled_df)
prob_0 = pred_probs[:,0]
prob_1 = pred_probs[:,1]
# Store predictions and probabilities in dataframe
df_pred_prob = pd.DataFrame([])
df_pred_prob['preds'] = preds
df_pred_prob['prob_0'] = prob_0
df_pred_prob['prob_1'] = prob_1
df_pred_prob.index = X_unlabeled.index
# Separate predictions with > 99% probability
high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
axis=0)
print(f"{len(high_prob)} high-probability predictions added to training data.")
pseudo_labels.append(len(high_prob))
# Add pseudo-labeled data to training data
X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
y_train = pd.concat([y_train, high_prob.preds])
# Drop pseudo-labeled instances from unlabeled data
X_unlabeled = X_unlabeled.drop(index=high_prob.index)
print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")
# Update iteration counter
iterations += 1
I think I'm doing something wrong.. Because when I see the f1 scores it is decreasing. Please help me guys :'( I'm stressed.
f1 scores image
=================EDIT=================
So I've search on journal, then I think that I've got misunderstanding about the concept of data splitting in pseudo-labelling.
I initially thought that, the steps starts from splitting the data into labeled and unlabeled data, then from that labeled data, it was splitted into train and test.
But after surfing and searching, I found in this journal that my steps is incorrect. This journal says that the steps pseudo-labeling should start from splitting the data into train and test sets at first, and then from that train sets, data is splited to labeled and unlabeled datasets.
According to that journal, it reach the best result when splitting data into 90% of train sets and 10% of test sets. Then, from that 90% train set, it is splitted into 20% labeled data and 80% unlabeled data sets. This journal trying evidence range from 0.7 till 0.9 as boundary to drop the pseudo labeling, and on that proportion of splitting, the best evidence threshold value is 0.74. So I fix my steps with that new proportion and 0.74 threshold, and I finally got the F1 scores is increasing. Here are my steps:
Split data 0.9 train, 0.1 test sets (I labeled the test sets, so I can measure the f1 scores)
From 0.9 train, I split: 0.2 labeled, and 0.8 unlabeled data
Making vectorizer for X value of labeled train, test and unlabeled training datasets
Build first model from labeled train, then labeling the unlabeled training datasets. Then measure the F-1 scores according to the test sets (that already labeled).
Concatting train labeled data with prediction of unlabeled that have probability > 0.74 (threshold based on journal). We call this new data as pseudo-labelled, likened to the actual label), and make the second model from new train data sets.
Remove selected pseudo-labelled from unlabeled datasets
Use the second model to predict the remaining of unlabeled data, then iterate step 3 until there are no probability of predicted pseudo-labelled>0.74
So the last model is the final.
My syntax is still the same, I just changing the split proportion and I finally got my f1 scores increasing through 4 iterations: my new f1 scores.
Am I doing something right? Thank you for all of your attention guys.. So much thank you..
I'm not good at machine learning.
Overall I would say that you are quite good at Machine Learning: semi-supervised learning is an advanced type of problem and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (I don't have time to analyze the code in detail sorry). A few comments:
One thing which might be improvable is the 0.74 threshold: this value certainly depends on the data, so you could do your own experiment by trying different threshold values and selecting the one which works best with your data.
Preferably it would be better to keep a final test set aside and use a separate validation set during the iterations. This would avoid the risk of data leakage.
I'm not sure about the stop condition for the loop. It might be ok but it might be worth trying other options:
Simply iterate a fixed number of times (for instance 10 times).
The stop condition could be based on "no more F1-score improvement" (i.e. stabilization of the performance), but it's a bit more advanced.
It's pretty good anyway, my comments are just ideas if you want to improve further. Note that it's been a long time since I've work with semi-supervised, I'm not sure I remember everything very well ;)

Multivariate time series forecasting with 3 months dataset

I have 3 months of data (each row corresponding to each day) generated and I want to perform a multivariate time series analysis for the same :
the columns that are available are -
Date Capacity_booked Total_Bookings Total_Searches %Variation
Each Date has 1 entry in the dataset and has 3 months of data and I want to fit a multivariate time series model to forecast other variables as well.
So far, this was my attempt and I tried to achieve the same by reading articles.
I did the same -
df['Date'] = pd.to_datetime(Date , format = '%d/%m/%Y')
data = df.drop(['Date'], axis=1)
data.index = df.Date
from statsmodels.tsa.vector_ar.vecm import coint_johansen
johan_test_temp = data
coint_johansen(johan_test_temp,-1,1).eig
#creating the train and validation set
train = data[:int(0.8*(len(data)))]
valid = data[int(0.8*(len(data))):]
freq=train.index.inferred_freq
from statsmodels.tsa.vector_ar.var_model import VAR
model = VAR(endog=train,freq=train.index.inferred_freq)
model_fit = model.fit()
# make prediction on validation
prediction = model_fit.forecast(model_fit.data, steps=len(valid))
cols = data.columns
pred = pd.DataFrame(index=range(0,len(prediction)),columns=[cols])
for j in range(0,4):
for i in range(0, len(prediction)):
pred.iloc[i][j] = prediction[i][j]
I have a validation set and prediction set. However the predictions are way worse than expected.
The plots of the dataset are -
1. % Variation
Capacity_Booked
Total bookings and searches
The output that I am receiving are -
Prediction dataframe -
Validation Dataframe -
As you can see that predictions are way off what is expected. Can anyone advise a way to improve the accuracy. Also, if I fit the model on whole data and then print the forecasts, it doesn't take into account that new month has started and hence to predict as such. How can that be incorporated in here. any help is appreciated.
EDIT
Link to the dataset - Dataset
Thanks
One manner to improve your accuracy is to look to the autocorrelation of each variable, as suggested in the VAR documentation page:
https://www.statsmodels.org/dev/vector_ar.html
The bigger the autocorrelation value is for a specific lag, the more useful this lag will be to the process.
Another good idea is to look to the AIC criterion and the BIC criterion to verify your accuracy (the same link above has an example of usage). Smaller values indicate that there is a bigger probability that you have found the true estimator.
This way, you can vary the order of your autoregressive model and see the one that provides the lowest AIC and BIC, both analyzed together. If AIC indicates the best model is with lag of 3 and the BIC indicates the best model has a lag of 5, you should analyze the values of 3,4 and 5 to see the one with best results.
The best scenario would be to have more data (as 3 months is not much), but you can try these approaches to see if it helps.

LSTM - Multivariate Time Series Predictions

I was reading the tutorial on Multivariate Time Series Forecasting with LSTMs in Keras
https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/#comment-442845
I have followed through the entire tutorial and got stuck with a problem which is as follows-
In this tutorial, the train and test splits have 8 features viz., 'pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain' at step 't-1', while the output feature is 'pollution' at current step 't'.
This is because, the framing of the dataset as a supervised learning problem is about predicting the 'pollution' at current hour/time step 't', given the pollution and weather measurements at the prior hour/time step 't-1'
After fitting the model to the training and testing data splits, what if I want to make predictions for a new dataset having 7 features since it does not have 'pollution' feature in it and I explicitly just want to predict for this one feature using the other 7 features.
Thanks for your help!
How do I handle such a situation? (while the remaining 7 features remain the same)
Edit-
Assume that my dataset has the following 3 features while training/fitting the model-
shop_number, item_number, number_of_units_sold
AFTER, I have trained the LSTM model, I get a dataset having the features-
'shop_number' AND 'item_number'.
The dataset DOES NOT have 'number_of_units_sold'.
The 'input_shape' argument in 'LSTM' has 1 as time step and 3 as features while training.
But while predicting, I have 1 time step but ONLY 2 features (as 'number_of_units_sold' is what I have to predict).
So how should I proceed?
If pollution is the last feature:
X = original_data[:,:,:-1]
Y = original_data[:,:,-1:]
If pollution is the first feature
X = original_data[:,:,1:]
Y = original_data[:,:,:1]
Else
i = index_of_pollution_feature
X = np.concatenate([original_data[:,:,:i], original_data[:,:,i+1:],axis=-1)
Y = original_data[:,:,i:i+1]
Make a model with return_sequences=True, stative=False and that's it.
Don't use Flatten, Global poolings or anything that removes the steps dimension.
If you don't have any pollution data at all for training, then you can't.

Delay gap between reality and prediction

Using machine learning (as library I've tried Tensorflow and Tflearn (which, I know is just a wrapping of Tensorflow)) I'm trying to predict the congestion in an area for the next week (see my previous questions if you want more backstory on it). My training set is composed of 400K tagged entry (with the date an congestion value for each minute).
My problem is that I now have a time gap between predictions and reality.
If I had to draw a chart with the reality and prediction you would see that my prediction, while have the same shape as the reality is in advance. She increase/decrease before the reality. It started to make me think that maybe my training had a problem. It would seem like that my prediction didn't start when my training ended.
Both of my data-sets (training/testing) are on 2 different file. First I train on my training set (for convenience sake let's say it end at 100th minutes and my testing set start at 101th minute), once my model saved I do my predictions, it should normally then start to predict 101 or am I wrong somewhere? Because it seem like it's starting to predict way way after my training stopped (if I keep my example it would start predicting value 107 for example).
For now one of a bad fix was to remove from the training set as many value as I had of delay (take this example, it would be 7) and it worked, no more delay but I don't understand why I have this problem or how to fix it so it wouldn't happen later.
Following some advices found on different website it seem like having gap in my training dataset (missing timestamp in this case) could be a problem, seeing that there do was some (in total around 7 to 9% of the whole dataset was missing) I've used Pandas to add the missing timestamps (I've also gave them the congestion value of the last know timestamp) while I do think that it may have helped a little (the gap is smaller) it haven't fixed the problem.
I tried multistep forecasting, multivariate forecasting, LSTM, GRU, MLP, Tensorflow, Tflearn but it change nothing making me think it could come from my training.
Here is my model training.
def fit_lstm(train, batch_size, nb_epoch, neurons):
X, y = train[:, 0:-1], train[:, -1]
X = X.reshape(X.shape[0], 1, X.shape[1])
print X.shape
print y.shape
model = Sequential()
model.add(LSTM(neurons, batch_input_shape=(None, X.shape[1], X.shape[2]), stateful=False))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
for i in range(nb_epoch):
model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
model.reset_states()
return model
The 2 shape are :
(80485, 1, 1)
(80485,)
(On this example I'm using only 80K of data as training for speed purpose).
As parameter I'm using 1 neuron, 64 of batch_size and 5 epoch.
My dataset is made of 2 file. First is the training file with 2 column:
timestamp | values
The second have the same shape but is the testing set (separated to avoid any influence of it on my prediction), the file is only used once every prediction have been made and to compare reality and prediction. The testing set start where the training set stop.
Do you have an idea of what could be the reason of this problem?
Edit:
On my code I have this function:
# invert differencing
yhat = inverse_difference(raw_values, yhat, len(test_scaled)+1-i)
# invert differenced value
def inverse_difference(history, yhat, interval=1):
return yhat + history[-interval]
It's supposed to invert the difference (to go from a scaled value to the real one).
When using it like in the pasted example (using the testing set) I get perfection, accuracy above 95% and no gap.
Since in reality we wouldn't know theses values I had to change it.
I tried first to use the training set but got the problem explained on this post:
Why is this happening? Is there an explanation for this problem?
Found it. It was a problem with the "def inverse_difference(history, yhat, interval=1):" function. In fact it make my result look like my last lines of training. This is why I had a gap, since there is a pattern in my data (peak always at more or less the same moment) I thought he was doing prediction while he was just giving me back values from training.

Categories