Testing new data on a trained LGBM model - python

I am a newbie to ML and am trying to replicate a price optimization solution available at https://www.kaggle.com/tunguz/more-effective-ridge-lgbm-script-lb-0-44823?source=post_page-------
I followed the code as given and am now trying to test it on new data. However, it is not predicting the price correctly at all. I make sure to save the trained model/vectorizers, load them fresh, and transform the new data as the model requires, just as was done for the training set.
The issue is: if my new data is exactly the same as the test dataset (600k+ rows) used while testing the model, it returns exactly the same results as during the test prediction. But if I use only, for example, the first 10 rows of it, the predictions do not match the existing results at all, even though I transform the features through the saved vectorizers.
#below is while training the model
cvname = CountVectorizer(min_df=NAME_MIN_DF)
X_name = cvname.fit_transform(merge['name'])
pickle.dump(cvname, open("namevector.pkl", "wb"))
...
#after completing the training, and loading the new data
handle_missing_inplace(mytest)
cutting(mytest)
to_categorical(mytest)
cv1 = pickle.load(open("namevector.pkl", "rb"))
X_name1 = cv1.transform(mytest['name'])
cv2 = pickle.load(open("categoryvector.pkl", "rb"))
X_category1 = cv2.transform(mytest['category_name'])
tv1 = pickle.load(open("descriptionvector.pkl", "rb"))
X_description1 = tv1.transform(mytest['item_description'])
lb1 = pickle.load(open("brandvector.pkl", "rb"))
X_brand1 = lb1.transform(mytest['brand_name'])
t1 = pd.get_dummies(mytest[['item_condition_id', 'shipping']],sparse=True)
X_dummies1 = csr_matrix(t1.values.astype('int64'))
sparse_merge1 = hstack((X_dummies1, X_description1, X_brand1, X_category1, X_name1)).tocsr()
X_test1 = sparse_merge1
my_pred = pkl_bst1.predict(X_test1)
mysubmission['price'] = np.expm1(my_pred)
Can anyone please let me know what I am missing? The model worked fine on the train and test datasets, but not on new data, or even on a small subset of the test dataset.

This is usually called overfitting, or perhaps underfitting. Like any other ML algorithm, LGBM is susceptible to both.
It means the model does very well on the training and test data but performs poorly on new data: the model is not generalizing well, it is just memorizing the training data.
There are some suggestions here on how to deal with overfitting in LGBM in particular, but there is a lot of information about the issue in general that you should take the time to read. Google is the usual starting point.
Collecting more data is sometimes the way to deal with the problem. Hundreds of thousands of rows, or millions. Machine learning is a data-hungry business.
You will have to tweak some model parameters and do a lot of training until your predictions start to improve, if ever. This is called parameter tuning.
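If you go down the parameter-tuning route, here is a minimal sketch of the LightGBM knobs that usually matter for overfitting. The specific values, the X_train/X_valid split, and the lgb.early_stopping callback API of recent LightGBM versions are assumptions for illustration, not part of the original script.
import lightgbm as lgb
params = {
    'objective': 'regression',
    'metric': 'rmse',
    'learning_rate': 0.05,     # smaller steps, more boosting rounds
    'num_leaves': 31,          # lower values reduce tree complexity
    'min_data_in_leaf': 100,   # larger values stop leaves fitting noise
    'feature_fraction': 0.7,   # column subsampling
    'bagging_fraction': 0.8,   # row subsampling
    'bagging_freq': 1,
    'lambda_l2': 1.0,          # L2 regularization
}
# X_train/y_train and X_valid/y_valid are assumed to be a split of the training data.
train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
# Early stopping on a held-out set is usually the most effective guard against overfitting.
model = lgb.train(params, train_set, num_boost_round=5000,
                  valid_sets=[valid_set],
                  callbacks=[lgb.early_stopping(stopping_rounds=100)])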
That's the tough side of ML.
Don't get discouraged though.


Will it lead to Overfitting / Curse of Dimensionality

The dataset contains:
15000 Observations/Rows
3000 Features/Columns
Can I train a machine learning model on this dataset?
Yes, you can apply an ML model, but before doing so you need to understand your problem statement together with all of the feature names available in the dataset. If the dataset is big, try grouping it into a couple of clusters, or else take a small sample to analyze what your data is telling you.
That is where population and sampling come into practical use.
You should check whether the accuracy on the train set and on the test set are comparable; if not, your model is memorizing instead of learning, and this is where regularization in machine learning comes into the picture.
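As a rough illustration of that check (X, y, and the regularization strength are placeholders, and logistic regression is just one possible regularized model):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Smaller C means stronger L2 regularization in scikit-learn's LogisticRegression.
clf = LogisticRegression(C=0.1, max_iter=1000)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy:", clf.score(X_test, y_test))
# A large gap between the two scores suggests memorizing rather than learning.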
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
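A minimal cross-validation sanity check might look like this (the random forest is only a placeholder estimator; X and y are your features and labels):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Each fold is trained on 4/5 of the data and scored on the unseen 1/5.
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print("fold scores:", scores)
print("mean CV accuracy:", scores.mean())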

How to perform multi-step out-of-time forecast which does not involve refitting the ARIMA model?

I have an existing ARIMA(p, d, q) model fit to time-series data (for example, data[0:100]) using Python. I would like to make forecasts (forecast[100:120]) with this model. However, given that I also have the future true data (e.g. data[100:120]), how do I ensure that the multi-step forecast takes the true future data into account instead of using the data it forecasted?
In essence, when forecasting I would like forecast[101] to be computed using data[100] instead of forecast[100].
I would like to avoid refitting the entire ARIMA model at every time step with the updated "history".
I fit the ARIMAX model as follows:
train, test = data[:100], data[100:]
ext_train, ext_test = external[:100], external[100:]
model = ARIMA(train, order=(p, d, q), exog=ext_train)
model_fit = model.fit(disp=False)
Now, the following code allows me to predict values for the entire dataset, including the test period:
forecast = model_fit.predict(end=len(data)-1, exog=external, dynamic=False)
However, in this case, after 100 steps the ARIMAX predicted values quickly converge to the long-run mean (as expected, since after 100 time steps it is using the forecasted values only). I would like to know if there is a way to provide the "future" true values to give better online predictions. Something along the lines of:
forecast = model_fit.predict_fn(end = len(data)-1, exog=external, true=data, dynamic=False)
I know I can always keep refitting the ARIMAX model by doing
historical = train
historical_ext = ext_train
predictions = []
for t in range(len(test)):
    model = ARIMA(historical, order=(p, d, q), exog=historical_ext)
    model_fit = model.fit(disp=False)
    output = model_fit.forecast(exog=ext_test[t])[0]
    predictions.append(output)
    observed = test[t]
    historical.append(observed)
    historical_ext.append(ext_test[t])
but this leads to training the ARIMAX model again and again, which doesn't make a lot of sense to me. It uses a lot of computational resources and is quite impractical. It further makes it difficult to evaluate the ARIMAX model, because the fitted params keep changing at every iteration.
Is there something incorrect about my understanding/use of the ARIMAX model?
You are right: if you want to do online forecasting using new data, you will need to estimate the parameters over and over again, which is computationally inefficient.
One thing to note is that, for the ARIMA model, it is mainly the estimation of the MA parameters that is computationally heavy, since these parameters are estimated using numerical optimization rather than ordinary least squares. Since, after calculating the parameters once for the initial model, you know roughly what to expect for future models (one extra observation won't change them much), you can initialize the parameter search at the previous estimates to improve computational efficiency.
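As a rough sketch of that idea, reusing the variables from the question and assuming the start_params argument of statsmodels' fit method:
# Reuse the previous estimates as the optimizer's starting point for the next fit.
prev_params = model_fit.params
new_model = ARIMA(data[:101], order=(p, d, q), exog=external[:101])
new_fit = new_model.fit(start_params=prev_params)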
Also, there may be a way to do the estimation more efficiently: since you have your old data and the parameters of the previous model, the only new thing is one additional data point. This means you only need to calculate the theta and phi terms for the combinations involving the new data point, while not computing the known combinations again, which would save quite some time. I very much like this book: Heij, Christiaan, et al. Econometric Methods with Applications in Business and Economics. Oxford University Press, 2004.
And this lecture might give you some idea of how this might be feasible: lecture on ARIMA parameter estimation
You would have to implement this yourself, I'm afraid. As far as I can tell, there is nothing readily available to do this.
Hope this gives you some new ideas!
As this very good blog suggests (3 facts about time series forecasting that surprise experienced machine learning practitioners):
"You need to retrain your model every time you want to generate a new prediction", it also gives the intuitive understanding of why this happens with examples. That basically highlights time-series forecasting challenge as a constant change, that needs refitting.
I was struggling with this problem. Luckily, I found a very useful discussion about it. As far as I know, this case is not supported by ARIMA in Python; we need to use SARIMAX instead.
You can refer to the link of discussion: https://github.com/statsmodels/statsmodels/issues/2788
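Based on that discussion, a hedged sketch with SARIMAX could look like the following. It assumes the results.append(..., refit=False) method of statsmodels >= 0.11, which feeds new observations into the state without re-estimating the parameters, so each one-step-ahead forecast is conditioned on the true data observed so far.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train, exog=ext_train, order=(p, d, q))
results = model.fit(disp=False)
predictions = []
for t in range(len(test)):
    # One-step-ahead forecast using the exogenous value for that step.
    fc = results.forecast(steps=1, exog=ext_test[t:t + 1])
    predictions.append(np.asarray(fc)[0])
    # Append the observed value without refitting the parameters.
    results = results.append(test[t:t + 1], exog=ext_test[t:t + 1], refit=False)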

Is the test set used to update weights in a deep learning model with Keras?

I'm wondering whether the results on the test set are used to optimize the model's weights. I'm trying to build a model, but the issue is that I don't have much data because the subjects are medical research patients. The number of patients is limited in my case (61) and I have 5 feature vectors per patient. What I tried is to create a deep learning model by excluding one subject and using the excluded subject as the test set. My problem is that there is large variability in the subject features, and my model fits the training set (60 subjects) well but not the 1 excluded subject.
So I'm wondering if the test set (in my case the excluded subject) could be used in some way to make the model converge so it classifies the excluded subject better.
You should not use the test data of your dataset in your training process. If your training data is not enough, one approach used a lot these days (especially for medical images) is data augmentation, so I highly recommend using this technique in your training process. How to use Deep Learning when you have Limited Data is a good tutorial about data augmentation.
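A minimal sketch of image augmentation with Keras' ImageDataGenerator is shown below; it assumes image inputs as in the linked tutorial (for plain feature vectors, a simpler augmentation such as adding small Gaussian noise plays a similar role).
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Each transform below generates slightly altered copies of the training images.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.1,
                             horizontal_flip=True)
# x_train and y_train are assumed to be image arrays and labels;
# the model then trains on the augmented stream, e.g.:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=50)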
No, you shouldn't use your test set for training, in order to prevent overfitting. If you follow cross-validation principles, you need to split your data into three datasets: a train set you use to train your model, a validation set to try different values of your hyperparameters, and a test set to finally evaluate your model. If you use all your data for training, your model will obviously overfit.
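A rough sketch of that three-way split (X and y are placeholders for your features and labels; the split ratios are arbitrary):
from sklearn.model_selection import train_test_split
# Hold out 20% as the final test set, untouched until the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Carve a validation set out of the remaining data for hyperparameter tuning.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
# Train on (X_train, y_train), tune on (X_val, y_val), report once on (X_test, y_test).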
One thing to remember: deep learning works well when you have large and very rich datasets.

Real time data using sklearn

I have a real-time data feed of patient health data that I connect to with Python. I want to run some sklearn algorithms over this data feed so that I can predict in real time if someone is going to get sick. Is there a standard way in which one connects real-time data to sklearn? I have traditionally had static datasets and never an incoming stream, so this is quite new to me. If anyone has some general rules/processes/tools for this, that would be great.
With most algorithms, training is slow and predicting is fast. Therefore it is better to train offline using training data, and then use the trained model to predict each new case in real time.
Obviously you might decide to train again later if you acquire more/better data. However, there is little benefit in retraining after every case.
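A rough sketch of that pattern (get_next_record, alert, and the classifier choice are placeholders, not part of any specific library):
from sklearn.ensemble import RandomForestClassifier
# Train once, offline, on the historical static dataset.
clf = RandomForestClassifier(n_estimators=200)
clf.fit(X_historical, y_historical)
# Then score each incoming record from the live feed as it arrives.
while True:
    record = get_next_record()              # one row of features from the feed (placeholder)
    risk = clf.predict_proba([record])[0, 1]
    if risk > 0.8:
        alert(record, risk)                 # placeholder for whatever action you take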
It is feasible to train the model on a static dataset and predict classifications for incoming data with that model. Retraining the model with each new set of patient data, not so much. That also breaks the train/test approach to validating an ML model.
Trained models can be saved to file and imported in the code used for real time prediction.
In Python's scikit-learn, this is done via the pickle package.
In R, a model can be saved to an .rds object with saveRDS.
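For example, a minimal save/load round trip with pickle might look like this (clf and new_data are placeholders; joblib.dump/joblib.load is a common alternative for large models):
import pickle
# After training, persist the fitted model to disk.
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)
# In the real-time prediction process, load it back and predict.
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)
prediction = clf.predict(new_data)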
yay... my first time answering an ML question!

Machine Learning udacity

What does this code mean?
Can you explain it to me:
features_train, labels_train, features_test, labels_test = makeTerrainData()
def submitAccuracy():
    return acc
In machine learning development you want to split your available data into train/test sets and, if possible, an additional validation set. You do this to test for overfitting and to ensure your model generalizes to unseen observations. The final validation set is often useful because, without realizing it, users will often try to optimize their parameters against the test-partition accuracy, and in doing so are basically giving the model hints about that data. The validation set lets you check that this hasn't occurred and that your model isn't overfit.
From just the code provided, features_train likely corresponds to the actual data used to develop the model, i.e. the train partition. The labels are the categories you are trying to predict.
The test partition is simply a random sample of your available data. Features/labels are the same as above.
You want to build the model off of the training data, and assess accuracy on the test partition.
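A minimal sketch of that workflow, reusing the names from the question (the decision tree is an arbitrary choice, and makeTerrainData() is assumed to come from the course materials):
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
features_train, labels_train, features_test, labels_test = makeTerrainData()
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)    # build the model on the training partition
pred = clf.predict(features_test)        # predict labels for the held-out test partition
acc = accuracy_score(labels_test, pred)  # accuracy on data the model never saw
def submitAccuracy():
    return acc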
Sebastian Raschka provides a marvelous overview of machine learning in Python. The code samples and some explanations can be found at https://github.com/rasbt/python-machine-learning-book/tree/master/code
