Is it important to clean test data? - python

On the training data, I did feature engineering and cleaned my data. Is it important to do the same with the test data?
I know some basic modifications like label encoding, the dependent/independent feature split, etc. are required on the test data as well. But do we really need to CLEAN the test data before we make predictions?

I can't answer with a simple Yes or No, so let me start with the data distribution across your Train/Test/Dev sets.
According to Prof. Andrew Ng, the Test and Dev sets should come from the same distribution (YouTube), but the training set can come from a different distribution (check it here), and that is often a good thing to do.
Sometimes cleaning the training set is very useful, as is applying basic operations that speed up training (like normalization, which is not cleaning). But we're talking about training data that can and should have many thousands of examples, so you often can't check and clean the data manually, because it may simply not be worth it.
What do I mean? Let me show you an example:
Let's say you're building a cat classifier (cat or no-cat) and you have an accuracy of 90%, which means you have a 10% error rate.
After doing error analysis (check it here) you find out that:
6% of your error is caused by mislabeled images (no-cat images labeled as cat and vice versa).
44% is caused by blurry images.
50% is caused by images of big cats labeled as cats.
In this case, all the time you spend fixing the mislabeled images will improve your performance by at most 0.6% (because it's 6% of the overall 10% error), so IT'S NOT WORTH correcting the mislabeled data.
I gave an example about mislabeled data, but in general I mean any type of cleaning and fixing.
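As a quick sanity check on that arithmetic, here is a tiny sketch that computes the improvement ceiling for each error category from the numbers above:

```python
# Ceiling on overall accuracy gain from fixing one error category:
# (share of errors in that category) x (total error rate).
total_error = 0.10
error_shares = {"mislabeled": 0.06, "blurry": 0.44, "big cats": 0.50}

for category, share in error_shares.items():
    print(f"{category}: at most {share * total_error:.1%} accuracy to gain")
# mislabeled: at most 0.6%, blurry: at most 4.4%, big cats: at most 5.0%
```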
BUT cleaning the data in the test set may be easier, and it should be done on both the Test and Dev sets if possible, because your test set reflects the performance of your system on real-world data.
The operations you mentioned in your question are not really cleaning; they are used to speed up learning or to make the data appropriate for the algorithm, and applying them depends on the shape and type of the data (images, voice recordings, words...) and on the problem you're trying to solve.
In the end, as an answer, I can tell you that:
The form and shape of the data should be the same in all three sets (so label encoding should be applied to the whole data, not just the training data, and also to the input data used for prediction, because it changes the shape of the output label; see the sketch below).
The number of features should always be the same.
Any operation that changes the shape, form, number of features, etc. of the data should be applied to every single sample that you're going to use in your system.
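For instance, here is a minimal sketch (using sklearn's LabelEncoder on hypothetical labels) of keeping the encoding consistent across the sets by fitting the encoder once and reusing it everywhere:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels; the same fitted encoder must be reused so that
# "cat" maps to the same integer in train, dev, test, and at prediction time.
y_train = ["cat", "dog", "dog", "cat"]
y_test = ["dog", "cat"]

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)   # fit once, on the training labels only
y_test_enc = le.transform(y_test)         # reuse the same mapping on the test labels
```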

It depends:
Normalizing the data: If you normalized your training data, then yes, normalize the test data in exactly the way you normalized the training data. But be careful that you do not re-tune any parameters you tuned on the training data.
Filling missing values: same idea. Treat the test data the same way as the training data, but do not re-tune any of the parameters.
Removing outliers: probably not. The aim of the test set is to make an estimate about how well your model will perform on unseen data. So removing outliers will probably not be a good idea.
In general: only do things to your test data that you can/will also do on unseen data upon applying your model.
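As a minimal sketch of that rule (assuming sklearn and toy arrays), the imputer and scaler below are fitted on the training data only and then merely applied to the test data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[2.5, np.nan], [4.0, 50.0]])

# Fit the imputer and the scaler on the training data only...
imputer = SimpleImputer(strategy="mean").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

# ...then apply the already-fitted transforms to the test data, without refitting.
X_test_prepared = scaler.transform(imputer.transform(X_test))
```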

Related

Types of Training vs Test Error for Random Forest Classification Algorithm (Assessing Variance)

I have 2 questions that I would like to ascertain if possible (questions are bolded):
I've recently understood (I hope) the random forest classification algorithm, and have tried to apply it using sklearn on Python on a rather large dataset of pixels derived from satellite images (with the features being the different bands, and the labels being specific features that I outlined by myself, i.e., vegetation, cloud, etc). I then wanted to understand if the model was experiencing a variance problem, and so the first thought that came to my mind was to compare between the training and testing data.
Now this is where the confusion kicks in for me - I understand that there have been many different posts about:
How CV error should/should not be used compared to the out of bag (OOB) error
How by design, the training error of a random forest classifier is almost always ~0 (i.e., fitting my model on the training data and using it to predict on the same set of training data) - seems to be the case regardless of the tree depth
Regarding point 2, it seems that I can never compare my training and test error as the former will always be low, and so I decided to use the OOB error as my 'representative' training error for the entire model. I then realized that the OOB error might be a pseudo test error as it essentially tests trees on points that they did not specifically learn (in the case of bootstrapped trees), and so I defaulted to CV error being my new 'representative' training error for the entire model.
Looking back at the usage of CV error, I initially used it for hyperparameter tuning (e.g., max tree depth, number of trees, criterion type, etc), and so I was again doubting myself if I should use it as my official training error to be compared against my test error.
What makes this worse is that it's hard for me to validate what I think is true based on posts across the web, because each answers only a small part and they may contradict each other. So would anyone kindly help me with my predicament on what to use as my official training error to compare against my test error?
My second question revolves around how the OOB error might be a pseudo test error based on datapoints not selected during bootstrapping. If that were true, would it be fair to say this does not hold if bootstrapping is disabled (the algorithm is technically still a random forest as features are still randomly subsampled for each tree; it's just that the correlation between trees is probably higher)?
Thank you!!!!
Generally, you want to distinctly break a dataset into training, validation, and test. Training data are fed into the model, validation data are used to monitor the model's progress as it learns, and test data are used to see how well your model generalizes to unseen data. As you've discovered, depending on the application and the algorithm, you can mix up training and validation data or even forgo validation data entirely. For random forests, if you want to forgo a distinct validation set and just use OOB to monitor progress, that is fine. If you have enough data, I think it still makes sense to have a distinct validation set. No matter what, you should still reserve some data for testing. Depending on your data, you may even need to be careful about how you split it up (e.g. if there's unevenness in the labels).
As to your second point about comparing training and test sets, I think you may be confused. The test set is really all you care about. You can compare the two to see if you're overfitting, so that you can change hyperparameters to generalize more, but otherwise the whole point is that the test set is the sole truthful evaluation. If you have a really small dataset, you may need to bootstrap a number of models with a CV scheme like stratified CV to generate a more accurate test evaluation.
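As a rough sketch of how the two estimates can be compared side by side (synthetic data, sklearn; the OOB estimate requires bootstrap=True):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# OOB score: each tree is evaluated on the samples left out of its bootstrap,
# giving a built-in validation-style estimate without a separate hold-out set.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)

# Cross-validated estimate on the same data, for comparison
# (cross_val_score uses stratified folds for classifiers by default).
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X, y, cv=5
)
print("CV accuracy:", cv_scores.mean())
```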

Worse results when training on entire dataset

After finalizing the architecture of my model I decided to train the model on the entire dataset by setting validation_split = 0 in fit(). I thought this would improve the results based on these sources:
What is validation data used for in a Keras Sequential model?
Your model doesn't "see" your validation set and isn't in any way trained on it
https://machinelearningmastery.com/train-final-machine-learning-model/
What about the cross-validation models or the train-test datasets?
They’ve been discarded. They are no longer needed.
They have served their purpose to help you choose a procedure to finalize.
However, I got worse results without the validation set (compared to validation_split = 0.2), leaving all other parameters the same.
Is there an explanation for this? Or was it just by chance that my model happened to perform better on the fixed test data when a part of the training data was excluded (and used as validation)?
Well, that's a very good question, touching on a lot of machine learning concepts, especially the bias-variance tradeoff.
As #CrazyBarzillian hinted in the comments, more data might be leading to over-fitting, and yes, we need more info about your data to come to a definite solution. But in a broader way I would like to explain a few points that might help you understand why it happened.
EXPLANATION
Whenever your data has a large number of features, your model learns a very complex function to fit it. In short, the model is too complicated for the amount of data we have. This situation, known as high variance, leads to over-fitting. We know that we are facing a high-variance issue when the training error is much lower than the test error. High-variance problems can be addressed by reducing the number of features (by applying PCA, outlier removal, etc.) or by increasing the number of data points, that is, adding more data.
Sometimes you have too few features in your data, and hence the model learns a very simple function. This is known as high bias. In this case, adding more data won't help; less data will do the work, or adding more features will help.
MY ASSUMPTION
I guess your model is suffering from high bias if it performs poorly when more data is added. But to check whether the statement "adding more data leads to poor results" is correct in your case, you can do the following things:
play with some hyperparameters
try other machine learning models
instead of accuracy scores, look at R² scores or mean absolute error in the case of regression, or F1, precision and recall in the case of classification
If after doing these things you still find that more data leads to poorer results, then you can be fairly confident of high bias and can either increase the number of features or reduce the data.
SOLUTION
By reducing the data, I mean use less data but better data. Better data means, for example, that in a classification problem with three classes (A, B and C), the datapoints are balanced across the three classes. Your data should be balanced. If it is unbalanced, that is, class A has a large number of samples while classes B and C have only 3-4 samples, then you can apply random sampling techniques to overcome it (see the sketch after the list below).
How to make BETTER DATA
Balance the data
Remove outliers
Scale (Normalize) the data
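A minimal sketch of the random over-sampling idea above, assuming the imbalanced-learn package and toy data with three unbalanced classes:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler  # assumed dependency: imbalanced-learn

X = np.random.randn(110, 4)
y = np.array(["A"] * 100 + ["B"] * 6 + ["C"] * 4)     # heavily unbalanced classes

ros = RandomOverSampler(random_state=0)
X_balanced, y_balanced = ros.fit_resample(X, y)
print(np.unique(y_balanced, return_counts=True))      # every class now has 100 samples
```

(Apply this to the training split only; the test set should keep its natural class balance.)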
CONCLUSION
It is a myth that more data always leads to a better model. Actually, beyond quantity, the quality of the data also matters; data should have both quantity and quality. This game of balancing quality and quantity is closely tied to the bias-variance tradeoff.

Do we apply fourier/wavelet transform to entire time series or only the training set for forecasting?

I was trying to implement the WSAE-LSTM model from the paper A deep learning framework for financial time series using stacked autoencoders and long-short term memory
. In the first step, Wavelet Transform is applied to the Time Series, although the exact implementation is not outlined in the paper.
The paper hints at applying Wavelet Transform to the whole dataset. I was wondering if this leaks data from testing to training? This article also identifies this problem.
From the article-
I’m sure you’ve heard many times that whenever you’re normalizing a time series for a ML model to fit your normalizer on the train set first then apply it to the test set. The reason is quite simple, our ML model behaves like mean reverter so if we normalize our entire dataset in one go we’re basically giving our model the mean value it needs to revert to. I’ll give you a little clue if we knew the future mean value for a time series we wouldn’t need machine learning to tell us what trades to do ;)
It’s basically the exact same issue as normalising your train and test set in one go. You’re leaking future information into each time step and not even in a small way. In fact you can run a little experiment yourself; the higher a level wavelet transform you apply, miraculously the more “accurate” your ML model’s output becomes.
Can someone tell me if Wavelet Transform "normalizes" the dataset which would lead to data leakage when forecasting? Should it be applied to the whole dataset or only the training dataset?
You are correct. Applying the transform to the whole dataset introduces "tachyons" (leakage from the future) and makes the analysis essentially worthless.
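To illustrate the difference, here is a rough sketch (assuming PyWavelets and a synthetic series; the soft-threshold denoising recipe is just one common choice, not the paper's exact procedure) contrasting a leaky whole-series transform with a causal, expanding-window version that only ever sees past data:

```python
import numpy as np
import pywt  # assumed dependency: PyWavelets

def wavelet_denoise(window, wavelet="db4", level=2):
    """Soft-threshold wavelet denoising of a 1-D array."""
    coeffs = pywt.wavedec(window, wavelet, level=level)
    # Universal threshold estimated from the finest detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(window)))
    coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(window)]

series = np.random.randn(500).cumsum()

# Leaky: the value at time t is influenced by future observations.
leaky = wavelet_denoise(series)

# Leakage-free: at each step, transform only the data observed so far and keep
# the last value, so no future point influences the feature at time t.
min_history = 64
causal = np.full_like(series, np.nan)
for t in range(min_history, len(series)):
    causal[t] = wavelet_denoise(series[: t + 1])[-1]
```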

Improving prediction accuracy in Bayesian Causal Network

I would like to determine the causes of an unexpected outcome (or anomaly) in a thermodynamic process. I have continuous data for the associated variables and am trying to make use of a 'Bayesian Network (BN)' to determine the causal relationships. For this purpose, I used a library called 'Causalnex' in Python.
I have followed the tutorial section of this library to build the DAG and the BN model, and everything works fine up to the prediction step. The prediction results for the minority/less-represented classes have an accuracy of around 60-70% (80-90% with SMOTE/SMOTETomek and a particular random state), whereas a stable accuracy of more than 90% is expected. I have implemented the following data-preprocessing steps.
Ensuring no missing/NaN values
Discretization (only discretized data is supported by the library)
SMOTE/SMOTETomek for data balancing
Various train/test size combinations
I am struggling to figure out how to optimize the model. I could not find any supporting material on the Internet for this.
Are there any guidelines or 'best practices' for data pre-processing techniques and dataset requirements that work particularly well for this library/BN model? Could you please suggest any troubleshooting methods to identify the causes of the low accuracy/metrics? Perhaps a misunderstood node-to-node causal relationship in the DAG causes the mediocre accuracy?
Any ideas/literature/other suitable library regarding this would be of great help!
A few tips that can help:
Changing/Tuning the Structure learning.
Trying different thresholds. When doing from_pandas, you can experiment with different w_threshold values (and the beta term, if you are using from_pandas_lasso).
This changes the density of the network. A denser structure implies a BN with more parameters, which may perform better; if it is too dense, though, you may not have enough data to train it and may overfit.
Center the data. Empirically, it seems that NOTEARS (the algorithm behind from_pandas) works best if the data is centered, so subtracting the mean from the data may be a good idea (see the sketch after this list).
Ensure causality. NOTEARS does not ensure causality. So we need "experts" to judge the output and make the necessary modifications. If you see edges that don't make causal sense, you can either remove them or add them as tabu_edges and train your network again.
Experiment with discretisation. The performance can be very sensitive to how you discretise the data. Experimenting with various types of discretisation can help. You can use:
Methods available in Causalnex (uniform, for example)
fixed discretisations based on what thresholds make sense for your data
MDLP is a supervised way to discretise data. You can apply MDLP to each node, using one of its children as the "target". There are 2 main packages for MDLP on PyPI: mdlp and mdlp-discretization
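A rough sketch of the first three tips, assuming the from_pandas API used in the Causalnex tutorial; the file and column names are hypothetical:

```python
import pandas as pd
from causalnex.structure.notears import from_pandas  # assumed Causalnex import path

df = pd.read_csv("process_data.csv")   # hypothetical file with the continuous process variables
df_centered = df - df.mean()           # centering, as suggested above

# Sweep w_threshold to control how dense the learned structure is.
for w in (0.3, 0.5, 0.8):
    sm = from_pandas(df_centered, w_threshold=w)
    print(w, len(sm.edges))

# Forbid edges that make no causal sense and re-learn the structure.
sm = from_pandas(
    df_centered,
    w_threshold=0.5,
    tabu_edges=[("outcome", "sensor_1")],  # hypothetical node names
)
```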

Split into test and train set before or after generating document term matrix?

I'm working on simple machine learning problems, and I am trying to build a classifier that can differentiate between spam and non-spam SMS. I'm confused as to whether I need to generate the document-term matrix before splitting into test and train sets, or whether I should generate the document-term matrix after splitting.
I tried it both ways and found that the accuracy is slightly higher when I split the data before generating the document-term matrix. But to me, this makes no sense. Shouldn't the accuracy be the same? Does the order of these operations make any difference?
Qualitatively, it might seem that it shouldn't matter which way you do it. However, proper procedure requires that you keep your training and test data entirely separate. The overall concept is that the test data are not directly represented in the training; this helps reduce over-fitting. The test data (and later validation data) are samples that the trained model has never encountered during training.
Therefore, the test data should not be included in your pre-processing -- the document-term matrix. This breaks the separation, in that the model has, in one respect, "seen" the test data during training.
Quantitatively, you need to do the split first, because that matrix is to be used for training the model against only the training set. When you included the test data in the matrix, you obtained a matrix that is slightly inaccurate in representing the training data: it no longer properly represents the data you're actually training against. This is why your model isn't quite as good as the one that followed proper separation procedures.
It's a subtle difference, most of all because the training and test sets are supposed to be random samples of the same population of possible inputs. Random differences provide the small surprise you encountered.
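For example, a minimal sketch with sklearn's CountVectorizer and a few hypothetical SMS strings, where the vocabulary (and hence the document-term matrix) is learned from the training split only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

sms = ["win a free prize now", "are we still on for lunch",
       "free entry call now", "see you at 7"]
labels = [1, 0, 1, 0]   # hypothetical spam / not-spam labels

# Split first, then build the document-term matrix from the training texts only.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    sms, labels, test_size=0.5, random_state=0
)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)  # vocabulary learned from the training set
X_test = vectorizer.transform(X_test_txt)        # test set mapped onto that vocabulary
```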
