I want to create a custom classification model that also involves a transformation of the dataset. However, while performing the transformation, the size of the transformed data grows rapidly. To deal with this, I want to sample the training set; the test set, for obvious reasons, cannot be sampled.
I don't understand what I can do with sklearn's Pipeline to handle this situation.
I am trying to implement a machine learning algorithm that detects irregular ECG signals. I extracted some features, but I am not sure how to assemble a correct input for the classifier.
I have 20k different ECG signals, each with 1000 values. They are all labeled as correct or incorrect.
I chose, for example, the two features heart_rate and xposition_of_3_highest_peaks, but how do I feed them into the classifier?
Below is my attempt, but every time I add the second feature the score decreases. Why?
from sklearn import svm
from sklearn.preprocessing import StandardScaler
import numpy as np

clf = svm.SVC()

# average heart rate per signal, e.g. [64, 70, 48, 89, ..., 74, 58]
X_train_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_test))

# x-positions of the 3 highest peaks, e.g. [[23, 56, 89], [24, 45, 78], ..., [21, 58, 90]]
X_train_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_test))

X_tr = np.concatenate((X_train_heartRate, X_train_3_peaks), axis=1)
X_te = np.concatenate((X_test_heartRate, X_test_3_peaks), axis=1)

clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("Real solution:", Y_test)
print(clf.score(X_te, Y_test))
I am not sure whether the StandardScaler().fit_transform calls are necessary, or whether the np.concatenate is correct. Maybe there is even a better classifier for this use case?
Sorry I am a complete beginner, please be kind :)
When you apply any pre-processing transformation, you must fit it on the training data and apply that same fitted transformation to the validation / test data. The transformation must use the statistics of the training data, because you are assuming the validation / test data come from that same distribution. Therefore, you need to create an object that stores the transformation parameters learned from the training data, then apply it to the training and test data equally. Your performance decreased because you are not applying the training statistics consistently: you are scaling the two datasets with separate means and standard deviations, which can cause out-of-distribution predictions if your sample size isn't large enough.
Therefore, call fit_transform on the training data, then just transform on the validation / test data. fit_transform finds the scaling parameters for each column and applies them to the input, returning the transformed data; transform assumes an already-fitted scaler (such as one fitted by fit_transform) and applies the scaling accordingly. I sometimes like to separate the operations: a fit on the training data, then a transform on the training and validation / test data afterwards. This is a common source of confusion for new practitioners. You also need to keep the scaler object around so you can apply it to your validation / test data later.
from sklearn import svm
from sklearn.preprocessing import StandardScaler
import numpy as np

clf = svm.SVC()

# average heart rate per signal, e.g. [64, 70, 48, 89, ..., 74, 58]
heartRate_scaler = StandardScaler()
X_train_heartRate = heartRate_scaler.fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = heartRate_scaler.transform(fe.get_avg_heart_rate(X_test))

# x-positions of the 3 highest peaks, e.g. [[23, 56, 89], [24, 45, 78], ..., [21, 58, 90]]
three_peaks_scaler = StandardScaler()
X_train_3_peaks = three_peaks_scaler.fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = three_peaks_scaler.transform(fe.get_intervalls(X_test))

X_tr = np.concatenate((X_train_heartRate, X_train_3_peaks), axis=1)
X_te = np.concatenate((X_test_heartRate, X_test_3_peaks), axis=1)

clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("Real solution:", Y_test)
print(clf.score(X_te, Y_test))
Take note that you could also concatenate the features first and apply a single StandardScaler afterwards, because the scaler standardizes each feature/column independently. Scaling the two sets of features separately and concatenating them afterwards, as above, is no different from concatenating first and scaling after; see the sketch below.
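A minimal sketch of that equivalence (assuming, as your comments suggest, that fe.get_avg_heart_rate and fe.get_intervalls both return 2D arrays with one row per signal):

# Concatenate the raw features first, then fit a single scaler on the
# training data and reuse its statistics for the test data.
features_scaler = StandardScaler()
X_tr_raw = np.concatenate((fe.get_avg_heart_rate(X_train), fe.get_intervalls(X_train)), axis=1)
X_te_raw = np.concatenate((fe.get_avg_heart_rate(X_test), fe.get_intervalls(X_test)), axis=1)
X_tr = features_scaler.fit_transform(X_tr_raw)  # learns per-column mean/std on train
X_te = features_scaler.transform(X_te_raw)      # reuses the train statistics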
Minor Note
I forgot to ask about the fe object. What is it doing under the hood? Does it use the training data in any way to compute the features? You must make sure it operates on the training data's statistics for both the training and the test data, not on each set separately. Everything I said about matching pre-processing between training and validation / test applies to this fe object as well. I'll assume it either applies the training data's statistics to both sets, or is an independent transformation that needs no fitting. Either way, you haven't specified what it does under the hood, so I will assume the happy path.
Possible Improvement
Consider using a decision-tree-based algorithm such as a Random Forest classifier, which does not require scaling of the input features: its job is to partition the feature space of your data into N-dimensional hyperrectangles, with N being the number of features in your dataset (for N=2 these are 2D rectangles, for N=3 3D boxes, and so on). Depending on how your data is distributed, tree-based algorithms can do better, and they are among the first things people try in Kaggle competitions.
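A minimal sketch of what that could look like here, reusing the unscaled concatenated features (n_estimators=100 is just an illustrative default):

from sklearn.ensemble import RandomForestClassifier

# No scaling needed: trees split on raw feature thresholds.
X_tr = np.concatenate((fe.get_avg_heart_rate(X_train), fe.get_intervalls(X_train)), axis=1)
X_te = np.concatenate((fe.get_avg_heart_rate(X_test), fe.get_intervalls(X_test)), axis=1)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, Y_train)
print(rf.score(X_te, Y_test))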
Hello to all you great minds,
I'm trying to understand more rigorously how polynomial fitting works in scikit-learn. More specifically, I'm trying to break down the process and only show a dataframe with the new polynomial features generated from a single input value.
So I have data with several entries, each 1-dimensional. I want to generate a design matrix suitable for polynomial fitting. What I am currently doing is along these lines:
pd.DataFrame(PolynomialFeatures(k).fit_transform(X))
And this works as expected.
However, what I'm struggling with is the role of fit_transform(). As far as I'm concerned, I'm not trying to fit anything quite yet, merely to produce a dataframe with the newly constructed polynomial features. Naively I tried changing fit_transform() to transform(), but apparently I have to use fit before I am allowed to transform.
I would appreciate it if anyone could point out my error. I am not yet trying to fit a model to the data, only to create a design matrix with the polynomial features, so why do I have to use fit() (or fit_transform(), for that matter)? In fact, I don't really understand what fit() actually does here, and the documentation didn't help me wrap my head around it.
Thank you!
I think the reason for this is consistency with their API. When doing preprocessing you still want to "fit" to some training data and apply the same preprocessing step to the training AND the test data.
An example where this becomes clearer is standard scaling (a different preprocessing step). You calculate the mean and std from the training data and apply the same scaling (X - mean) / std to the training AND the test data (with the mean and std taken from the training data).
That is why the two methods fit and transform are separated.
In the case of polynomial features it arguably makes no sense to "fit", because no information is extracted from the training data and the step can be applied directly to the test data without knowing the training data. But including fit in PolynomialFeatures keeps it consistent with the whole API, and that consistency becomes necessary when you pipe multiple preprocessing steps. See the sketch below.
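A minimal sketch of that consistency argument (the data here is made up for illustration): PolynomialFeatures can sit in a Pipeline next to steps that genuinely need fitting, and the pipeline calls fit/transform uniformly on all of them:

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0], [5.0]])

# Stateless step: fit() only records the input dimension.
poly = PolynomialFeatures(degree=2)
print(pd.DataFrame(poly.fit_transform(X_train)))  # columns: 1, x, x^2

# In a pipeline, every step exposes fit/transform, stateful or not.
pipe = make_pipeline(PolynomialFeatures(degree=2), StandardScaler())
pipe.fit(X_train)              # the scaler learns train statistics
print(pipe.transform(X_test))  # the same transformation applied to test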
I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Now if I were to use this model on new data I could just save 'scaler' and load it into any new script.
I'm having trouble, though, understanding how this works for k-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can see how that works while building the model, but what if I want to use the model later on: which scaler should I save?
Further, I want to extend this to time-series data. I understand how the splitting works for time series, but how do I combine the scaling with CV there? In that case I would suggest saving the very last scaler, since it is fit on 4/5 of the data (for k=5) and on the most recent data. Would that be the correct approach?
Is it best practice to re-fit and transform the scaler on every fold?
Yes. You might want to read scikit-learn's doc on cross-validation:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction.
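The easiest way to get that behaviour is to wrap the scaler and the estimator in a Pipeline and cross-validate the pipeline, so the scaler is re-fit on each fold's training portion automatically. A minimal sketch (the estimator and the data names are placeholders):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), LogisticRegression())
# On each of the 5 folds, the scaler is fit on that fold's training
# split only and then applied to its held-out split.
scores = cross_val_score(pipe, X_train, y_train, cv=5)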
Which scaler should I save?
Save the scaler (and any other preprocessing, i.e. a pipeline) and the predictor trained on all of your training data, not just the (k-1)/k of it from cross-validation or the 70% from a single split; see the sketch below.
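A minimal sketch of that final step, continuing the pipeline from the sketch above and using joblib for persistence (the filename and X_new are placeholders):

from joblib import dump, load

pipe.fit(X_train, y_train)           # re-fit scaler + model on all training data
dump(pipe, 'model_pipeline.joblib')

# Later, in another script:
pipe = load('model_pipeline.joblib')
predictions = pipe.predict(X_new)    # X_new: whatever new data arrives later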
If you're doing a regression model, it's that simple.
If your model training requires a hyperparameter search using cross-validation (e.g., a grid search for xgboost learning parameters), then you have already gathered information from across folds, so you need another test set to estimate true out-of-sample model performance. (Once you have made this estimation, you can retrain yet again on the combined train+test data. This final step is not always done for neural networks that are parameterized for a particular sample size.)
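As a sketch of that setup (the estimator, grid, and data names are placeholders): hold out a test set, run the search with cross-validation inside the training portion only, then score the best pipeline on the untouched test set:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression()),
    param_grid={'logisticregression__C': [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_tr, y_tr)               # cross-validation runs on the training data only
print(search.score(X_test, y_test))  # out-of-sample estimate on the held-out test set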
I have two sets of data, let's say A and B.
I want to apply PCA and t-SNE to A and fine-tune the algorithms.
Once I am satisfied with my tuning, I want to save what was learnt to a pickle file.
Now I want to apply the same learnt PCA and t-SNE to set B.
I want t-SNE to produce the same results on B every time. I am hoping for this because I assume we can save the state of the learnt t-SNE parameters as well. If the parameters are the same, and I load the same file every time, the result of applying t-SNE to set B should be the same every time.
How can I do this in sklearn and Python?
I am sorry, I am new to ML and Python; this may be a very basic question.
Fine-tuning t-SNE amounts to tuning a heuristic algorithm for your data (it's ill-conditioned, after all; a higher-dimensional to lower-dimensional mapping is lossy).
Applying such a tuned and learnt mapping to other data is normally done with sklearn's transform method.
But you will see that there is no transform method for t-SNE, and the reason is given here (including further discussion):
It is a transductive learner, like many clustering algorithms: the model is not really applicable beyond the data points it is fed as training.
So whatever you tuned for dataset A does not really apply to dataset B (including the parameters)!
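A common workaround, sketched here under the assumption that you need a joint embedding of A and B rather than a reusable mapping, is to embed both sets in a single fit_transform call; fixing random_state makes the run reproducible:

import numpy as np
from sklearn.manifold import TSNE

# TSNE only offers fit_transform, not transform, so embed both sets at once.
AB = np.vstack([A, B])
embedding = TSNE(n_components=2, random_state=0).fit_transform(AB)
emb_A, emb_B = embedding[:len(A)], embedding[len(A):]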
For PCA this is trivial. Use the methods described in the docs on model persistence together with PCA's transform method (assuming compatible datasets; check the dimensions!).
To save the models you can use joblib:
from joblib import dump, load
After creating the PCA model and fitting it on set A:
from sklearn.decomposition import PCA
pca_model = PCA(n_components=n)
pca_model.fit(A)
you can save the fitted model in joblib format in the current directory, load it back later, and apply the learnt projection to set B:
dump(pca_model, 'pca_model.joblib')
pca_model = load('pca_model.joblib')
B_reduced = pca_model.transform(B)