I am currently training a 3D CNN for binary classification with relatively sparse labels (~ 1% of voxels in label data correspond to target class).
In order to perform basic sanity checks during the training (e.g. does the network learn at all?) it would be handy to present the network with a small, handpicked subset of training examples having an above-average fraction of target class labels.
As suggested by the PyTorch documentation, I implemented my own dataset class (inheriting from torch.utils.data.Dataset) which provides training examples via its __getitem__ method to the torch.utils.data.DataLoader.
In the pytorch tutorials I found, the DataLoader is used as an iterator to generate the training loop like so:
for i, data in enumerate(self.dataloader):
    # Get training data
    inputs, labels = data
    # Train the network
    # [...]
What I am wondering now is whether there exists a simple way to load a single or a couple of specific training examples (using the linear index understood by Dataset's __getitem__ method). However, DataLoader does not have a __getitem__ method, and repeatedly calling __next__ until I reach the desired index does not seem elegant.
Apparently one possible way to solve this would be to define a custom sampler or batch_sampler inheriting from the abstract torch.utils.data.Sampler. But this seems over the top to retrieve a few specific samples.
I suppose I am overlooking something very simple and obvious here. Any advice appreciated!
Just in case anyone with a similar question comes across this at some point:
The quick-and-dirty workaround I ended up using was to bypass the dataloader in the training loop by directly accessing its associated dataset attribute. Suppose we want to quickly check whether our network learns at all by repeatedly presenting it with a single, handpicked training example with linear index sample_idx (as defined by the dataset class).
Then one can do something like this:
for i, _ in enumerate(self.dataloader):
    # Get training data
    # inputs, labels = data
    inputs, labels = self.dataloader.dataset[sample_idx]
    # Add the batch dimension the DataLoader would normally provide
    inputs = inputs.unsqueeze(0)
    labels = labels.unsqueeze(0)
    # Train the network
    # [...]
EDIT:
One brief remark, since some people seem to be finding this workaround helpful: When using this hack I found it to be crucial to instantiate the DataLoader with num_workers = 0. Otherwise, memory segmentation errors might occur in which case you could end up with very weird looking training data.
If you have defined
train_set = torchvision.datasets.CIFAR10(root='~/datasets/', train=True,
download=True, transform=(transform['train']))
then you can do something like
train_set.data[index] where index is the index of the specific example you want.
Now you can redefine your Dataset class with a new dataset that contains only these specific examples, and there you have it.
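One way to build such a reduced dataset without writing a new class or a custom sampler is torch.utils.data.Subset, which wraps an existing dataset and exposes only the indices you pass in. The following is a minimal sketch, not the original poster's code: ToyDataset and the index list are hypothetical stand-ins for your own dataset and handpicked examples.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset


class ToyDataset(Dataset):
    """Hypothetical stand-in for the real dataset: item i is a volume filled with i."""

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        inputs = torch.full((1, 4, 4, 4), float(idx))  # (channels, D, H, W)
        label = torch.tensor(idx % 2)
        return inputs, label


full_set = ToyDataset()
handpicked = [3, 17, 42]  # linear indices as understood by __getitem__

# Subset only exposes the chosen indices; the DataLoader batches them as usual.
subset = Subset(full_set, handpicked)
loader = DataLoader(subset, batch_size=2, shuffle=False)

batches = [(inputs, labels) for inputs, labels in loader]
```

Because the DataLoader still does the batching, no manual unsqueeze(0) is needed, and the rest of the training loop can stay unchanged.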
I am trying to add data for my scikit-learn model after it has already been trained. For example, I have the data that I used in the beginning (there are about 250 of them). After that, I need to train this model one more time by calling the function, and so on. The only thing that came to my mind was to add new values to the existing data array every time and train the model again, but this is very resource-intensive and takes more time.
Is there another way to train the machine learning model?
model = LinearRegression().fit(test, result)
model.predict(task)
### and here I want to add some data, for example one or two examples like:
model.addFit(one_test, one_result)
The short answer in your case (using the sklearn.linear_model.LinearRegression model) is no: it is not possible to add one or two more examples and train without adding them to the original training set and fitting everything at once. Under the hood, the model simply uses Ordinary Least Squares (described here), which requires the complete matrix of training data to fit the model. However, this algorithm is very fast, and with only a few hundred training examples it is very quick to recalculate the parameters of the model each time a couple of new examples arrive.
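As a rough illustration of the "append and refit" approach, here is a minimal sketch on synthetic data. The helper name refit_with, the array shapes, and the noise-free target are assumptions made for the example, not part of the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the ~250 original examples (noise-free linear target).
true_coef = np.array([1.5, -2.0, 0.5])
X_train = rng.normal(size=(250, 3))
y_train = X_train @ true_coef + 4.0

model = LinearRegression().fit(X_train, y_train)


def refit_with(model, X_old, y_old, X_new, y_new):
    """Append the new examples to the stored training data and refit from scratch."""
    X_all = np.vstack([X_old, X_new])
    y_all = np.concatenate([y_old, y_new])
    model.fit(X_all, y_all)
    return X_all, y_all


# One or two new examples arrive later.
X_new = rng.normal(size=(2, 3))
y_new = X_new @ true_coef + 4.0
X_train, y_train = refit_with(model, X_train, y_train, X_new, y_new)
```

At this scale each refit takes a negligible amount of time, so keeping the full training arrays around and calling fit again is usually the simplest option.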
Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and also several questions on SO (e.g. this one), but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply shap here for each model, there will be 250 outputs! Thus, I want to get a single output which will present the performance of the 250 models.
You seem to be training a model on 250 datapoints while doing LOOCV. This is about choosing a model with hyperparams that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparams -- note, 250 LOOCV is already overkill. Will you do that with 250'000 rows? -- you are rather trying to understand which features influence output in what direction and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values. But do you expect the result to be much different from that of a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features, hyperparams may be different, depending on your definition of iteration). It's still the same dataset (same features)
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed resulting model to SHAP explainer and you'll get what you want.
There are two methods available when we build a model with sklearn.cluster.KMeans. The first is fit() and the other is fit_predict(). My understanding is that when we use the fit() method on a KMeans model, it gives us an attribute labels_ which holds the info on which observation belongs to which cluster. fit_predict() also has a labels_ attribute.
So my questions are:
If fit() fulfills the need, then why is there fit_predict()?
Are fit() and fit_predict() interchangeable while writing code?
KMeans is just one of the many models that sklearn has, and many share the same API. The basic functions are fit, which teaches the model using examples, and predict, which uses the knowledge obtained by fit to answer questions on potentially new values.
KMeans automatically predicts the cluster of all the input data during training, because doing so is integral to the algorithm. It keeps the labels around for efficiency, because predicting the labels for the original dataset is very common. Thus, fit_predict adds very little: it is just a convenience method that calls fit and then returns the labels of the training dataset, i.e. .labels_. (fit_predict is a method, so it has no labels_ attribute of its own; it simply returns the labels.)
However, if you want to train your model on one set of data and then use this to quickly (and without changing the established cluster boundaries) get an answer for a data point that was not in the original data, you would need to use predict, not fit_predict.
In other models (for example sklearn.neural_network.MLPClassifier), training can be a very expensive operation, so you may not want to re-train a model every time you want to predict something; also, it may not be a given that the prediction result is generated as a part of the training. Or, as discussed above, you just don't want to change the model in response to new data. In those cases, you cannot get predictions from the result of fit: you need to call predict with the data you want to get a prediction on.
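Also note that labels_ ends with an underscore. In scikit-learn, a trailing underscore marks an attribute that is estimated from the data and only exists after fit has been called; it is not the Python leading-underscore convention for "don't touch this, it's private". Reading labels_ is therefore fine for the training data, but for any other data you should use the established API, i.e. predict.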
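The relationship between the three calls can be checked directly. This is a small sketch on synthetic, well-separated blobs; the data, the number of clusters, and the fixed random_state are assumptions chosen so that two separate fits are comparable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated blobs: 20 points near (0, 0) and 20 points near (5, 5).
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])

# fit() stores the training labels on the estimator.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels_from_fit = km.labels_

# fit_predict() fits and returns those same training labels in one call.
labels_from_fit_predict = KMeans(
    n_clusters=2, n_init=10, random_state=0
).fit_predict(X)

# predict() assigns unseen points to the existing clusters without refitting.
new_points = np.array([[0.1, -0.1], [4.9, 5.2]])
new_labels = km.predict(new_points)
```

With the same data and random_state, labels_from_fit and labels_from_fit_predict are identical, while predict is the only one of the three that works on data that was not part of the fit.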
In scikit-learn, there are similar things such as fit and fit_transform.
Fit and predict or labels_ are essential for clustering.
Thus fit_predict is just efficient code, and its result is the same as the result from fit and predict (or labels).
In addition, the fitted clustering model is used only once when determining cluster labels of samples.
I'm training a U-Net CNN in Keras and one of the image classes is significantly under-represented in the training dataset. I'm using a class-weighted loss function to account for this, but my worry is that with such a low batch size and so few instances of the class, only 1 in 10 batches is likely to include an image of this class. So even though the class is weighted, the network rarely sees it during training. Therefore, would it be bad practice to force the data generator to include at least one instance of this class while it's selecting random pieces of data for the batch? I could then avoid a situation where the majority of training is unable to access a class of data that's vital to overall task accuracy.
I would recommend three possible techniques to handle this kind of problem :
Equalize the probability of drawing an image of each class: for example this for PyTorch (I don't know which framework you are using, please provide it). (Easy, but least efficient)
Adapt the loss by giving more weight to under-represented classes (also easy; it should give roughly the same result as the previous method, so start with whichever of the two is easier to implement)
Do some data augmentation (harder, but nowadays a lot of libraries provide efficient ways to do this)
EDIT : Sorry, did not see for Keras. A few useful links: for data augmentation, class balancing and loss adaptation
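The idea from the question, forcing at least one rare-class sample into every batch, can be sketched independently of the framework as an index generator. This is only an illustration under stated assumptions: the function name batches_with_minority is hypothetical, and it assumes one label per image that says whether the image contains the rare class. Note that it changes how often the network sees the rare class, so it interacts with any class weighting already applied in the loss.

```python
import numpy as np


def batches_with_minority(labels, minority_class, batch_size, n_batches, seed=0):
    """Yield index batches that each contain at least one minority-class sample."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    minority_idx = np.flatnonzero(labels == minority_class)
    if minority_idx.size == 0:
        raise ValueError("no samples of the minority class")
    all_idx = np.arange(labels.size)
    for _ in range(n_batches):
        # Draw a plain random batch first.
        batch = rng.choice(all_idx, size=batch_size, replace=False)
        if not np.any(labels[batch] == minority_class):
            # Overwrite one random slot with a random minority-class sample.
            batch[rng.integers(batch_size)] = rng.choice(minority_idx)
        yield batch


# Toy labels: 95 images of the common class, 5 of the rare class.
labels = np.array([0] * 95 + [1] * 5)
batches = list(
    batches_with_minority(labels, minority_class=1, batch_size=4, n_batches=50)
)
```

The yielded index arrays can then be used inside a data generator to pick the images and masks for each batch.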
I have implemented an ML model using the naive Bayes algorithm, and I want to add incremental learning to it. The issue I am facing is this: when I train my model, preprocessing generates 1500 features. A month later, using a feedback mechanism, I want to train the model on new data that might contain new features, possibly fewer or more than the 1500 of my previous dataset. If I use fit_transform to get the new features, my existing feature set gets lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray() #replaces my older feature set
classifier = GaussianNB()
classifier.partial_fit(X,y)
#does not fit because the size of feature set count is not equal to previous feature set count
You could use just transform() for the CountVectorizer() and then partial_fit() for Naive-Bayes like the following for the incremental learning. Remember, transform extracts the same set of features, which you had learned using the training dataset.
X = cv.transform(corpus).toarray()  # GaussianNB needs a dense array
classifier.partial_fit(X,y)
But you cannot rebuild the features from scratch and then continue the incremental learning. The number of features needs to stay consistent for any model to learn incrementally.
If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means creating a new model from all the available data. You could adopt this if your dataset is small enough to keep in memory.
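To make the transform-then-partial_fit route concrete, here is a minimal sketch on toy text. The corpora and labels are invented for illustration. It also passes classes on the first partial_fit call, which scikit-learn requires.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for this example.
old_corpus = ["good product", "bad service", "good service", "bad product"]
old_y = np.array([1, 0, 1, 0])
new_corpus = ["excellent product", "terrible service"]  # contains unseen words
new_y = np.array([1, 0])

# Learn the vocabulary once, on the original data.
cv = CountVectorizer()
X_old = cv.fit_transform(old_corpus).toarray()

clf = GaussianNB()
# The first partial_fit call must be told every class that can ever occur.
clf.partial_fit(X_old, old_y, classes=np.array([0, 1]))

# Later: transform() reuses the learned vocabulary, so the width stays the same.
# Words that were not in the original vocabulary are silently ignored.
X_new = cv.transform(new_corpus).toarray()
clf.partial_fit(X_new, new_y)
```

The cost of this approach is visible in X_new: "excellent" and "terrible" contribute nothing, because they were not part of the vocabulary learned from the old corpus.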
You cannot with CountVectorizer. You will need to fix the number of features for partial_fit() in GaussianNB.
However, you can use a different preprocessor (in place of CountVectorizer) which maps the inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which is recommended by the scikit-learn authors for exactly the scenario you mentioned. When initializing it, you need to specify the number of features you want. In most cases, the default value is large enough to avoid collisions between the hashes of different words, but you may try experimenting with different numbers. Try it and check the performance. If it is not on par with CountVectorizer, then you can do what @AI_Learning suggests and build a new model on the whole data (old + new).
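A minimal sketch of the HashingVectorizer variant follows, again on invented toy text. The value n_features=2**10 is an assumption chosen only so that the dense arrays GaussianNB needs stay small; the default is much larger and would be expensive to densify.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for this example.
old_corpus = ["good product", "bad service", "good service", "bad product"]
old_y = np.array([1, 0, 1, 0])
new_corpus = ["excellent product", "terrible service"]  # contains unseen words
new_y = np.array([1, 0])

# HashingVectorizer is stateless: no fit step, and the output width is fixed.
hv = HashingVectorizer(n_features=2**10)

clf = GaussianNB()
X_old = hv.transform(old_corpus).toarray()
clf.partial_fit(X_old, old_y, classes=np.array([0, 1]))

# New words are hashed into the same fixed-width space, so partial_fit still works
# and the new words are not dropped.
X_new = hv.transform(new_corpus).toarray()
clf.partial_fit(X_new, new_y)

prediction = clf.predict(hv.transform(["good product"]).toarray())
```

The trade-off is that hashed columns cannot be mapped back to words, and a small n_features increases the chance that different words collide in the same column.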