How to clean a large image dataset for deep learning purposes? - python

I have a large image dataset with 477 classes (about 500,000 images). Each class contains some irrelevant images, so when a model is trained on it the accuracy is not acceptable. Given the number of classes, cleaning the dataset manually would take far too long. Is there any way to remove such images automatically (e.g., a machine learning method or algorithm)?

I believe that, for now, the most reliable way to clean image datasets is manual inspection. Still, some techniques can help. Services like Azure and Amazon ML offer data-cleaning tools, although I don't know whether they apply to images (https://learn.microsoft.com/en-us/azure/machine-learning/team-data-science-process/prepare-data). There are certainly companies with well-developed pipelines for this.
Maybe you can get inspired by this paper: https://stefan.winklerbros.net/Publications/icip2014a.pdf

One possible approach is to use a classifier to remove unwanted images from your dataset, but this is only worthwhile for large datasets and is not as reliable as manual cleansing. For example, an SVM classifier can be trained per class and used to flag images that do not fit that class. More details will be added after testing this method.


CNN feature extraction having multiple column classes

I have a dataset consisting of power signals, and the targets are multiple house appliances that might be on or off. I only need to do feature extraction on the signals using a CNN and then save the resulting dataset as a CSV file to use with another machine learning method.
I have used a CNN before for classification on signals with 6 classes. However, I am a bit confused and need your help. I have two questions (they might be stupid, and I'm sorry):
Do I need the target variable in order to do feature extraction?
The shape of my dataset is, for example, 40000x100. I need the extracted dataset (the features learned by the CNN) to have the same number of rows (i.e., 40000). How can I do that?
I know the answer might be simpler than I think, but at the moment I feel quite lost.
I would appreciate any help.
Thanks!
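A minimal sketch of the usual approach, assuming Keras (layer sizes, names, and the small toy shapes are illustrative): train a CNN with the targets, then re-use the layers up to an intermediate layer as a feature extractor. Calling predict on that sub-model yields exactly one feature row per input row.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the 40000x100 signal matrix (smaller here).
X = np.random.rand(500, 100, 1).astype("float32")
y = np.random.randint(0, 6, size=500)  # appliance labels

# Small 1D CNN built with the functional API so the
# intermediate "features" tensor can be re-used directly.
inputs = tf.keras.Input(shape=(100, 1))
x = tf.keras.layers.Conv1D(16, 5, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
feats = tf.keras.layers.Dense(32, activation="relu", name="features")(x)
outputs = tf.keras.layers.Dense(6, activation="softmax")(feats)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=1, batch_size=64, verbose=0)

# Feature extractor: same inputs, intermediate layer as output.
extractor = tf.keras.Model(inputs, feats)
features = extractor.predict(X, verbose=0)
print(features.shape)  # one feature row per input row
```

This also answers the first question: the targets are needed to *train* the feature-learning network, but not when extracting features afterwards. The `features` array can then be written out with `np.savetxt` or pandas.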

Tensorflow: should I crop objects from the images for better accuracy?

I am new to TensorFlow and I could not find an answer to my question.
I am trying to make a simple program that recognises the type of van in a picture. I downloaded about 100 pictures per category for my dataset.
My question is: should I crop the pictures so only the van is visible?
Or should I use the original pictures with the background for better accuracy?
The short answer is yes, cropping will help, but there is a lot more to consider. For example, how will the images look when the model is actually used? Will someone manually crop images before running the model, or will the photos come straight off a cell phone via an app? A core concept of machine learning is to keep the images in the production environment as close to the training data as possible, so that performance in production doesn't degrade.
If you're just trying to learn, I would highly recommend building a network on the MNIST or LEGO bricks dataset before trying your own images; if you get stuck on either, there are a lot of great resources available :). Also, consider setting aside 10 images as a test set so that you can evaluate model performance. Third, TensorFlow has a built-in image data generator (ImageDataGenerator), which can help a lot on a small dataset like this: it can scale, rotate, flip, and zoom your images, which often produces large improvements in model accuracy.
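A minimal sketch of that augmentation setup, assuming `tf.keras`'s `ImageDataGenerator` (the image sizes and parameter values here are illustrative, not tuned):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Toy batch of 8 RGB "van" images, 64x64 (stand-in for real photos).
images = np.random.rand(8, 64, 64, 3).astype("float32")
labels = np.arange(8)

datagen = ImageDataGenerator(
    rotation_range=15,      # small random rotations, in degrees
    zoom_range=0.1,         # random zoom in/out by up to 10%
    horizontal_flip=True,   # random left/right mirroring
    width_shift_range=0.1,  # random horizontal shifts
)

# Each call yields a freshly augmented batch with the same shape.
batch_x, batch_y = next(datagen.flow(images, labels, batch_size=8))
print(batch_x.shape)
```

You would normally pass `datagen.flow(...)` straight to `model.fit`, so every epoch sees slightly different variants of the same 100 photos per class.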
Good luck!

Dealing with highly imbalanced datasets using Tensorflow Dataset and Keras Tuner

I have a highly imbalanced dataset (3% Yes, 87% No) of textual documents containing a title and an abstract feature. I have transformed these documents into tf.data.Dataset entities with padded batches. Now I am trying to train on this dataset using deep learning. With model.fit() in TensorFlow you have the class_weight parameter to deal with class imbalance; however, I am searching for the best hyperparameters with the keras-tuner library, and its hyperparameter tuners do not expose such an option. Therefore, I am looking for other ways of dealing with class imbalance.
Is there an option to use class weights in keras-tuner? To add, I am already using the precision/recall metric. I could also try a data resampling method, such as imblearn.over_sampling.SMOTE, but as this Kaggle post mentions:
It appears that SMOTE does not help improve the results. However, it makes the network learn faster. Moreover, there is one big problem: this method is not compatible with larger datasets. You have to apply SMOTE on embedded sentences, which takes way too much memory.
If you are looking for other methods to deal with imbalanced data, you may consider generating synthetic data using SMOTE or the ADASYN package. This usually works, and I see you have already considered it as an option to explore.
You could also change the evaluation metric to fbeta_score (a weighted F-score).
Or, if the dataset is large enough, you can try undersampling.
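As a further sketch, class weights themselves may still be usable: scikit-learn can compute balanced weights, which go to `model.fit(..., class_weight=...)`, and keras-tuner's `tuner.search()` generally forwards extra keyword arguments on to `model.fit`, so passing `class_weight` there may work too (worth verifying against your keras-tuner version; the 3%/97% split below mirrors the question):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Roughly the imbalance from the question: 3% positive labels.
y = np.array([0] * 97 + [1] * 3)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
class_weight = dict(zip([0, 1], weights))
print(class_weight)  # the rare class gets a much larger weight

# Usage (sketch): model.fit(X, y, class_weight=class_weight)
# and possibly:   tuner.search(X, y, class_weight=class_weight)
```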

Non-image data augmentation

I am looking for an algorithm and/or tutorial about data augmentation, but everything I find concerns image augmentation. Is it possible to do this with other kinds of datasets?
I am working on the Parkinson's dataset (https://archive.ics.uci.edu/ml/datasets/parkinsons) and want to create an example of data augmentation with Python. Is this possible, or should I use something like MNIST/FMNIST?
If you had access to the actual voice recordings, you could apply some augmentation techniques used in speech recognition and then re-extract the features such as fundamental frequency. However, since you're dealing directly with the features, augmentation is more tricky. It is possible to generate synthetic samples by interpolating between existing ones or adding noise, but since the features are highly correlated, you need a smart way of doing that (see this paper for a simple approach and this one for a more advanced technique). If you have a class imbalance problem, you can try simply over- or under-sampling.
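The interpolation-plus-noise idea can be sketched in a few lines, assuming a plain numpy feature matrix (the shapes and noise scale are illustrative; in practice you would augment within one class at a time, and the correlation caveat above still applies):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(X, n_new, noise_scale=0.05):
    """Create synthetic rows by interpolating between random pairs
    of existing samples and adding small feature-scaled noise."""
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    t = rng.random((n_new, 1))              # interpolation factor per row
    synthetic = X[i] + t * (X[j] - X[i])    # a point on the segment i -> j
    noise = rng.normal(0.0, noise_scale, synthetic.shape) * X.std(axis=0)
    return synthetic + noise

# Stand-in for the Parkinson's feature matrix (195 rows, 22 features).
X = rng.random((195, 22))
X_aug = augment(X, n_new=100)
print(X_aug.shape)
```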

Is it possible to train a model from multiple datasets, one for each class?

I'm pretty new to object detection. I'm using the TensorFlow Object Detection API, and I'm now collecting datasets for my project and using model_main.py to train my model.
I have found and transformed two quite large datasets of cars and traffic lights with annotations. And made two tfrecords from them.
Now I want to fine-tune a pretrained model, but I'm curious whether it will work. It is possible that an image, say "001.jpg", has annotated bounding boxes for cars (it comes from the car dataset), but a traffic light also visible in it would not be annotated. Will this hurt training? (There can be many of these "problematic" images.) How should I handle this? Is there any workaround? (I really don't want to annotate the images again.)
If it's a stupid question, I'm sorry. Thanks for any response; links about this problem would be best!
The short answer is yes, it might be problematic, but with some effort you can make it work.
If you have two urban datasets, and in one you only have annotations for traffic lights while in the second you only have annotations for cars, then each instance of a car in the first dataset will be learned as a false (background) example, and each instance of a traffic light in the second dataset will be learned as a false example.
The two possible outcomes I can think of are:
The model will not converge, since it tries to learn opposite things.
The model will converge but be domain-specific: it will only detect traffic lights on images from the domain of the first dataset, and cars only on images from the domain of the second.
In fact I tried doing so myself in a different setup, and got this outcome.
To learn your objective of detecting both traffic lights and cars no matter which dataset they come from, you'll need to modify your loss function. You need to tell the loss function which dataset each image comes from, and then only compute the loss on the corresponding classes (i.e., zero out the loss on the classes that do not correspond to it). Returning to our example: you only compute loss and backpropagate for traffic lights on the first dataset, and for cars on the second.
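The per-dataset masking can be sketched with plain arrays (the loss values and the two-class setup are illustrative; in a real detector this masking is applied inside the classification-loss term):

```python
import numpy as np

# Per-image, per-class losses for two classes: [car, traffic_light].
losses = np.array([
    [0.9, 1.2],   # image from the car dataset
    [0.4, 0.8],   # image from the traffic-light dataset
])

# One mask per source dataset: 1 where the class is annotated, else 0.
masks = np.array([
    [1.0, 0.0],   # car dataset: traffic lights unannotated -> no loss
    [0.0, 1.0],   # traffic-light dataset: cars unannotated -> no loss
])

masked = losses * masks             # zero out loss on unannotated classes
total = masked.sum() / masks.sum()  # average over the terms that count
print(total)
```

Only the masked `total` is backpropagated, so unannotated objects are never pushed toward the background class.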
For completeness, I will add that if resources are available, the better option is to annotate all classes on all datasets and avoid the suggested modification: by only backpropagating certain classes, you lose the benefit of genuine negative examples for the other classes.
