Generated data for transfer learning: CNN not progressing - python

I am trying to produce a simple CNN with data I have generated and have been struggling for a few days now. I simply cannot get it to fit the data at all. After reading online I am assuming there is a data issue somewhere, but I cannot find it. I have tried multiple combinations of data manipulation and model changes (more or fewer parameters) with no effect. The data going in looks fine to me; I've looked over it multiple times with nothing unusual coming up.
My outputs for the model are essentially flat: no increase at all in validation accuracy.
SOLVED
See Below

SOLVED
For those struggling like I was:
DATA
DATA
DATA
DATA
Better Data, More Data, smaller models, and more training.
I have 1000 real data examples, but I generated 30k synthetic examples. After pre-training on the synthetic examples, transfer learning gave me an immediate 87% accuracy.
Overall I highly recommend this method if you have little data and custom problems that you cannot find premade models for.
Check out my generated and real data below.


Update PyCaret Anomaly detection Model

I'm detecting anomalies in time series data using PyCaret. I take in the data on every call, detect anomalies, and return the result. Everything works fine, but to improve performance I'm planning to load the saved model, re-train it on less data (say, one day's worth instead of some 1000 days of data at once), and save the model again. Performance improves a lot this way, since it is training on much less data.
The problem is to update/re-train the model. I couldn't find any method to update the model.
Initially:
setup(dataframe)
model = create_model(model_name)
results = assign_model(model)
What I'm trying to do (try loading the model if already present):
setup(data_frame_new)
if model exists:
    retrain model
else:
    model = create_model(model_name)
save_model(model)
results = assign_model(model)
So now I have a trained model and new data; how can I integrate the two?
Is there any way to retrain the model? I couldn't find any documentation on that so far, or I might have overlooked it.
Please share your suggestions on how to achieve this.
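The load-or-create-then-save control flow described above can be sketched generically. This is a minimal stdlib illustration using pickle and a stand-in model object, not PyCaret's actual API; in a real pipeline, PyCaret's own save_model/load_model would take the place of the pickle calls:

```python
import os
import pickle

MODEL_PATH = "anomaly_model.pkl"  # hypothetical path

class StubModel:
    """Stand-in for a real anomaly model; counts how much data it has seen."""
    def __init__(self):
        self.n_seen = 0
    def fit(self, data):
        self.n_seen += len(data)
        return self

def load_or_create():
    # Load the saved model if present, otherwise start a fresh one.
    if os.path.exists(MODEL_PATH):
        with open(MODEL_PATH, "rb") as f:
            return pickle.load(f)
    return StubModel()

def train_and_save(new_data):
    model = load_or_create()
    model.fit(new_data)          # re-train on just the new (smaller) batch
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(model, f)
    return model

if os.path.exists(MODEL_PATH):  # start fresh for the demo
    os.remove(MODEL_PATH)

m1 = train_and_save(range(1000))  # first call: model created
m2 = train_and_save(range(10))    # later call: model loaded, topped up
```

Whether PyCaret's anomaly models actually support an incremental update step is exactly the open question here; the sketch only shows the surrounding persistence logic.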

CNN feature extraction having multiple column classes

I have a dataset consisting of power signals, and the targets are multiple household appliances that might be on or off. I only need to do feature extraction on the signal using a CNN and then save the dataset as a CSV file to use with another machine learning method.
I have used a CNN before for classification on signals consisting of 6 classes. However, I am a bit confused and need your help. I have two questions (they might be basic, and I'm sorry if so):
Do I need the target variable in order to do feature extraction?
The shape of my dataset is, for example, 40000x100. I need my extracted dataset (the features learned using the CNN) to have the same number of rows (i.e. 40000). How can I do that?
I know the answer might be simpler than I think, but at the moment I feel quite lost.
I would appreciate any help.
Thanks

Efficient way to compare effects of adding/removing multiple data-cleaning steps on the performance of deep learning model?

Somewhat of a beginner here at deep learning using Python & Stack Overflow.
I am currently working on something similar to a sentiment analysis of community posts using LSTM, and have been trying to add preprocessing steps to clean up the text data.
I have lots of ideas - say, 7 - for modifying/dropping certain data without sacrificing context that I think could improve my prediction accuracy, but I want to be able to see exactly how implementing one or some of these ideas can affect the prediction accuracy.
So is there a tool, statistical method, or technique that will drastically cut down the number of experiments (training the model + predicting on the test set) needed to see how "toggling on" one, two, or several of these preprocessing steps affects my prediction accuracy, instead of having to run something like 49 experiments and fill out the results in a 7x7 table? I have used the Taguchi method of design of experiments on a different kind of problem before, but I'm not sure it can be applied properly here, since the neural network will be trained in a completely different way depending on the data it is fed.
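One standard way to cut the experiment count is a screening design. A minimal sketch (illustrative only; the step names are hypothetical) of a one-factor-at-a-time plan with all-off/all-on baselines, which needs k+2 runs instead of 2^k, at the cost of ignoring interactions between steps:

```python
STEPS = ["lowercase", "strip_urls", "drop_stopwords", "lemmatize",
         "expand_contractions", "strip_emoji", "normalize_whitespace"]

def screening_plan(steps):
    """One-factor-at-a-time screening plan.

    Returns a list of step subsets to try: all-off, all-on, then each
    step toggled on alone. Needs len(steps) + 2 runs instead of the
    2 ** len(steps) runs of a full factorial, but cannot detect
    interactions between steps.
    """
    plan = [set(), set(steps)]       # the two baselines
    plan += [{s} for s in steps]     # each step alone
    return plan

plan = screening_plan(STEPS)
print(len(plan))  # 9 runs for 7 steps, vs 128 for the full factorial
```

A proper fractional factorial (e.g. Plackett-Burman) would capture more with similar run counts, but the idea is the same: train once per row of the plan, not once per subset.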
Thank you for any input and advice!

spaCy: train NER using multiprocessing

I am trying to train a custom NER model using spaCy. Currently I have more than 2k records for training; each text consists of more than 100 words, with at least two entities per record. I am running it for 50 iterations.
It takes more than 2 hours to train completely.
Is there any way to train using multiprocessing? Will it improve the training time?
Short answer... probably not
It's very unlikely that you will be able to get this to work for a few reasons:
The network being trained is performing iterative optimization
Without knowing the results from the batch before, the next batch cannot be optimized
There is only a single network
Any parallel training would be creating divergent networks...
...which you would then somehow have to merge
Long answer... there's plenty you can do!
There are a few different things you can try however:
Get GPU training working if you haven't
It's a pain, but can speed up training time a bit
It will dramatically lower CPU usage however
Try to use spaCy command line tools
The JSON format is a pain to produce but...
The benefit is you get a well optimised algorithm written by the experts
It can have dramatically faster / better results than hand crafted methods
If you have different entities, you can train multiple specialised networks
Each of these may train faster
These networks could be done in parallel to each other (CPU permitting)
Optimise your python and experiment with parameters
Speed and quality are very dependent on parameter tweaking (batch size, repetitions, etc.)
They also depend on your Python implementation providing the batches (make sure this is top notch)
Pre-process your examples
spaCy NER extraction requires a surprisingly small amount of context to work
You could try pre-processing your snippets to contain 10 or 15 surrounding words and see how your time and accuracy fare
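That pre-processing idea can be sketched as follows (purely illustrative; it assumes entity offsets fall on clean word boundaries, which a real pipeline would have to verify):

```python
def trim_context(text, ent_start, ent_end, window=10):
    """Keep only `window` words on each side of an entity span.

    text: the full training text; ent_start/ent_end: character offsets
    of the entity. Returns the trimmed text plus the entity's new
    offsets, so the annotation can be remapped onto the shorter snippet.
    """
    before = text[:ent_start].split()
    entity = text[ent_start:ent_end]
    after = text[ent_end:].split()

    prefix = " ".join(before[-window:])
    suffix = " ".join(after[:window])

    trimmed = " ".join(p for p in (prefix, entity) if p)
    new_start = len(prefix) + (1 if prefix else 0)
    new_end = new_start + len(entity)
    if suffix:
        trimmed += " " + suffix
    return trimmed, new_start, new_end

text = "the quick brown fox jumped over the lazy dog today"
snippet, s, e = trim_context(text, 16, 19, window=2)  # entity: "fox"
print(snippet)        # quick brown fox jumped over
print(snippet[s:e])   # fox
```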
Final thoughts... when is your network "done"?
I have trained networks with many entities on thousands of examples for longer than specified, and the long and short of it is: sometimes it just takes time.
However, roughly 90% of the increase in performance is captured in the first 10% of training.
Do you need to wait for 50 batches?
... or are you looking for a specific level of performance?
If you monitor the quality every X batches, you can bail out when you hit a pre-defined level of quality.
You can also keep old networks you have trained on previous batches and then "top them up" with new training to get to a level of performance you couldn't by starting from scratch in the same time.
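The bail-out idea can be sketched like this (toy stand-ins, not spaCy's API; `train_step` and `evaluate` represent whatever training iteration and quality metric you actually use):

```python
def train_with_early_bailout(train_step, evaluate, target=0.85,
                             check_every=5, max_iters=50):
    """Run training iterations, but stop once a quality target is hit.

    train_step(i) performs one training iteration; evaluate() returns
    the current quality score. Checking only every few iterations keeps
    the evaluation overhead low.
    """
    for i in range(1, max_iters + 1):
        train_step(i)
        if i % check_every == 0 and evaluate() >= target:
            return i  # bailed out early
    return max_iters

# Toy demo: quality climbs 0.05 per iteration, so 0.85 is first reached
# at iteration 17, and the first check at or after that is iteration 20.
score = {"q": 0.0}
stopped_at = train_with_early_bailout(
    lambda i: score.__setitem__("q", score["q"] + 0.05),
    lambda: score["q"],
)
print(stopped_at)  # 20
```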
Good luck!
Hi, I did the same kind of project, where I created a custom NER model using spaCy 3 and extracted 26 entities from a large dataset. It really depends on how you are passing your data. Follow the steps below; it might work on CPU:
Annotate your text files and save the annotations as JSON.
Convert your JSON files into the .spacy format, because this is the format spaCy accepts.
Now, the point to note is how you pass and serialize your .spacy data into spaCy Doc objects.
Passing all your JSON text at once will take more time in training, so split your data and pass it iteratively. Don't pass the consolidated data; split it.
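The splitting step can be sketched generically (pure-Python illustration; in a real pipeline each chunk would typically be serialized to its own .spacy file, e.g. via spaCy's DocBin):

```python
def chunks(examples, size):
    """Yield successive fixed-size chunks of the training examples."""
    for start in range(0, len(examples), size):
        yield examples[start:start + size]

examples = [f"record_{i}" for i in range(2000)]  # stand-in for 2k records
batches = list(chunks(examples, 500))
print(len(batches))  # 4 chunks of 500 records each
```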

Is it important to clean the test data?

For the training data, I did feature engineering and cleaned the data. Is it important to do the same with the test data?
I know some basic modifications like label encoding, dependent/independent feature split, etc.. are required in test data as well. But do we really need to CLEAN the test data before we do the predictions?
I can't answer you with a simple yes or no, so let me start with the data distribution across your train/test/dev sets.
According to Prof. Andrew Ng, the test and dev sets should come from the same distribution (YouTube), but the training set can come from a different distribution (check it here), and often that's a good thing to do.
Sometimes cleaning the training set is very useful, as is applying some basic operations to speed up training (like normalization, which is not cleaning). But we are talking about training data, which can and should have many thousands of examples, so sometimes you can't check and clean it manually, because it may not be worth it at all.
What do I mean? Well, let me show you an example:
Let's say you're building a cat classifier (cat or no-cat), and you have an accuracy of 90%, which means you have a 10% error rate.
After doing error analysis (check it here), you find that:
6% of your errors are caused by mislabeled images (no-cat images labeled as cat and vice versa).
44% are caused by blurry images.
50% are caused by images of big cats labeled as cats.
In this case, all the time you spend fixing the mislabeled images will improve your performance by 0.6% in the best-case scenario (because it's 6% of the whole 10% error), so it's NOT worth correcting the mislabeled data.
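The arithmetic behind that ceiling, spelled out with the numbers from the example above:

```python
total_error = 0.10  # 10% overall error rate (i.e. 90% accuracy)
error_shares = {"mislabeled": 0.06, "blurry": 0.44, "big_cats": 0.50}

# Best-case accuracy gain from fully fixing each error source:
# share of errors * total error rate = ceiling on absolute improvement.
ceilings = {k: share * total_error for k, share in error_shares.items()}

print(round(ceilings["mislabeled"], 4))  # 0.006 -> at most 0.6 points
print(round(ceilings["big_cats"], 4))    # 0.05  -> up to 5 points
```

This is why error analysis tells you where to spend your cleaning effort: the big-cats category offers over eight times the potential gain of the mislabeled images.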
I gave an example about mislabeled data, but in general I mean any type of cleaning and fixing.
BUT cleaning the data in the test set may be easier, and it should be done for both the test and dev sets if possible, because your test set reflects the performance of your system on real-world data.
The operations you mentioned in your question are not quite cleaning; they are used to speed up learning or make the data appropriate for the algorithm. Applying them depends on the shape and type of the data (images, voice recordings, words, ...) and on the problem you're trying to solve.
In the end, as an answer, I can tell you that:
The form and shape of the data should be the same in all three sets (so label encoding should be applied to the whole data, not just the training data, and also to the input data used for prediction, because it changes the shape of the output label).
The number of features should always be the same.
Any operation that changes the shape, form, or number of features of the data should be applied to every single sample you're going to use in your system.
It depends:
Normalizing the data: If you normalized your training data, then yes, normalize the test data in exactly the way you normalized the training data. But be careful that you do not re-tune any parameters you tuned on the training data.
Filling missing values: idem. Treat the test data as the training data but do not re-tune any of the parameters.
Removing outliers: probably not. The aim of the test set is to make an estimate about how well your model will perform on unseen data. So removing outliers will probably not be a good idea.
In general: only do things to your test data that you can/will also do on unseen data upon applying your model.
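The normalization point can be sketched minimally in pure Python (a stand-in for something like scikit-learn's StandardScaler): the statistics are computed once on the training data and reused on every other split, never re-fit:

```python
def fit_scaler(train):
    """Compute mean and standard deviation on the training data only."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def transform(data, mean, std):
    """Apply the *training* statistics to any split (train, test, unseen)."""
    return [(x - mean) / std for x in data]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [2.0, 6.0]

mean, std = fit_scaler(train)          # fit on train only
train_n = transform(train, mean, std)
test_n = transform(test, mean, std)    # reuse train stats; do NOT re-fit
```

Re-fitting the scaler on the test set would leak test statistics into the evaluation and make the estimate of real-world performance optimistic.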
