I am scraping approximately 200,000 websites, looking for certain types of media posted on the websites of small businesses. I have a pickled LinearSVC that I've trained to predict the probability that a link found on a web page contains media of the type I'm looking for, and it performs rather well (overall accuracy around 95%). However, I would like the scraper to periodically update the classifier with new data as it scrapes.
So my question is, if I have loaded a pickled sklearn LinearSVC, is there a way to add in new training data without re-training the whole model? Or do I have to load all of the previous training data, add the new data, and train an entirely new model?
You cannot add data to an SVM and get the same result as if the data had been in the original training set. You can either retrain on the extended training set, starting from the previous solution (which should be faster), or train on the new data only and diverge completely from the previous solution.
There are only a few models that can do what you are asking for here, for example Ridge Regression or Linear Discriminant Analysis (and their kernelized counterparts, Kernel Ridge Regression and Kernel Fisher Discriminant, or their "extreme" counterparts, ELM and EEM), which have the property of being able to add new training data "on the fly".
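As a side note, a minimal sketch of one common workaround in scikit-learn, assuming incremental updates matter more than an exact SVM solution: SGDClassifier with hinge loss trains a linear SVM-style model and exposes partial_fit, so the scraper could update it batch by batch (the feature arrays below are placeholders):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Linear SVM-style model trained with stochastic gradient descent;
# unlike LinearSVC it supports incremental updates via partial_fit.
clf = SGDClassifier(loss="hinge")

# Initial training: all classes must be declared on the first call
X_initial = np.random.rand(100, 20)       # placeholder link features
y_initial = np.random.randint(0, 2, 100)  # placeholder labels
clf.partial_fit(X_initial, y_initial, classes=np.array([0, 1]))

# Later, as the scraper labels new links, update without a full retrain
X_new = np.random.rand(10, 20)
y_new = np.random.randint(0, 2, 10)
clf.partial_fit(X_new, y_new)
```

Note that this is not equivalent to retraining LinearSVC on the full dataset; it is an online approximation that trades exactness for the ability to learn on the fly.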
The dataset contains:
15000 observations/rows
3000 features/columns
Can I train a machine learning model on this dataset?
Yes, you can apply an ML model, but before doing so you need to understand your problem statement and the features available in the dataset. If you have a big dataset, start by analyzing a smaller sample of it to understand what your data is telling you.
That is why population and sampling come into practical use.
You should also check whether the accuracy on the training set and the test set are comparable; if not, your model is memorizing instead of learning, and this is where regularization in machine learning comes into the picture.
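A rough sketch of that train/test comparison, with synthetic data standing in for the 15000 x 3000 dataset described above (the model and all names here are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 15000 rows x 3000 columns
X, y = make_classification(n_samples=15000, n_features=3000,
                           n_informative=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))
# A large train/test gap suggests memorization; stronger regularization
# (here, a smaller C) is one standard remedy.
```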
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
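A minimal sketch of that sanity check using scikit-learn's cross_val_score, with synthetic data as a stand-in for the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

# 5-fold cross-validation: fold scores far below the training accuracy
# indicate the model does not generalize to unseen data.
scores = cross_val_score(LinearSVC(max_iter=5000), X, y, cv=5)
print("fold accuracies:", scores)
print("mean: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```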
I'm wondering whether the result on the test set is ever used to optimize a model's weights. I'm trying to build a model, but the issue is that I don't have much data, because the subjects are medical research patients. The number of patients in my case is limited (61), and I have 5 feature vectors per patient. What I tried is to create a deep learning model while excluding one subject, using the excluded subject as the test set. My problem is that there is large variability in subject features, and my model fits the training set (60 subjects) well but not the one excluded subject.
So I'm wondering whether the test set (in my case the excluded subject) could be used in some way to help the model converge toward classifying the excluded subject correctly?
You should not use the test data of your dataset in your training process. If your training data is not enough, one approach used a lot these days (especially for medical images) is data augmentation, so I highly recommend using this technique in your training process. "How to use Deep Learning when you have Limited Data" is a good tutorial about data augmentation.
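The linked tutorial focuses on images; for feature vectors like the ones described in the question, one simple and purely illustrative augmentation idea is jittering each sample with small Gaussian noise. This sketch is an assumption, not the tutorial's method:

```python
import numpy as np

def augment_with_noise(X, y, copies=5, noise_scale=0.01, seed=0):
    """Return X and y extended with `copies` noise-jittered versions
    of every sample (labels are repeated unchanged)."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(copies):
        X_parts.append(X + rng.normal(0.0, noise_scale, size=X.shape))
        y_parts.append(y)
    return np.vstack(X_parts), np.concatenate(y_parts)

# 60 training subjects x 5 feature vectors each, as in the question;
# 16 features per vector is a placeholder
X_train = np.random.rand(300, 16)
y_train = np.random.randint(0, 2, 300)
X_big, y_big = augment_with_noise(X_train, y_train)
print(X_big.shape, y_big.shape)  # (1800, 16) (1800,)
```

The noise scale would have to be tuned carefully so the jittered samples stay plausible for the domain.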
No, you shouldn't use your test set for training; that invites overfitting. If you follow cross-validation principles, you need to split your data into three datasets: a training set, which you'll use to train your model; a validation set, to test different values of your hyperparameters; and a test set, to evaluate your final model. If you use all of your data for training, your model will obviously overfit.
One thing to remember: deep learning works well when you have large and very rich datasets.
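A minimal sketch of the three-way split described above, using two calls to scikit-learn's train_test_split (synthetic data as a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First carve off the test set, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```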
I have a real-time feed of patient health data that I connect to with Python. I want to run some sklearn algorithms over this data feed so that I can predict in real time whether someone is going to get sick. Is there a standard way to connect real-time data to sklearn? I have traditionally had static datasets and never an incoming stream, so this is quite new to me. If anyone has general rules/processes/tools for this, that would be great.
With most algorithms, training is slow and predicting is fast. It is therefore better to train offline using training data, and then use the trained model to predict each new case in real time.
Obviously you might decide to train again later if you acquire more or better data. However, there is little benefit in retraining after every case.
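A rough sketch of that train-offline / predict-online pattern; get_next_record() is a hypothetical stand-in for the real-time feed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Offline: train once on historical data (synthetic placeholder here)
X_hist, y_hist = make_classification(n_samples=5000, n_features=10,
                                     random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_hist, y_hist)

def get_next_record():
    """Hypothetical stand-in for the live patient-data feed."""
    X_live, _ = make_classification(n_samples=5, n_features=10, random_state=1)
    yield from X_live

# Online: score each incoming record as it arrives
for record in get_next_record():
    risk = model.predict_proba(record.reshape(1, -1))[0, 1]
    print("predicted risk of illness: %.2f" % risk)
```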
It is feasible to train the model on a static dataset and use it to predict classifications for incoming data. Retraining the model with each new set of patient data, not so much. That would also break the train/test methodology for evaluating an ML model.
Trained models can be saved to a file and imported into the code used for real-time prediction.
In Python's scikit-learn, this is done via the pickle module.
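A minimal sketch of that save/load round trip with pickle (joblib.dump/load is a common alternative for models holding large numpy arrays):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Training script: persist the fitted model
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Real-time prediction script: load it and predict
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X[:3]))
```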
In R, a model can be saved to an .rds object with saveRDS.
Yay... my first time answering an ML question!
I use Gaussian process regression in Python. I have big training data and am trying to predict on test data. The training data will not vary, but the test data will. My question is whether it is possible to save the results of training so that, whenever new test data comes in, I can quickly predict its targets without retraining all over again. I would appreciate any help.
Thanks,
Jay
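A minimal sketch of one way to do this, assuming scikit-learn's GaussianProcessRegressor and joblib for persistence: fit once on the fixed training data, save the fitted model, and reload it later to predict on new test data without retraining (the arrays below are placeholders):

```python
import joblib
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Fit once on the (fixed) training data and save the fitted model
X_train = np.random.rand(200, 3)
y_train = np.sin(X_train.sum(axis=1))
gpr = GaussianProcessRegressor().fit(X_train, y_train)
joblib.dump(gpr, "gpr.joblib")

# Later, whenever new test data arrives: load and predict, no refit needed
gpr_loaded = joblib.load("gpr.joblib")
X_test = np.random.rand(5, 3)
y_pred, y_std = gpr_loaded.predict(X_test, return_std=True)
```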
I'm building a naive Bayes classifier using scikit-learn, and so far things are going well when I have a fixed body of data to train on. However, for the particular project I'm working on, there will be new data coming in every day that ideally would become part of the training set.
I'm aware that you can pickle the classifier to store it for later use, but is there any way to "update" the classifier with new data?
Re-training the classifier from scratch every day is obviously an option, but that would require drawing a lot of historical data each time, for a growing time period.
Use the partial_fit method on the naive Bayes estimator.
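A minimal sketch of that usage with MultinomialNB (synthetic placeholder data): all classes must be passed on the first partial_fit call, after which each day's batch updates the model in place without reloading history.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

# Day 1: initial batch; declare every class up front
X_day1 = np.random.randint(0, 5, size=(100, 20))
y_day1 = np.random.randint(0, 2, size=100)
clf.partial_fit(X_day1, y_day1, classes=np.array([0, 1]))

# Each following day: update with only the new data
X_day2 = np.random.randint(0, 5, size=(30, 20))
y_day2 = np.random.randint(0, 2, size=30)
clf.partial_fit(X_day2, y_day2)
```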