Using pretrained feature-extractor in pipeline - python

I'm new to scikit-learn, and can't find the answer to what I think is a common use-case. I have lots of unlabelled data, and only some labelled data. I want to first train a transformer for feature-extraction, then save that part of the pipeline somewhere. Then, I'd like to create a pipeline of the feature-extractor plus a classifier, which I'll train on labelled data.
But, I don't want the feature-extractor to re-fit on the new data -- I want it to keep its parameters from when I trained it on the unlabelled data.
Can somebody point me in the direction of how best to do this? Thank you.
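One possible way to sketch this, assuming the feature extractor is any fitted scikit-learn transformer (PCA is used here purely as a stand-in, and the arrays are synthetic): wrap the fitted transformer's transform method in a FunctionTransformer, so that fitting the pipeline trains only the classifier.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
X_unlabelled = rng.normal(size=(500, 10))  # synthetic stand-in for the large unlabelled set
X_labelled = rng.normal(size=(60, 10))     # synthetic stand-in for the small labelled set
y_labelled = (X_labelled[:, 0] > 0).astype(int)

# Step 1: fit the feature extractor once, on the unlabelled data only.
extractor = PCA(n_components=3).fit(X_unlabelled)
components_before = extractor.components_.copy()

# Step 2: wrap the fitted extractor's transform so that Pipeline.fit does not refit it.
frozen = FunctionTransformer(extractor.transform)

# Step 3: fitting the pipeline now trains only the classifier on the labelled data.
model = Pipeline([("features", frozen), ("clf", LogisticRegression())])
model.fit(X_labelled, y_labelled)

predictions = model.predict(X_labelled)
```

The fitted extractor can be saved with pickle after step 1 and loaded again before step 2, so the two training stages do not have to run in the same session.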

Related

scikit-learn LogisticRegression Classify another value

I'm new to Python and have to do a natural language processing task.
Using a Kaggle dataset, a sentiment classifier should be implemented in Python.
For this I'm using a DataFrame and LogisticRegression, as described in this article, and everything works fine.
Now I want to know whether it is possible to classify another string that is not in the dataset, so that I can experiment with the classifier interactively.
Is this possible?
Thank you!
You will have to manually run all the preprocessing on your new data, then predict.
That is:
First, run the "Data Cleaning" step and any other functions you called that edit the data.
Then run the "Create a bag of words" step on the cleaned text.
Only then use the fitted LR model to predict on this preprocessed data.
Yes, this is possible.
To make this more modular, you can create a function and pass the input string to it for preprocessing. This reduces code redundancy, because the training data can be passed through the same function.
Once that is done, you need to create the bag of words for the test sentence.
Then you can call the predict function of the trained LR model to get the output.
Thank You.
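The steps described in the answers above could look roughly like this. It is only a sketch: the tiny training set is made up, and clean_text is a hypothetical stand-in for whatever cleaning the article performs. The important detail is that the new string goes through the already-fitted vectorizer with transform, not fit_transform.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def clean_text(text):
    """Hypothetical cleaning step: lowercase, keep letters and whitespace only."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


# Tiny made-up training set standing in for the Kaggle data.
train_texts = [
    "I love this movie",
    "What a great film",
    "I hate this movie",
    "What a terrible film",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Fit the bag of words and the classifier on the cleaned training texts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([clean_text(t) for t in train_texts])
clf = LogisticRegression().fit(X_train, train_labels)


def classify(text):
    """Apply the same cleaning and the already-fitted vectorizer, then predict."""
    features = vectorizer.transform([clean_text(text)])  # transform, not fit_transform
    return int(clf.predict(features)[0])


result = classify("I love this great film")
```

Because classify reuses clean_text, the training data and any new string are guaranteed to go through identical preprocessing.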

How to classify unlabelled data?

I am new to machine learning. I am trying to build a classifier that classifies text as having a URL or not having a URL. The data is not labelled; I just have textual data. I don't know how to proceed with it. Any help or examples are appreciated.
Since it's text, you can use the bag-of-words technique to create vectors.
You can use cosine similarity to cluster texts of a similar type.
The cluster assignments then give you a labeled training set.
Then use a classifier, the choice of which depends on the number of clusters.
If you have two clusters, a binary classifier like logistic regression would work.
If you have multiple classes, you need to train a multinomial logistic regression model, or train multiple logistic models using the one-vs-rest technique.
Lastly, you can test your model using k-fold cross-validation.
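A minimal sketch of that workflow, with made-up texts. Two choices here are assumptions, not part of the answer: KMeans is run on L2-normalized count vectors as a stand-in for clustering by cosine similarity, and the number of clusters is fixed at two. The cluster ids are only pseudo-labels and need not correspond to the categories you actually care about.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import normalize

# Made-up texts: six that mention a URL-like pattern and six that do not.
texts = [
    "see http example com page",
    "open http example com link",
    "visit http example com today",
    "read http example com news",
    "check http example com site",
    "try http example com demo",
    "the weather is nice",
    "the weather is cold",
    "the weather is warm",
    "the weather is rainy",
    "the weather is sunny",
    "the weather is windy",
]

# Bag of words, then unit-length rows so Euclidean distance tracks cosine similarity.
X = normalize(CountVectorizer().fit_transform(texts))

# Cluster the vectors; the cluster ids serve as pseudo-labels.
pseudo_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Train a binary classifier on the pseudo-labels and evaluate with 3-fold cross-validation.
scores = cross_val_score(LogisticRegression(), X, pseudo_labels, cv=3)
```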
You cannot train a classifier with unlabeled data. You need labeled examples. There are services that will label it for you, but it might be simpler for you to do it by hand (I assume you can go through one per minute).
Stack Overflow is for programming; this question would be better suited to, say, Cross Validated. Maybe they'll have better suggestions than I do.
After you've labeled the data, there's a lot of info on the web on this subject - for example, this blog is a good place to start if you already have some grip on the issue.
Good luck!

How to launch a Machine Learning model?

First of all, thank you for taking the time to read my question. I have built a machine learning model with a dataset (the famous one about cancer) and I want to know how to predict the results for new variables. I think that I have to keep training on the data (often) to have more accurate data to use in my prediction, but for predicting new data, is it as simple as changing the test data (y variable) to the new data?
Thank you so much for your time; any help would be appreciated.
You are probably using the SVC class from sklearn.svm.
After fitting the model with the fit method you can predict new data with the predict method. See here.
By the way: For Support Vector Machines you don't have to fit your data multiple times. Maybe you are confusing that with neural networks.
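As a rough illustration of fit followed by predict, assuming the dataset in question is the breast cancer dataset bundled with scikit-learn (the question does not say which one it is). Note that new observations are passed in as features (X); the labels (y) are what predict returns.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the "famous" cancer dataset is scikit-learn's breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)

# Hold some rows back to play the role of new, unseen data.
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.2, random_state=0)

clf = SVC()
clf.fit(X_train, y_train)  # fit once on the labelled training data

# New data goes in as features; the model returns the predicted labels.
predicted = clf.predict(X_new)
```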
If you mean changing the number of features in your test data, you cannot do that.
The number of features has to be the same in the training and test sets.
However, if your test data contain a category of a categorical variable that was not in the training data, then it is better to train your model with one extra category such as "NONE" or "Others" for all your features.
This way, when you encounter a new category in your test data, you change it to "NONE" or "Others" and predict with your trained model.
This way it will not break your model.
I hope I understand your question correctly.
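The catch-all category idea could be sketched like this. The colour values and the replace_unseen helper are made up for illustration, and OneHotEncoder is only one possible encoder.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training values for one categorical feature, including the catch-all "Others".
train_colors = np.array([["red"], ["green"], ["Others"]])
encoder = OneHotEncoder().fit(train_colors)
known = set(encoder.categories_[0])


def replace_unseen(values, known, fallback="Others"):
    """Hypothetical helper: map any value not seen in training to the catch-all."""
    return np.array([[v if v in known else fallback] for v in values])


# "purple" was never seen in training, so it is encoded as "Others".
test_colors = ["green", "purple"]
encoded = encoder.transform(replace_unseen(test_colors, known)).toarray()
```

Without the replacement step, OneHotEncoder with its default settings raises an error on the unseen value "purple".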

add training data to existing LinearSVC

I am scraping approximately 200,000 websites, looking for certain types of media posted on the websites of small businesses. I have a pickled LinearSVC, which I've trained to predict the probability that a link found on a web page contains media of the type that I'm looking for, and it performs rather well (overall accuracy around 95%). However, I would like the scraper to periodically update the classifier with new data as it scrapes.
So my question is, if I have loaded a pickled sklearn LinearSVC, is there a way to add in new training data without re-training the whole model? Or do I have to load all of the previous training data, add the new data, and train an entirely new model?
You cannot add data to an SVM and achieve the same result as if you had added it to the original training set. You can either retrain with the extended training set, starting from the previous solution (which should be faster), or train on the new data only and completely diverge from the previous solution.
There are only a few models that can do what you would like to achieve here, for example Ridge Regression or Linear Discriminant Analysis (and their kernelized counterparts, Kernel Ridge Regression or Kernel Fisher Discriminant, or their "extreme" counterparts, ELM or EEM), which have the property of being able to add new training data "on the fly".
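To illustrate why ridge regression has this property, here is a small sketch with a hypothetical IncrementalRidge class (not a scikit-learn estimator, and without an intercept). The ridge solution depends on the data only through the sums X^T X and X^T y, so new rows can be added to those sums without revisiting old rows.

```python
import numpy as np


class IncrementalRidge:
    """Hypothetical helper: ridge regression (no intercept) kept as running sums."""

    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha
        self.xtx = np.zeros((n_features, n_features))  # running X^T X
        self.xty = np.zeros(n_features)                # running X^T y

    def partial_fit(self, X, y):
        # New rows only add to the running sums; old rows are never revisited.
        self.xtx += X.T @ X
        self.xty += X.T @ y
        return self

    @property
    def coef_(self):
        # Solve (X^T X + alpha * I) w = X^T y for the weights w.
        n = self.xtx.shape[0]
        return np.linalg.solve(self.xtx + self.alpha * np.eye(n), self.xty)


rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(100, 5)), rng.normal(size=100)
X_new, y_new = rng.normal(size=(20, 5)), rng.normal(size=20)

# Update in two steps versus fitting once on all the data.
incremental = IncrementalRidge(5).partial_fit(X_old, y_old).partial_fit(X_new, y_new)
batch = IncrementalRidge(5).partial_fit(
    np.vstack([X_old, X_new]), np.concatenate([y_old, y_new])
)
same = np.allclose(incremental.coef_, batch.coef_)
```

The two-step update and the single batch fit produce the same weights up to floating-point error, which is exactly what an SVM cannot offer.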

predicting new non-standardized data with classifier trained on standardized data

I have some data with, say, L features. I have standardized them using StandardScaler() by doing a fit_transform on X_train. Now, while predicting, I did clf.predict(scaler.transform(X_test)). So far so good. Now, if I want to pickle the model for later reuse, how would I go about predicting on new data in the future with this saved model? The new (future) data will not be standardized, and I didn't pickle the scaler.
Is there anything else that I have to do before pickling the model the way I am doing it right now (to be able to predict on non-standardized data)?
reddit post: https://redd.it/4iekc9
Thanks. :)
To solve this problem you should use a pipeline. The first stage there is scaling, and the second one is your model. Then you can pickle the whole pipeline and have fun with your new data.
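A minimal sketch of that suggestion, using synthetic data and LogisticRegression as a placeholder for whatever model clf actually is. The pipeline stores the fitted scaler together with the model, so raw, non-standardized data can be passed straight to predict after unpickling.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 4))  # synthetic, non-standardized
y_train = (X_train[:, 0] > 50.0).astype(int)

# Stage 1 scales, stage 2 classifies; both are fitted together.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# Pickle the whole pipeline (pickle.dump with a file object would write it to disk).
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# Future data is passed in raw; the restored pipeline applies the stored scaling itself.
X_future = rng.normal(loc=50.0, scale=10.0, size=(5, 4))
predictions = restored.predict(X_future)
```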
