Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am working on a text classification problem. I have a huge amount of data, and when I try to fit it to the machine learning model I get a memory error. Is there any way to fit the data in parts to avoid the memory error?
Additional information
I am using the LinearSVC model.
I have training data of 1.1 million rows.
I have vectorized the text data using TF-IDF.
The shape of the vectorized data that has to be fitted into the model is (1121063, 4235687).
Or is there any other way out of this problem?
Unfortunately, I don't have any reproducible code for the same.
Thanks in advance.
The simple answer is to not use what I assume is the scikit-learn implementation of LinearSVC, and instead use an algorithm or implementation that supports training in batches. The most common of these are neural networks, but several other algorithms exist. In scikit-learn, look for classifiers with a partial_fit method, which lets you fit the classifier in batches. See e.g. this list.
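To make the partial_fit idea concrete, here is a minimal sketch. It assumes SGDClassifier with loss="hinge" is an acceptable stand-in for LinearSVC (it is a linear SVM trained by stochastic gradient descent, not the same solver), and the toy texts, labels and batch size are made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Toy corpus standing in for the real 1.1M-row dataset (hypothetical data).
texts = ["cheap pills buy now", "meeting at noon", "buy cheap watches",
         "lunch with the team", "win money now", "project status report"] * 20
y = np.array([1, 0, 1, 0, 1, 0] * 20)

# The TF-IDF output is a sparse matrix, so slicing rows is cheap.
X = TfidfVectorizer().fit_transform(texts)

# loss="hinge" gives a linear SVM objective optimized with SGD.
clf = SGDClassifier(loss="hinge", random_state=0)

# partial_fit needs the full set of classes on the first call.
classes = np.unique(y)
batch_size = 30
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

predictions = clf.predict(X[:5])
```

Only one batch is handed to the classifier at a time, so the batch size can be tuned to whatever fits in memory.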
You could also try what the documentation of sklearn.svm.SVC suggests (the second part; the first is using LinearSVC, which you already did):
For large datasets consider using :class:'sklearn.svm.LinearSVC' or
:class:'sklearn.linear_model.SGDClassifier' instead, possibly after a :class:'sklearn.kernel_approximation.Nystroem' transformer.
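If you check SGDClassifier, you can set the parameter warm_start=True so that, when you iterate through your dataset, each call to fit starts from the previous solution instead of losing its state:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(warm_start=True)
for X_batch, y_batch in batches:  # batches: your own iterable of (features, labels) chunks
    clf.fit(X_batch, y_batch)  # with warm_start=True, fit starts from the previous coefficients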
Additionally, you could reduce the dimensionality of your dataset by removing some words from your TF-IDF model. Check the max_df and min_df parameters: max_df drops terms that appear in more documents than the threshold, and min_df drops terms that appear in fewer. Each can be given as a proportion of documents (float) or as an absolute count (int).
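As a small illustration of how min_df and max_df shrink the vocabulary, here is a sketch on a made-up four-document corpus (the documents and thresholds are only examples, not recommended values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
    "the cat chased the bird",
]

# No pruning: every token of two or more characters is kept.
full = TfidfVectorizer().fit(docs)

# min_df=2   -> drop terms found in fewer than 2 documents (int = absolute count)
# max_df=0.9 -> drop terms found in more than 90% of documents (float = proportion)
pruned = TfidfVectorizer(min_df=2, max_df=0.9).fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
print(len(full_vocab), pruned_vocab)  # 12 ['bird', 'cat', 'on', 'sat']
```

Here "the" is removed by max_df because it occurs in every document, and the words that occur in only one document are removed by min_df.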
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I am a beginner in NLP and I have some questions about a classification task. I have a dataset in a data frame with two columns: the first one contains the texts (strings) and the second one contains the label of each text. Let's call the first column x_train and the second one y_train. In order to apply an MLP I could use this code:
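from sklearn.feature_extraction.text import TfidfVectorizer
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(input_text)  # input_text: the texts used to build the vocabulary
Train_X_Tfidf = Tfidf_vect.transform(x_train)
Test_X_Tfidf = Tfidf_vect.transform(x_test)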
I want to try the Word2Vec model, but I don't know how to transform my training data into numbers using Word2Vec so that I can then apply the MLP model again. I would be grateful if you could help me.
According to the documentation from sklearn,
max_features : int, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This means that, based on your texts, the TfidfVectorizer will build a vocabulary containing the max_features most frequent tokens (words or characters). For example, at the word level with max_features = 10, it will take the 10 most frequent words in your texts as its vocabulary. How many features to use depends on the number of distinct words in your texts; 10000 is a common choice, though.
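A tiny sketch of the effect, using a made-up corpus in which the word counts are unambiguous (apple 4, banana 3, cherry 2, date 1):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "apple apple apple banana",
    "apple banana cherry",
    "banana cherry date",
]

# Keep only the 2 terms with the highest total count across the corpus.
vectorizer = TfidfVectorizer(max_features=2)
X = vectorizer.fit_transform(corpus)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms, X.shape)  # ['apple', 'banana'] (3, 2)
```

Every document is still transformed, but it is described only by the retained terms, so the matrix has max_features columns at most.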
As for your second question, aside from Gensim's Word2Vec, you could try the Keras Embedding layer. A good tutorial is posted on the TensorFlow website here.
What do you mean by "transform my training data into number by using Word2vec"? If you are referring to obtaining an embedded representation of a given text, you can use Gensim's Word2Vec. In the documentation you will find some examples of how to use the model.
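One common way to get a fixed-length vector per text from word embeddings is to average the vectors of its words. The sketch below shows only that averaging step. It assumes the word vectors are available as a token-to-vector mapping (the wv attribute of a trained Gensim Word2Vec model can be indexed this way); the hand-written 3-dimensional dictionary and the document_vector helper are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; in practice these would come from
# a trained Word2Vec model rather than being written by hand.
word_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "bad": np.array([-1.0, 0.0, 0.0]),
}
dim = 3

def document_vector(text, vectors, dim):
    """Average the vectors of the known words in a text (zeros if none are known)."""
    tokens = text.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

x_train = ["Good movie", "bad movie", "unknown words only"]
X = np.vstack([document_vector(t, word_vectors, dim) for t in x_train])
print(X.shape)  # (3, 3)
```

The resulting matrix has one row per text and can be passed to an MLP in the same way as the TF-IDF matrix.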
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
I have a database that I have split into train and test datasets, fitted an XGBoost model on the train set, and made predictions using the fitted model on the test set. So far everything is good.
Now if I save the fitted model and want to use it on a completely new dataset to make predictions, what should my new database look like?
Does it have to contain the exact number of features?
Does a categorical feature have to have the same categories in both databases?
I assume you are using one-hot encoding for, let's say, the color feature?
So technically, to avoid extra or new features in the test data, you should build the feature vector using train+test data.
Do the one-hot encoding/featurization on the whole set of training+testing data, then separate out the training dataset and the testing dataset.
Let's say [v1, v2, v3, ..., vn] is the list of feature names from the train+test data.
Now build the training data using these feature names. As expected, the feature column corresponding to the 5th color would be all zeros in the training data, and that's fine.
Use this same feature list for the test data; now you shouldn't have any discrepancies from new features coming up.
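A rough sketch of that procedure with pandas follows; the column names and values are invented, and "yellow" plays the role of a color that appears only in the test data:

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue", "red"], "size": [1, 2, 3, 4]})
test = pd.DataFrame({"color": ["green", "yellow"], "size": [5, 6]})

# Stack both sets, remembering which rows came from where.
combined = pd.concat([train, test], keys=["train", "test"])

# One-hot encode once so both sets share the same feature columns.
encoded = pd.get_dummies(combined, columns=["color"])

# Separate the two sets again.
train_encoded = encoded.xs("train")
test_encoded = encoded.xs("test")

print(list(train_encoded.columns))
# ['size', 'color_blue', 'color_green', 'color_red', 'color_yellow']
```

The color_yellow column is all zeros in the training part, which matches the all-zero column described above.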
Hope that clarifies.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have written an ML-based intrusion prediction model. During training, I used labeled training and test data to evaluate the accuracy and generate confusion matrices. I got good accuracy and now I want to test it with new (unlabeled) data. How do I do that?
Okay so say you do test on unlabeled data and your algorithm predicts some X output. How can you check the accuracy, how can you check if this is correct or not? This is the only thing that matters in predictions, how your program works on data it has not seen before.
The short answer is, you can't. You need to split your data into:
Training 70%
Validation 10%
Test 20%
All of these should be labeled, and accuracy, confusion matrix, F-measure and anything else should be computed on the labeled test data that your program has not seen before. You train on the training data, and every once in a while you check the performance on the validation data to see if it is doing well or if you need to make adjustments. At the very end you check on the test data. This is supervised learning: you always need labeled data.
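One way to get a 70/10/20 split with scikit-learn is to call train_test_split twice. This is a sketch on placeholder data; the arrays, random_state and the use of stratify are assumptions for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labeled data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: hold out 20% of everything as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: 10% of the total is 0.125 of the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.125, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 70 10 20
```

The test set is touched only once at the very end, while the validation set is the one used for checks during training.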
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Could you please explain what the "fit" method in scikit-learn does? Why is it useful?
In a nutshell: fitting is equal to training. Then, after it is trained, the model can be used to make predictions, usually with a .predict() method call.
To elaborate: Fitting your model to (i.e. using the .fit() method on) the training data is essentially the training part of the modeling process. It finds the coefficients for the equation specified via the algorithm being used (take for example umutto's linear regression example, above).
Then, for a classifier, you can classify incoming data points (from a test set, or otherwise) using the predict method. Or, in the case of regression, your model will interpolate/extrapolate when predict is used on incoming data points.
It should also be noted that the "fit" nomenclature is sometimes used for non-machine-learning methods, such as scalers and other preprocessing steps. In this case, fit computes whatever the transformation needs from your data (for example the minimum and maximum for a min-max scaler, or the IDF weights for TF-IDF), and the transformation itself is then applied with transform.
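A minimal sketch of both uses of fit, on made-up data that follows y = 2x + 1 exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])  # y = 2x + 1

# Model: fit finds the coefficients, predict uses them on new points.
model = LinearRegression()
model.fit(X_train, y_train)
prediction = model.predict(np.array([[4.0]]))  # extrapolates to about 9.0

# Preprocessing: fit stores the min and max, transform applies the scaling.
scaler = MinMaxScaler()
scaler.fit(X_train)
scaled = scaler.transform(np.array([[1.5]]))  # (1.5 - 0) / (3 - 0) = 0.5

print(model.coef_, model.intercept_, prediction, scaled)
```

In both cases fit only looks at the training data; predict and transform can then be called on data the object has never seen.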
Note: here are a couple of references...
fit method in python sklearn
http://scikit-learn.org/stable/tutorial/basic/tutorial.html
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm wondering if it is possible to include scikit-learn outlier detections like isolation forests in scikit-learn's pipelines?
So the problem here is that we want to fit such an object only on the training data and do nothing on the test data. Particularly, one might want to use cross-validation here.
What could a solution look like?
Build a class that inherits from TransformerMixin (and BaseEstimator for parameter tuning).
Now define a fit_transform function that stores whether it has been called yet. If it hasn't been called yet, it fits the outlier detector on the data and predicts the outliers. If it has been called before, the outlier detection has already been run on the training data, so we assume we are now seeing the test data, which we simply return.
Does such an approach have a chance to work or am I missing something here?
Your problem is basically the outlier detection problem.
Fortunately, scikit-learn provides some functions to predict whether a sample in your training set is an outlier or not.
How does it work? If you look at the documentation, it basically says:
One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.
sklearn provides some functions that allow you to estimate the shape of your data. Take a look at: elliptic envelope and isolation forests.
Personally, I prefer the IsolationForest algorithm, which returns an anomaly score for each sample in your training set. You can then remove the outlying samples from your training set.
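Here is a short sketch of that filtering step. The synthetic data, the contamination value of 0.01 and the two planted extreme points are all assumptions made for the example:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 200 ordinary points around the origin plus two planted extreme points.
X_train = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0], [-9.0, 9.0]]])

# contamination is the assumed fraction of outliers in the training set.
detector = IsolationForest(contamination=0.01, random_state=0)

labels = detector.fit_predict(X_train)    # 1 = inlier, -1 = outlier
scores = detector.score_samples(X_train)  # lower score = more abnormal

# Keep only the samples labeled as inliers.
X_clean = X_train[labels == 1]
print(X_train.shape[0], X_clean.shape[0])
```

The detector is fitted on the training data only; any labels would have to be filtered with the same boolean mask so that they stay aligned with X_clean.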