Python Memory Error - Sklearn Huge Input Data?


I need to train the svm classifier in sklearn. The dimensions of the feature vectors go in hundreds of thousands and there are tens of thousands of such feature vectors. However, each dimension can be 0, 1 or -1. Only some 100 are non-zero in each feature vector. Any efficient way to give the info about the feature vectors to the classifier?

I need to train the svm classifier in sklearn.
You mean sklearn.svm.SVC? For high-dimensional sparse data with many samples, LinearSVC, LogisticRegression, PassiveAggressiveClassifier or SGDClassifier can be much faster to train for comparable predictive accuracy.
The dimensions of the feature vectors go in lakhs and there are tens of thousands of such feature vectors. However, each dimension can be 0, 1 or -1. Only some 100 are non-zero in each feature vector. Any efficient way to give the info about the feature vectors to the classifier?
Find a way to load your data as a scipy.sparse matrix that does not store the zeros in memory. Have a look at the documentation on feature extraction. It will give you tools to do that depending on the nature of the representation of the original data.
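As a rough illustration of the sparse-matrix suggestion, here is a minimal sketch. The data is synthetic and the sizes (200 samples, 100,000 features, 100 non-zeros per row) are assumptions chosen only to mimic the shape described in the question; in practice the row/column indices would come from your own data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples, n_features, nnz_per_row = 200, 100_000, 100  # assumed sizes

# For each sample, store only the positions and values of the non-zero entries.
rows = np.repeat(np.arange(n_samples), nnz_per_row)
cols = np.concatenate([
    rng.choice(n_features, size=nnz_per_row, replace=False)
    for _ in range(n_samples)
])
vals = rng.choice(np.array([-1, 1], dtype=np.int8), size=rows.size)

# CSR keeps only the non-zeros; the zeros are never materialized.
X = csr_matrix((vals, (rows, cols)), shape=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)  # synthetic labels, for illustration only

clf = LinearSVC().fit(X, y)  # accepts the sparse matrix directly
print(X.shape, X.nnz)
```

A dense float64 array of the same shape would need about 160 MB here, while the sparse matrix stores only 20,000 values plus their indices.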

Related

Pre Processing spiral dataset to use for Logistic Regression

So I need to classify a spiral dataset. I have been experimenting with a bunch of algorithms like KNN, Kernel SVM, etc. I would like to try to improve the performance of Logistic Regression using feature engineering, preprocessing, etc.
I am also using scikit learn to do all of the classifications.
I fully understand that Logistic Regression is not the proper algorithm for this sort of problem. This is more of a learning exercise in preprocessing and other feature engineering/extraction methods, to see how much I can improve this specific model.
Here is an example dataset I would use for the classification. Any suggestions of how I can manipulate the dataset to use in the Logistic Regression algorithm would be helpful.
I also have datasets with multiple spirals. Some datasets have 2 classes and some have up to 5, which means up to 5 spirals.

Logistic Regression is generally used as a linear classifier, i.e. the decision boundary separating the samples of one class from the other is linear (a straight line), but it can be used for non-linear decision boundaries as well if the features are transformed first.
Using the kernel trick in SVC is also a good option, as it implicitly maps the data from the lower dimension to a higher dimension where it can become linearly separable.
example:
In the above example, the data is not linearly separable in the lower dimension, but after applying the transformation ϕ(x) = x² and adding it as a second dimension to the features, we get the right-side graph, which is linearly separable.
You can start transforming the data by creating new features for applying logistic regression.
Also try SVC (Support Vector Classifier), which uses the kernel trick. With SVC you don't have to transform the data into higher dimensions explicitly.
A few resources that are great for learning more are one and two.
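The ϕ(x) = x² idea can be sketched on a toy one-dimensional dataset. This is not the spiral data from the question: the synthetic data, the variable names and the default hyperparameters are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Class 0 sits in the middle, class 1 on both sides: no single
# threshold on x can separate them.
inner = rng.uniform(-1.0, 1.0, size=100)
outer = np.concatenate([rng.uniform(-3.0, -2.0, size=50),
                        rng.uniform(2.0, 3.0, size=50)])
x = np.concatenate([inner, outer])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])

X_raw = x.reshape(-1, 1)            # original feature only
X_aug = np.column_stack([x, x**2])  # add phi(x) = x^2 as a second feature

acc_raw = LogisticRegression().fit(X_raw, y).score(X_raw, y)
acc_aug = LogisticRegression().fit(X_aug, y).score(X_aug, y)

# An RBF-kernel SVC works on the raw feature without explicit transformation.
acc_svc = SVC(kernel="rbf").fit(X_raw, y).score(X_raw, y)

print(f"logistic, raw x:      {acc_raw:.2f}")
print(f"logistic, [x, x^2]:   {acc_aug:.2f}")
print(f"RBF SVC, raw x:       {acc_svc:.2f}")
```

For spiral data, the analogous engineered features would more plausibly be polar coordinates (radius and angle) than x², but which transformation helps is something to check empirically.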

Since the data doesn't seem to be linearly separable, you can try using the Kernel Trick method commonly used in Support Vector Classification. The kernel function accepts inputs in the original lower-dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. That means transformed vector ϕ(x) is just some function of the coordinates in the corresponding lower-dimensional vector x.
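To make the "dot product in the higher-dimensional space" statement concrete, here is a small numerical check for one specific kernel. The homogeneous degree-2 polynomial kernel and its feature map ϕ are chosen for illustration; they are not necessarily the kernel you would use on spiral data.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly_kernel(x, z):
    """Kernel evaluated in the original 2D space: k(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)     # dot product in the 3D transformed space
implicit = poly_kernel(x, z)   # same value, computed without ever building phi

print(explicit, implicit)
```

Both values agree, which is why a kernel method never needs to construct ϕ(x) explicitly.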

Scoring increasing with number of components using PCA

I recently started working in machine learning with Python. Today I'm working on a dataset where I would like to apply dimension reduction and then evaluate my model's score. The dataset has 30 features.
I start with a simple algorithm, Logistic Regression, but before applying it I want to do a PCA.
To determine the best number of components, I used GridSearchCV, tuning only the C parameter of my logistic regression and the number of components of my PCA.
The result is that the more components I use for my PCA, the better the precision score. For my example with n_components=30 I get a precision score of 0.81.
The problem is that I thought PCA was used for dimension reduction (i.e. working with fewer features) and that it could help increase the score. Is there something I do not understand?
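import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Chain PCA and logistic regression so both are tuned together
pca = PCA()
logistic = LogisticRegression()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
# Grid over the number of retained components and the regularization strength C
param_grid = {
    'pca__n_components': [5, 10, 15, 20, 25, 30],
    'logistic__C': [0.01, 0.1, 1, 10, 100],
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1, scoring='precision') # fix adding a tuple scoring
search.fit(X_train, y_train)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)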
output : Best parameter (CV score=0.881):
{'logistic__C': 0.01, 'pca__n_components': 30}
Thanks in advance for your reply
EDIT: I add this screenshot for more information on the score with number of components

In general, when you do dimension reduction, you lose some information. It is not surprising, then, that you get a higher score with the full set of PCA features. Working with fewer features can indeed help increase the score, but not necessarily. There are also other good reasons for using PCA for dimension reduction. Here are the main advantages of PCA:
PCA is a good technique for dimension reduction (with its own limitations) in the sense that it concentrates the variance of the dataset in the first dimensions of the computed new space. Hence, dropping the last features is done at a minimal cost in terms of information carried by the dataset (under certain hypotheses). Using PCA for dimension reduction mitigates the risk of overfitting by limiting the number of features, while losing a minimal amount of information. In this sense, fewer features can increase the score by avoiding overfitting, but that is not always true.
Dimension reduction with PCA can also be useful when working with noisy data. PCA will not directly eliminate the noise, but the first few features will have a higher signal-to-noise ratio since the variance of the dataset is concentrated there. The last features may then be dominated by noise and dropped.
Since PCA projects the dataset onto a new orthonormal basis, the new features are all linearly uncorrelated with each other. Many machine learning algorithms benefit from this property.
Of course, PCA should not be used in every case, as it has its own hypotheses and limitations. Here are what I consider the main ones (non-exhaustive):
PCA is sensitive to the scaling of the variables. As an example, if you have a temperature column in your dataset, you will get a different transformation depending on whether you use Celsius or Fahrenheit as the unit, because their scales are different. When the variables have different scales, PCA is a bit arbitrary. This can be corrected by scaling all variables to unit variance, but at the cost of modifying (compressing or expanding) the fluctuations of the variables in all dimensions.
PCA captures linear correlations between the features but fails to capture non-linear correlations.
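The scaling sensitivity can be seen in a small sketch. The two-feature synthetic data and the factor of 100 are assumptions chosen to exaggerate the effect of a change of unit.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))   # two independent features with similar variance

X_rescaled = X.copy()
X_rescaled[:, 0] *= 100.0       # same information, feature 0 in a different unit

# Without scaling, the rescaled feature dominates the first component.
ratio_raw = PCA().fit(X).explained_variance_ratio_
ratio_rescaled = PCA().fit(X_rescaled).explained_variance_ratio_

# After standardizing to unit variance, the choice of unit no longer matters.
ratio_std_a = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_
ratio_std_b = PCA().fit(StandardScaler().fit_transform(X_rescaled)).explained_variance_ratio_

print("raw:        ", ratio_raw)
print("rescaled:   ", ratio_rescaled)
print("standardized:", ratio_std_a, ratio_std_b)
```

In a pipeline like the one in the question, this would typically mean putting a StandardScaler step before the PCA step.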
What would be interesting in your case would be to compare the score obtained with and without the PCA transformation. You would see then if there is a benefit in using it.
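A comparison along those lines could look like the sketch below. The dataset is a stand-in: scikit-learn's bundled breast cancer dataset happens to have 30 features, but it is not necessarily the data from the question, and n_components=10 and max_iter=1000 are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset with 30 features (an assumption, not the asker's data).
X, y = load_breast_cancer(return_X_y=True)

without_pca = make_pipeline(StandardScaler(),
                            LogisticRegression(max_iter=1000))
with_pca = make_pipeline(StandardScaler(),
                         PCA(n_components=10),
                         LogisticRegression(max_iter=1000))

# Same folds and same metric for both pipelines.
scores_without = cross_val_score(without_pca, X, y, cv=5, scoring="precision")
scores_with = cross_val_score(with_pca, X, y, cv=5, scoring="precision")

print(f"without PCA: {scores_without.mean():.3f}")
print(f"with PCA:    {scores_with.mean():.3f}")
```

If the two means are within the fold-to-fold spread of each other, the PCA step is not buying much in terms of score, though it may still be worth keeping for a smaller model.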
Last but not least, your plot shows an interesting thing. The gain in the score between 20 and 30 features is very low (1%?). You may wonder whether it is worth keeping ten additional features for this very low gain. Indeed, keeping more features increases the risk of having a model with a lower ability to generalize. Cross-validation already mitigates this risk, but there is no guarantee that unseen data will have exactly the same properties as your training dataset when you apply the model to it.

Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.
In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in number.
Around 500 of them are the non-TF-IDF features.
The issue is that the accuracy of the Random Forest on the same test set etc with
- only the non-TF-IDF features is 87%
- the TF-IDF and non-TF-IDF features is 76%
This significant drop in accuracy raises some questions in my mind.
The relevant piece of code of mine with the training of the models is the following:
import os
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

drop_columns = ['labels', 'complete_text_1', 'complete_text_2']
# Split to predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values
# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])
vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])
# Convert the general features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)
# Concatenate the general features and tf-idf features array
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])
# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1)
rf_classifier.fit(X_train_all, y_train)
Personally, I have not seen any bug in my code (this piece above and in general).
The hypothesis which I have formulated to explain this decrease in accuracy is the following.
The number of non-TF-IDF features is only 500 (out of the 130k features in total)
This gives some chances that the non-TF-IDF features are not picked that much at each split by the trees of the random forest (eg because of max_features etc)
So if the non-TF-IDF features do actually matter then this will create problems because they are not taken enough into account.
Related to this, when I check the feature importances of the random forest after training it, I see that the importances of the non-TF-IDF features are very low (although I am not sure how reliable an indicator feature importances are, especially with TF-IDF features included).
Is there another explanation for the decrease in my classifier's accuracy?
In any case, what would you suggest doing?
Some other ideas of combining the TF-IDF and non-TF-IDF features are the following.
One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features.
Then the results of these two models will be combined either by (weighted) voting or meta-classification.

Your view that 130K features is way too many for the random forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of possible next steps. Here are a few ideas off the top of my head.
If the number of data points is large enough, you may want to train some transformation of the TF-IDF features. For example, you could train a low-dimensional embedding of these TF-IDF features into, say, a 64-dimensional space, with a small NN on top of that (maybe even a linear model). Once you have the embeddings, you could use them as transforms to generate 64 additional features for each example, replacing the TF-IDF features for random forest training. Alternatively, replace the whole random forest with a NN whose architecture combines all the TF-IDF features into a few neurons via fully-connected layers and later concatenates them with the other features (pretty much the same as embeddings, but as part of the NN).
If you don't have enough data to train a large NN, you can try training a GBDT ensemble instead of a random forest. It should probably do a much better job of picking the good features than a random forest, which is likely to be strongly affected by a large number of noisy, useless features. You can also first train a crude version and then do feature selection based on it (again, I would expect it to do a more reasonable job than a random forest).

My guess is that your hypothesis is partly correct.
When using the full dataset (in the 130K feature model), each split in the tree uses only a small fraction of the 500 non-TF-IDF features. So if the non-TF-IDF features are important, then each split misses out on a lot of useful data. The data that is ignored for one split will probably be used for a different split in the tree, but the result isn't as good as it would be when more of the data is used at every split.
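A back-of-the-envelope check of this point, assuming the forest uses the square-root rule for max_features (the usual setting for RandomForestClassifier when max_features is not changed; the question does not say otherwise) and the approximate counts given in the question:

```python
import math

n_features_total = 130_000   # approximate total from the question
n_non_tfidf = 500            # approximate non-TF-IDF count from the question

# With the square-root rule, each split considers this many candidate features.
per_split = int(math.sqrt(n_features_total))

# Expected number of non-TF-IDF features among the candidates at one split,
# if candidates are drawn uniformly at random.
expected_non_tfidf = per_split * n_non_tfidf / n_features_total

print(per_split, round(expected_non_tfidf, 2))  # prints: 360 1.38
```

So on average only one or two of the 500 non-TF-IDF features are even candidates at any given split.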
I would argue that there are some very important TF-IDF features, too. The fact that we have so many features means that a small fraction of those features is considered at each split.
In other words: the problem isn't that we're weakening the non-TF-IDF features. The problem is that we're weakening all of the useful features (both non-TF-IDF and TF-IDF). This is along the lines of Alexander's answer.
In light of this, your proposed solutions won't solve the problem very well. If you make two random forest models, one with 500 non-TF-IDF features and the other with 125K TF-IDF features, the second model will perform poorly, and negatively influence the results. If you pass the results of the 500 model as an additional feature to the 125K model, you're still underperforming.
If we want to stick with random forests, a better solution would be to increase the max_features and/or the number of trees. This will increase the odds that useful features are considered at each split, leading to a more accurate model.
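A minimal sketch of how one might test this, on synthetic data with a few informative features hidden among many noisy ones. The dataset sizes and the value max_features=0.2 are arbitrary assumptions; whether a larger max_features actually helps has to be checked on your own data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 5 informative features among 500 (assumed sizes).
X, y = make_classification(n_samples=300, n_features=500, n_informative=5,
                           n_redundant=0, random_state=0)

scores = {}
for max_features in ("sqrt", 0.2):   # 0.2 means 20% of the features per split
    rf = RandomForestClassifier(n_estimators=100, max_features=max_features,
                                random_state=0, n_jobs=-1)
    scores[max_features] = cross_val_score(rf, X, y, cv=3).mean()

for setting, score in scores.items():
    print(f"max_features={setting!r}: mean CV accuracy {score:.3f}")
```

Note that a larger max_features makes each split more expensive, so training time grows roughly in proportion.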

Weighting specific features in TF-IDF feature vectors for k-means clustering and cosine similarity

I have an array of TF-IDF feature vectors. I'd like to find similar vectors in the array using two methods:
Cosine similarity
k-means clustering
Using Scikit Learn, this process is pretty simple.
Now I'd like to weight certain features so that they will influence the results more than the other features. For example, I might like to weight the first 100 elements of the TF-IDF vectors so that those features are more indicative of similarity than the rest of the features.
How can I meaningfully weight certain features in my feature vectors? Is the process for weighting certain features the same for each of the similarity algorithms I listed above?

As I understand it, low values in the TF-IDF matrix mean that the words are less significant. So one approach is to lower the values in the matrix for the columns you want to de-emphasize (or, equivalently, raise them for the columns you want to emphasize).
The arrays in scikit-learn are sparse, so for testing and debugging you might want to convert them to a regular (dense) matrix. I also used xlsxwriter to get an overview of what is really happening when applying TF-IDF and KMeans++ (see https://www.dbc-enterprise-it-consulting.com/text-classifier/).
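One way to apply such per-column weights without densifying the matrix is to multiply by a diagonal weight matrix. This is a sketch under assumptions: the random sparse matrix stands in for a real TF-IDF matrix, and the weight of 3.0 on the first 10 columns is arbitrary.

```python
import numpy as np
from scipy.sparse import diags, random as sparse_random
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

# Stand-in for a TF-IDF matrix: 20 documents, 50 terms (assumed sizes).
X = sparse_random(20, 50, density=0.2, format="csr", random_state=0)

# Per-feature weights: emphasize the first 10 columns.
weights = np.ones(X.shape[1])
weights[:10] = 3.0

# Right-multiplying by a diagonal matrix scales each column by its weight.
X_weighted = (X @ diags(weights)).tocsr()

# Cosine similarity re-normalizes each row, so only relative weights matter.
similarities = cosine_similarity(X_weighted)

# k-means uses Euclidean distance; L2-normalizing the weighted rows first
# keeps its behaviour close to the cosine setting.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    normalize(X_weighted))

print(similarities.shape, labels)
```

The same weighted matrix can feed both methods, so the weighting step itself does not depend on which similarity algorithm is used.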

Trying to avoid .toarray() when loading data into an SVC model in scikit-learn

I'm trying to plug a bunch of data (sentiment-tagged tweets) into an SVM using scikit-learn. I've been using CountVectorizer to build a sparse array of word counts, and it's all working fine with smallish data sets (~5000 tweets). However, when I try to use a larger corpus (ideally 150,000 tweets, but I'm currently exploring with 15,000), .toarray(), which converts a sparse format to a denser format, immediately starts taking up immense amounts of memory (30k tweets hit over 50 GB before the MemoryError).
So my question is -- is there a way to feed LinearSVC() or a different manifestation of SVM a sparse matrix? Am I necessarily required to use a dense matrix? It doesn't seem like a different vectorizer would help fix this problem (as this problem seems to be solved by: MemoryError in toarray when using DictVectorizer of Scikit Learn). Is a different model the solution? It seems like all of the scikit-learn models require a dense array representation at some point, unless I've been looking in the wrong places.
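from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Build a sparse matrix of word counts (tokens split on whitespace)
cv = CountVectorizer(analyzer=str.split)
clf = svm.LinearSVC()
X = cv.fit_transform(data)
# Converting to dense arrays here is where the memory usage explodes
trainArray = X[:breakpt].toarray()
testArray = X[breakpt:].toarray()
clf.fit(trainArray, label)
guesses = clf.predict(testArray)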

LinearSVC.fit and its predict method can both handle a sparse matrix as the first argument, so just removing the toarray calls from your code should work.
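A minimal sketch of the question's code with the toarray calls removed. The toy tweets, labels and breakpt value are made up for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the tagged tweets (hypothetical data).
data = ["good great happy", "bad awful sad", "great happy",
        "awful sad", "good happy", "bad sad"]
labels = [1, 0, 1, 0, 1, 0]
breakpt = 4  # first 4 rows for training, the rest for testing

cv = CountVectorizer(analyzer=str.split)
X = cv.fit_transform(data)            # scipy.sparse matrix of word counts

clf = LinearSVC()
clf.fit(X[:breakpt], labels[:breakpt])   # sparse slice passed directly
guesses = clf.predict(X[breakpt:])       # sparse slice passed directly

print(issparse(X), guesses)
```

Row slicing a CSR matrix returns another sparse matrix, so nothing is densified at any point.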
All estimators that take sparse inputs are documented as doing so. E.g., the docstring for LinearSVC states:
Parameters
----------
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vector, where n_samples in the number of samples and
n_features is the number of features.
