BERTopic / umap crashing with cosine metric - python

I'm using the Python BERTopic library for topic modeling of Polish tweets in a straightforward way, i.e.,
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
model = SentenceTransformer("sdadas/st-polish-paraphrase-from-distilroberta")
topic_model = BERTopic(embedding_model=model)
topics, probs = topic_model.fit_transform(docs)
where docs is a set of over 800k tweets.
The problem I encounter is that if I restrict myself to 100k documents there are no problems; however, if I exceed this number I end up with a segmentation fault.
I have dug a little and figured out that the main cause of the problem is probably the umap package, which provides the dimensionality reduction algorithm: if I first obtain the embeddings with the model.encode() function and run
umap_embeddings = umap.UMAP(n_neighbors=15, n_components=5, metric='cosine').fit_transform(embeddings)
I end up with the same segmentation fault. However, if I switch the metric to euclidean, everything works fine with the full set of 800k documents. Stranger still, the cosine metric sometimes works with 300k, 500k or even 600k documents, but not always (UMAP is not deterministic).
I understand that the main reason could be memory issues; however, I get the same results on a standalone PC with 64GB RAM and an Epyc server with 378GB RAM, even with the low_memory=True option.
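One possible workaround sketch (not a confirmed fix): cosine distance on L2-normalized vectors is a monotonic function of euclidean distance, so normalizing the embeddings and running UMAP with metric='euclidean' should give a very similar neighbor structure while avoiding the cosine code path that crashes here. The parameter values mirror the question; everything else is illustrative.

import umap
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize

model = SentenceTransformer("sdadas/st-polish-paraphrase-from-distilroberta")
embeddings = model.encode(docs)

# L2-normalize so that euclidean distances preserve the cosine neighbor ordering
embeddings = normalize(embeddings)

umap_model = umap.UMAP(n_neighbors=15, n_components=5,
                       metric='euclidean', low_memory=True)

# Pass the precomputed embeddings and the custom UMAP model to BERTopic
topic_model = BERTopic(umap_model=umap_model)
topics, probs = topic_model.fit_transform(docs, embeddings)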

Related

SciKit Learn SVR runs very long

I'm facing the following problem: I'm running an SVR from the scikit-learn library on a training set with about 46,500 observations, and so far it has been running for more than six hours.
I'm using the linear kernel.
from sklearn.svm import SVR

def build_linear(self):
    model = SVR(kernel='linear', C=1)
    return model
I already tried changing the C value between 1e-3 and 1000; nothing changes.
The poly kernel runs in about 5 minutes, but I need the values for an evaluation and can't skip this part...
Does anyone have an idea how to speed this up?
Thanks a lot!
SVMs are known to scale badly with the number of samples!
Instead of SVR with a linear kernel, use LinearSVR or, for huge data, SGDRegressor.
LinearSVR is more restricted in what it can compute (no non-linear kernels), and more restricted algorithms usually make more assumptions, which they exploit to speed things up (or save memory).
SVR is based on libsvm, while LinearSVR is based on liblinear. Both are well-tested high-quality implementations.
(It might be worth adding: in cases like this, don't waste time waiting six hours. Sub-sample your data, try progressively larger subsets, and extrapolate the runtime or spot the problem from that. Edit: it seems you did that already, good!)
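For reference, a minimal sketch of the suggested swap; the synthetic data and parameter values below are placeholders, not taken from the question:

from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

X, y = make_regression(n_samples=46500, n_features=20, noise=0.1, random_state=0)

# LinearSVR: the same linear-kernel problem as SVR(kernel='linear'), solved with liblinear
linear_svr = make_pipeline(StandardScaler(), LinearSVR(C=1, max_iter=10000))
linear_svr.fit(X, y)

# SGDRegressor: scales to very large sample counts via stochastic gradient descent
sgd = make_pipeline(StandardScaler(), SGDRegressor(alpha=1e-4))
sgd.fit(X, y)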

Handling K-means with a large 6GB dataset in scikit-learn?

I am using scikit-learn. I want to cluster a 6GB dataset of documents and find clusters among them.
I only have about 4GB of RAM, though. Is there a way to get k-means to handle large datasets in scikit-learn?
Thank you. Please let me know if you have any questions.
Use MiniBatchKMeans together with HashingVectorizer; that way, you can learn a cluster model in a single pass over the data, assigning cluster labels as you go or in a second pass. There's an example script that demonstrates MBKM.
Clustering is not in itself that well-defined a problem (a 'good' clustering result depends on your application), and the k-means algorithm only gives locally optimal solutions based on random initialization. Therefore I doubt that the results you would get from clustering a random 2GB subsample of the dataset would be qualitatively different from the results you would get clustering over the entire 6GB. I would certainly try clustering on the reduced dataset as a first port of call. Next options are to subsample more intelligently, or to do multiple training runs with different subsets and some kind of selection/averaging across those runs.
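A rough sketch of the MiniBatchKMeans + HashingVectorizer idea from the first answer, assuming the documents can be streamed from disk one per line (the file handling and parameter values are illustrative):

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
kmeans = MiniBatchKMeans(n_clusters=20, batch_size=1000, random_state=0)

def batches(path, size=1000):
    # Yield lists of documents without loading the whole file into memory;
    # assumes one document per line (adapt to your own format).
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line)
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch

# Pass 1: learn the cluster model incrementally
for docs in batches('documents.txt'):
    kmeans.partial_fit(vectorizer.transform(docs))

# Pass 2: assign cluster labels
labels = [kmeans.predict(vectorizer.transform(docs)) for docs in batches('documents.txt')]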

Save Naive Bayes Classifier in memory

I am new to NLTK and machine learning. I'm using Python with the NLTK Naive Bayes Classifier. I have created a Naive Bayes classifier for text classification using NLTK and saved it to disk. I am also able to load it when needed to classify some test data using this Python code:
import pickle
with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)
But my problem is that whenever new test data comes in, I have to load this classifier into memory again and again, which takes a lot of time (2-3 minutes) because of its large size. Also, if I have to run two instances of the same sentiment analysis program, they will take double the RAM, as both programs will load this classifier separately. My question is: is there any technique to keep this classifier in memory so that the sentiment analysis programs can read it directly from memory whenever needed, or is there any other method by which the load time of the classifier can be minimized?
Thanks in advance for your help.
You can't have it both ways. You can either keep pickling/unpickling one at a time to use less RAM, or you can store both in memory, using twice as much RAM but reducing load times and disk I/O wait times.
Are the two classifiers trained on different training data, or are you using the same classifier in parallel? It sounds like the latter from your mention of "two instances", and in that case you may want to look into threading to allow the same classifier to work with two sets of data (some parallelism may be achieved by classifying some of the data, then doing other work such as results processing to let the other thread classify, and repeating).
My expertise in this comes from having started an open source NLTK based sentiment analysis system: https://bitbucket.org/tommyjcarpenter/evopminer.
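One way to avoid reloading the pickle for every batch of test data is to keep a single long-running process that holds the classifier and let other programs query it. Below is a minimal sketch using the standard library's XML-RPC server; the port and the feature-dict interface are assumptions, so adapt them to how your classifier was trained.

import pickle
from xmlrpc.server import SimpleXMLRPCServer

# Load the classifier once, when the server starts
with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)

def classify(features):
    # features: a dict of feature name -> value, in the same format used for training
    return classifier.classify(features)

server = SimpleXMLRPCServer(('localhost', 8000), allow_none=True)
server.register_function(classify, 'classify')
server.serve_forever()

A client can then call xmlrpc.client.ServerProxy('http://localhost:8000').classify(features) from any number of programs without ever touching the pickle itself.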

Using sklearn and Python for a large application classification/scraping exercise

I am working on a relatively large text-based web classification problem, and I am planning to use the multinomial Naive Bayes classifier in sklearn in Python and the Scrapy framework for the crawling. However, I am a little concerned that sklearn/Python might be too slow for a problem that could involve classifying millions of websites. I have already trained the classifier on several thousand websites from DMOZ.
The research framework is as follows:
1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no larger than one). (The number of tokenized words here seems to vary between a few thousand and 150K for a sample run of the crawler.)
2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result.
My question is whether a Python-based classifier would be up to the task for such a large-scale application, or should I try rewriting the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If so, what might that environment be?
Or perhaps Python is enough if accompanied by some parallelization of the code?
Thanks
Use the HashingVectorizer and one of the linear classification modules that support the partial_fit API (for instance SGDClassifier, Perceptron or PassiveAggressiveClassifier) to incrementally learn the model without having to vectorize and load all the data into memory upfront. You should not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
You should, however, load a small subsample that fits in memory (e.g. 100k documents) and grid-search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class of the master branch. You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV on a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).
Also, linear models can be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn linear models independently, and then average the models to get the final model.
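A minimal out-of-core sketch of the approach described above; the mini-batch generator and the label set are placeholders for your own crawl data:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
clf = SGDClassifier(alpha=1e-6)

all_classes = ['Arts', 'Business', 'Science']  # placeholder DMOZ-style labels

for texts, labels in iter_minibatches():  # your own generator over scraped pages
    X = vectorizer.transform(texts)       # no fitting needed, hashing is stateless
    clf.partial_fit(X, labels, classes=all_classes)

# A model trained independently on another partition can be averaged in:
# clf.coef_ = (clf.coef_ + other_clf.coef_) / 2
# clf.intercept_ = (clf.intercept_ + other_clf.intercept_) / 2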
Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck as most critical portions of those libraries are implemented as C-extensions.
But since you're scraping millions of sites, you're going to be bound by your single machine's capabilities. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.
An example would be to funnel your scraping through Cloud Queues [2].
[1] http://www.picloud.com
[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/

Error in joblib.load file loading

I am using the Random Forest Regressor from Python's scikit-learn module for predicting some values. I used joblib.dump for saving the models. There are 24 joblib.dump files, each weighing 45 MB (the sum of all files is 931 MB). My problem is:
I want to load all 24 files in one program to predict 24 values, but I cannot do it: it gives a MemoryError. How can I load all 24 joblib files in one program without any errors?
Thanks in advance...
There are a few options, depending on where exactly you are running out of memory.
Since you are predicting 24 different values based on the same input data, you can do the predictions sequentially, so that you keep only one RFR in memory at a time.
e.g.:
import joblib

predictions = []
for regressor_file in all_regressors:
    regressor = joblib.load(regressor_file)
    predictions.append(regressor.predict(X))
(This might not apply to your case, but the problem is very common.)
You might be running out of memory when loading a large batch of input data. To solve this, you can split your input data and run predictions on sub-batches. That helped us when we moved from running predictions locally to EC2. Try running your code on a smaller input dataset to test whether this helps.
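A sketch of the sub-batching idea, assuming X is a NumPy array (the number of chunks is arbitrary, tune it to your RAM):

import numpy as np

predictions = []
for X_chunk in np.array_split(X, 100):  # predict on 100 sub-batches instead of all rows at once
    predictions.append(regressor.predict(X_chunk))
predictions = np.concatenate(predictions)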
You may want to optimise the parameters of the RFR. You may find that you can get the same predictive power with shallower trees or a smaller number of trees (or both). It is very easy to build a Random Forest that is just unnecessarily big. This is, of course, problem specific. I had to reduce the number of trees and make them smaller to make the model run efficiently in production; in my case, the AUC was the same before and after the optimisation. This last step of model tuning is sometimes omitted from tutorials.
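For instance, a slimmed-down forest might look like the sketch below; the synthetic data and the specific values are illustrative, not recommendations:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_train, y_train = make_regression(n_samples=10000, n_features=30, random_state=0)

# Fewer, shallower trees: often nearly the same accuracy, far smaller on disk and in RAM
regressor = RandomForestRegressor(n_estimators=50, max_depth=12, n_jobs=-1)
regressor.fit(X_train, y_train)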
