I've been trying to run the RF classifier on a data set of ~50,000 entries with 20 or so labels, which I thought should be fine, but I keep hitting the following error when trying to fit...
Exception MemoryError: MemoryError() in 'sklearn.tree._tree.Tree._resize' ignored
Segmentation fault (core dumped)
The data set has been passed through the TfidfVectorizer and then TruncatedSVD with n=100 for dimensionality reduction. RandomForestClassifier is running with n_jobs=1 and n_estimators=10 in an attempt to find the minimum point at which it will work. The system has 4GB of RAM, and RF has worked in the past on a similar data set with much higher numbers of estimators etc. Scikit-learn is at the current version, 0.14.1.
Any tips?
Thanks
Segfaults are always bugs. If a malloc fails inside RandomForest then it should be caught, and my best guess is that this is what is happening to you. As a commenter already said, you should report this to the RandomForest bug tracker. But the malloc is probably failing because of an out-of-memory condition, so reduce your dimensionality, reduce your training set size, or run on a system with more memory.
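If memory is the culprit, the cheapest experiment is to shrink the pipeline described in the question and see where the crash threshold lies. A minimal sketch, assuming the TfidfVectorizer → TruncatedSVD → RandomForestClassifier setup above (texts and y are placeholder names for the raw documents and labels):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

# Subsample rows first: if 10k rows fit but 50k don't, it's memory.
idx = np.random.RandomState(0).choice(len(texts), 10000, replace=False)
sample_texts = [texts[i] for i in idx]
sample_y = y[idx]

X = TfidfVectorizer().fit_transform(sample_texts)
X = TruncatedSVD(n_components=50).fit_transform(X)  # half of the original n=100

clf = RandomForestClassifier(n_estimators=10, n_jobs=1)
clf.fit(X, sample_y)

If the subsample fits comfortably, scale it up until it breaks; that tells you whether you need fewer dimensions, fewer rows, or more RAM.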
Try the psutil library (link: https://pypi.python.org/pypi/psutil/0.5.0). With it, you can monitor the amount of available memory on your system using the following function:
psutil.phymem_usage()
This will help you detect whether your system lacks the memory, or whether the problem is in your code.
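phymem_usage() is the call from the psutil 0.5.0 era linked above; newer psutil releases expose the same information as virtual_memory(). A minimal monitoring snippet, assuming a recent psutil:

import psutil

# Snapshot system memory before and after the suspect step.
mem = psutil.virtual_memory()
print("available: %.2f GB (%s%% used)" % (mem.available / 1e9, mem.percent))

# clf.fit(X, y)  # run the expensive fit, then take another snapshot

If available memory collapses toward zero right before the crash, the segfault is almost certainly an out-of-memory problem rather than a code bug.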
I'm using python bertopic library for topic modeling of Polish tweets in a straightforward way, i.e.,
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
model = SentenceTransformer("sdadas/st-polish-paraphrase-from-distilroberta")
topic_model = BERTopic(embedding_model=model)
topics, probs = topic_model.fit_transform(docs)
where docs is a set of over 800k tweets.
The problem I encounter is that if I restrict myself to 100k documents there are no problems; however, if I exceed this number I end up with a segmentation fault.
I have dug around a little and figured out that the main cause of the problem is probably the umap package, which provides the dimensionality reduction algorithm: if I first obtain the embeddings with the model.encode() function and perform
umap_embeddings = umap.UMAP(n_neighbors=15, n_components=5, metric='cosine').fit_transform(embeddings)
I end up with the same segfault. However, if I switch the metric to euclidean, it all works fine with the full set of 800k documents. Stranger still, the cosine metric sometimes works with 300k, 500k or even 600k documents, but not always (umap is not deterministic).
I understand that the main reason could be memory issues; however, I get similar results when testing on a standalone PC with 64GB of RAM and an Epyc server with 378GB of RAM, even with the low_memory=True option.
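Since euclidean reliably works here, one workaround worth trying rests on the fact that for L2-normalized vectors, squared euclidean distance is a monotonic function of cosine distance (||u - v||^2 = 2 - 2*cos(u, v)), so the nearest-neighbor structure UMAP builds on is preserved. A minimal sketch, assuming embeddings is the array returned by model.encode():

import numpy as np
import umap

# L2-normalize each embedding; euclidean distances between unit
# vectors rank neighbors identically to cosine distances.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / np.clip(norms, 1e-12, None)

umap_embeddings = umap.UMAP(
    n_neighbors=15, n_components=5, metric='euclidean'
).fit_transform(unit_embeddings)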
I have a very large dataset (7 million rows, 54 features) that I would like to fit a regression model to using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly for different hyperparameter combinations until the best performing set is found.
For a given set of hyperparameters, XGBoost takes a very long time to train a model, so in order to find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, etc., I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from skopt import BayesSearchCV

xgb_pipe = Pipeline([('clf', XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=1))])
xgb_fit_params = {'clf__early_stopping_rounds': 5, 'clf__eval_metric': 'mae', 'clf__eval_set': [[X_val.values, y_val.values]]}
xgb_kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # random_state takes effect only with shuffle=True
xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold, n_jobs=2, n_points=1, n_iter=15, random_state=42, verbose=4, scoring='neg_mean_absolute_error', fit_params=xgb_fit_params)
xgb_unsm_cv.fit(X_train.values, y_train.values)
However, I've found that when n_jobs > 1 in the BayesSearchCV call, the fit crashes and I get the following error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
This error persists whenever I use more than 1 thread in the BayesSearchCV call, and is independent of the memory I provide.
Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can both packages be forced to work together somehow? Without some way of multithreading the optimization, I fear that fitting my model will take weeks to perform. What can I do to fix this?
I don't think the error has anything to do with incompatibility between the libraries. Rather, since you are asking for two different multi-threaded operations, you are running out of memory: your program is trying to load the complete dataset into RAM not once but several times, one copy per worker (depending on the number of threads).
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
A segmentation fault happens when a process touches memory it isn't allowed to; here the SIGKILL(-9) exit code indicates the operating system killed the workers, most likely because they exhausted the available memory.
Note that XGBoost is a RAM-hungry beast; coupling it with another multi-threaded operation is bound to take a toll (and, personally, it is not something I'd recommend on a daily-driver machine).
The most viable solution would probably be to use Google's TPUs or some other cloud service (beware of the costs), or to reduce the size of the dataset for processing using statistical techniques like the ones mentioned in this kaggle notebook and Data Science StackExchange article.
The idea is: either you upscale the hardware (monetary cost), go head-on with single-threaded BayesSearchCV (time cost), or downsize the data using whatever technique suits you best.
Finally, the answer is still that the libraries are probably compatible; the data is just too large for the available RAM.
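To make the downsizing route concrete, one widely used trick (the kind those notebooks describe) is to downcast numeric columns to the smallest dtype that holds their values, which often cuts a DataFrame's memory footprint in half or better. A minimal sketch, assuming the data lives in a pandas DataFrame df:

import pandas as pd

def downcast(df):
    # Shrink each numeric column to the smallest dtype that fits it.
    for col in df.select_dtypes(include='integer').columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include='float').columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df

print("before: %.1f MB" % (df.memory_usage(deep=True).sum() / 1e6))
df = downcast(df)
print("after:  %.1f MB" % (df.memory_usage(deep=True).sum() / 1e6))

With 7 million rows and 54 features, halving the per-cell width directly halves what each XGBoost worker has to hold.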
I'm facing the following problem: I'm running an SVR from the scikit-learn library on a training set with about 46,500 observations, and it has been running for more than six hours so far.
I'm using the linear kernel.
from sklearn.svm import SVR

def build_linear(self):
    model = SVR(kernel='linear', C=1)
    return model
I already tried changing the C value between 1e-3 and 1000; nothing changes.
The poly kernel runs in about 5 minutes, but I need the values from the linear kernel for an evaluation and can't skip this part...
Does anyone have an idea how to speed this up?
Thanks a lot!
SVMs are known to scale badly with the number of samples!
Instead of SVR with a linear kernel, use LinearSVR, or for huge data: SGDRegressor.
LinearSVR is more restricted in terms of what it can compute (no non-linear kernels), and more restricted algorithms usually come with more assumptions they can exploit to speed things up (or save memory).
SVR is based on libsvm, while LinearSVR is based on liblinear. Both are well-tested high-quality implementations.
(It might be valuable to add: in general cases like these, don't waste time waiting six hours. Sub-sample your data and try smaller, then less small, ... examples, and extrapolate the runtime or problems from that. Edit: it seems you did that already, good!)
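To make the suggested swap concrete, here is a minimal sketch of LinearSVR as a drop-in replacement for the linear-kernel SVR; the C value carries over, though the loss and tolerance defaults differ slightly between the two implementations, so results may not be bit-identical:

from sklearn.svm import LinearSVR

def build_linear(self):
    # liblinear-based solver: roughly linear in the number of
    # samples, unlike the libsvm-based SVR used above.
    model = LinearSVR(C=1)
    return model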
I am using the LightGBM 2.0.6 Python API. My training data has around 80K samples and 400 features, and I am training a multi-class classification model (#classes = 10) with ~2000 iterations. When the model is trained and I call model.feature_importance(), I encounter a segmentation fault.
I tried generating artificial data to test (with the same number of samples, classes, iterations and hyperparameters), and I can successfully obtain the list of feature importances. Therefore I suspect that whether the problem occurs depends on the training data.
I would like to see if someone else has encountered this problem and if so how was it overcome. Thank you.
This is a bug in LightGBM; 2.0.4 doesn't have this issue, and it should also be fixed in LightGBM master. So either downgrade to 2.0.4, wait for the next release, or use LightGBM master.
The problem indeed depends on the training data: feature_importance() segfaults only when there are "constant" trees in the trained ensemble, i.e. trees with a single leaf and no splits.
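If you want to check whether a trained model is affected before calling feature_importance(), you can look for single-leaf trees in the model dump. A minimal sketch, assuming model is a trained Booster and that the dump's tree_info entries carry num_leaves (true for recent LightGBM versions; worth verifying against 2.0.6):

# A "constant" tree has a single leaf and no splits, which is the
# condition described above as triggering the segfault.
dump = model.dump_model()
constant = [i for i, t in enumerate(dump['tree_info']) if t.get('num_leaves') == 1]
print("%d constant tree(s): %s" % (len(constant), constant))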
I'm using LibSVM in a 5x2 cross-validation to classify a very large amount of data; that is, I have 47k samples for training and 47k samples for testing, in 10 different configurations.
I usually use LibSVM's script easy.py to classify the data, but it's taking so long that I've been waiting for results for more than 3 hours with nothing to show, and I still have to repeat this procedure 9 more times!
Does anybody know how to use LibSVM faster with a very large amount of data? Do the C++ LibSVM functions work faster than the Python ones?
LibSVM's training algorithm doesn't scale up to this kind of dataset; it takes O(n³) time in the worst case and around O(n²) on typical ones. The first thing to try is scaling your datasets properly; if that still doesn't work, switch to
a linear SVM (or logistic regression) using, e.g., Liblinear, Vowpal Wabbit or Wapiti (a minimal sketch follows after the list), or
a more scalable kernel SVM implementation such as LaSVM
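As a concrete illustration of the first option, scikit-learn wraps Liblinear as LinearSVC, which trains in roughly linear time in the number of samples. A minimal sketch, assuming the features are in X and the labels in y (placeholder names):

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Scale features first (as suggested above), then fit the
# Liblinear-backed linear SVM; 47k samples finish in minutes.
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X, y)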
As larsmans mentioned, libsvm may not scale all that well depending on the dimensionality of the data and the number of data points.
The C implementation may run a bit faster, but it won't be a significant difference. You have a few options available to you.
You could randomly sample your data and work on a small subset of it.
You could project your data into a lower dimension with something like PCA (both options are sketched below).
Depending on your data type, you can look into different kernels. Would a histogram intersection kernel work out for your data? Are you using an RBF kernel when you really just need a linear decision function?
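A minimal sketch of the first two options combined, assuming the data sits in NumPy arrays X and y (placeholder names):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Option 1: randomly subsample to a tractable training-set size.
idx = rng.choice(len(X), 10000, replace=False)
X_small, y_small = X[idx], y[idx]

# Option 2: project into a lower dimension before the kernel SVM.
X_small = PCA(n_components=50).fit_transform(X_small)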
Hope this helps! One of the toughest problems in machine learning is coping with the sheer magnitude of data required at times.
easy.py is a script for training and evaluating a classifier. It does a meta-training of the SVM parameters with grid.py. In grid.py there is a parameter "nr_local_worker" which defines the number of threads. You might wish to increase it (check your processor load).
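For reference, a sketch of the change inside grid.py (assuming the stock LibSVM tools layout, where nr_local_worker is a module-level setting near the top of the script; verify the exact name against your copy):

# In libsvm/tools/grid.py: raise the number of local worker threads
# used for the cross-validation grid search, e.g. from 1 to 4.
nr_local_worker = 4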