I am building some rather large Bayesian networks for generating synthetic data, and I find pomegranate to be a good option because it generates data quickly and makes it easy to supply evidence. I have one problem with it: saving the trained models. Pomegranate's built-in methods store the model as JSON so large that I run out of memory with 30 or so variables, even when using "lighter" algorithms. The models cannot be pickled because of the error
TypeError: self.distributions_ptr,self.parent_count,self.parent_idxs cannot be converted to a Python object for pickling
I am wondering if anyone has a good alternative for storing pomegranate models, or else knows of a Bayesian Network library that generates data quickly after training. I would be grateful for any tips.
If your model can be learned and held in memory, it can be saved to a file, though maybe not by pickling. There are many different formats for Bayesian networks (bif, xmlbif, dsl, uai, etc.). I don't know pomegranate, but there is certainly a way to read/save using such a format. With pyAgrum (of which I am one of the authors), you just have to write gum.saveBN(model, "model.xxx") to save it, and then bn=gum.loadBN("model.xxx") to read it. You can choose xxx among all the supported formats, for now: bif|dsl|net|bifxml|o3prm|uai (https://pyagrum.readthedocs.io/en/1.3.1/functions.html#pyAgrum.loadBN).
As far as I understand, evidence for sampling is just a way to filter the samples, keeping only those that respect the constraints (rejection sampling). There is no such direct method in pyAgrum, but this can be done as a post-process:
import pyAgrum as gum
#create a BN with random CPTs
bn=gum.fastBN("A->B{yes|maybe|no}<-C->D->E<-F<-B")
# generate a sample of size 100
g=gum.BNDatabaseGenerator(bn)
g.setRandomVarOrder()
g.drawSamples(100)
df=g.to_pandas()
# filter the dataframe, keeping only rows that match the evidence
rslt_df = df[(df['B'] == "yes") &
             (df['E'] == "1")]
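To make the rejection-sampling idea concrete without depending on any library, here is a minimal sketch on a hypothetical two-node network. The network `Rain -> WetGrass`, its probabilities, and the helper names are illustrative assumptions, not pyAgrum or pomegranate API:

```python
import random

# Hypothetical CPTs for a toy network Rain -> WetGrass.
P_RAIN = 0.3
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.2}


def draw_sample(rng):
    """Forward-sample one row from the toy network."""
    rain = rng.random() < P_RAIN
    wet = rng.random() < P_WET_GIVEN_RAIN[rain]
    return {"Rain": rain, "WetGrass": wet}


def rejection_sample(n_draws, evidence, seed=0):
    """Draw n_draws rows and keep only those consistent with the evidence."""
    rng = random.Random(seed)
    rows = (draw_sample(rng) for _ in range(n_draws))
    return [r for r in rows if all(r[k] == v for k, v in evidence.items())]


kept = rejection_sample(10_000, {"WetGrass": True})
# The fraction of kept rows where it rained approximates P(Rain | WetGrass).
p_rain_given_wet = sum(r["Rain"] for r in kept) / len(kept)
```

The cost of this approach is that rare evidence discards most draws, so the number of raw samples has to grow accordingly.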
I'm interested in learning to rank with pairwise comparison. While working on this I found that XGBoost has a model called XGBRanker which works very well.
I want to find out how XGBRanker manages the training data to get such low memory usage and such good results. (It uses LambdaMART, I believe.) I imagine it uses some kind of lookup table for the features, and perhaps builds the pairs iteratively or does not use all possible permutations with different labels within one group.
I tried looking through the source code but everything keeps referring to some other XGBoost method and I haven't been able to understand it so far.
I would like to create a similar method to train NNs for pairwise comparison but handling the training data has been a huge hurdle so far.
So, more generally, my question is: how are the pairs created in pairwise ranking algorithms (RankNet, LambdaNet and so on)? Are all pairs used? Only a percentage? Is there some other way of doing this? If you're working with >100,000 items, you would easily get into the range of hundreds of millions of pairs.
I hope someone has some information about this or knows who might.
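To illustrate the two ideas mentioned in the question (forming pairs only inside a group and only between items with different labels, then optionally keeping a random subset), here is a small sketch. It is not how XGBRanker is implemented; the function name, the `max_pairs_per_group` parameter, and the sample labels are all hypothetical:

```python
import random
from itertools import combinations


def sample_pairs(labels_by_group, max_pairs_per_group=None, seed=0):
    """Yield (group, better_idx, worse_idx) pairs.

    Pairs are formed only inside a group and only between items whose
    labels differ; optionally a random subset is kept per group.
    """
    rng = random.Random(seed)
    for group, labels in labels_by_group.items():
        pairs = [
            (i, j) if labels[i] > labels[j] else (j, i)
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] != labels[j]
        ]
        if max_pairs_per_group is not None and len(pairs) > max_pairs_per_group:
            pairs = rng.sample(pairs, max_pairs_per_group)
        for better, worse in pairs:
            yield group, better, worse


# Hypothetical relevance labels for two query groups.
labels_by_group = {"q1": [2, 0, 1, 0], "q2": [1, 1, 0]}
all_pairs = list(sample_pairs(labels_by_group))
capped = list(sample_pairs(labels_by_group, max_pairs_per_group=2))
```

Because the generator yields index pairs rather than copied feature rows, the features can stay in a single array and be looked up per pair during training.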
I have a large dataset, with roughly 250 features, that I would like to use in a gradient-boosted trees classifier. I have millions of observations, but I'm having trouble getting the model to work with even 1% of my data (~300k observations). Below is a snippet of my code. I am unable to share any data with you, but all features are numeric (either a numerical variable or a dummy variable for various factor levels). I use VectorAssembler to create a features column containing the vector of features for the corresponding observation.
When I reduce the number of features used by the model, say to 5, the model runs without issue. Only when I make the problem more complex by adding a large number of features does it begin to fail. The error I get is a TTransport Exception. The model will try to run for hours before it errors out. I am building my model using Qubole. I am new to both Qubole and PySpark, so I'm not sure if my issue is a spark memory issue, Qubole memory (my cluster has 4+ TB, data only a few GB), etc.
Any thoughts or ideas for testing/debugging would be helpful. Thanks.
train = train.withColumnRenamed(target, "label")
test = test.withColumnRenamed(target, "label")
evaluator = BinaryClassificationEvaluator()
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
gbtPredictions = gbtModel.transform(test)
gbtPredictions.select('label','rawPrediction', 'prediction', 'probability').show(10)
print("Test Area Under ROC: " + str(evaluator.evaluate(gbtPredictions, {evaluator.metricName: "areaUnderROC"})))
You may want to try this: https://docs.qubole.com/en/latest/troubleshooting-guide/notebook-ts/troubleshoot-notebook.html#ttexception. If that still doesn't help, feel free to create a support ticket with us and we would be happy to investigate.
I need to use bag of words (in this case bag of features) to generate descriptor vectors to classify the KTH video dataset. To do this, I need to use the k-means clustering algorithm to cluster the extracted features and find the codebook. The features extracted from the dataset form approximately 75000 vectors of 100 elements each, and I'm facing memory issues using the scipy.cluster.kmeans2 implementation on Ubuntu. I ran some tests and found that with 32000 vectors of 100 elements each, the amount of memory used is around 20GB (my total memory is 32GB).
Is there any other Python k-means implementation that is more memory efficient?
I have already read about Mahout for clustering big data, but I still don't understand what its advantages are. Is it more memory-efficient with the amount of data mentioned above?
When you have many samples, consider using sklearn's MiniBatchKMeans, an SGD-like method built for this case. (There is also a more tutorial-like intro which does not address memory usage, but I expect it to do better there for large n_samples. Of course memory also depends on many other parameters such as k. With a huge n_features it won't help with memory, but that's not your problem here.)
In this case you should tune your mini-batch size carefully.
You can also try the classic KMeans implementation there, since you seem to be only slightly over the memory requirements and that implementation may be more efficient (it is certainly more tunable).
In the latter case, init, n_init, precompute_distances, algorithm and maybe copy_x are all parameters that affect memory consumption.
Furthermore, if (!) your data is sparse, try passing it as sparse matrices. (From reading the kmeans2 docs it seems this is not supported there, but sklearn's KMeans does support it.)
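A minimal sketch of the mini-batch approach, assuming the descriptors can be produced chunk by chunk. The random data stands in for the extracted features, and the codebook size of 50 and batch size of 1024 are illustrative values to tune, not recommendations:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Stand-in for the extracted descriptors: 5000 vectors of 100 elements.
features = rng.normal(size=(5000, 100)).astype(np.float32)

# n_clusters is the codebook size; both values here are assumptions.
kmeans = MiniBatchKMeans(n_clusters=50, batch_size=1024, random_state=0)

# Feed the data chunk by chunk so the full set never has to be processed at once.
for start in range(0, len(features), 1024):
    kmeans.partial_fit(features[start:start + 1024])

codebook = kmeans.cluster_centers_      # one row per codeword
words = kmeans.predict(features[:10])   # codeword index for each descriptor
```

In a real run, each chunk would be loaded from disk inside the loop instead of being sliced from an in-memory array.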
I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.
I use sklearn a lot, but for much smaller datasets.
In this situation the classical approach should be something like:
Read only part of the data -> Partial train your estimator -> delete the data -> read other part of the data -> continue to train your estimator.
I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator on successive subsamples of the data.
Now I am wondering: is there an easy way to do that in sklearn?
I am looking for something like
r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # next 20 lines
    m.partial_fit(x)
m.predict(new_x)
Maybe sklearn is not the right tool for this kind of thing?
Let me know.
I've used several scikit-learn classifiers with out-of-core capabilities to train linear models (Stochastic Gradient, Perceptron and Passive Aggressive) and also Multinomial Naive Bayes on a Kaggle dataset of over 30 GB. All these classifiers share the partial_fit method which you mention. Some behave better than others, though.
You can find the methodology, the case study and some good resources in this post:
http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
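For reference, a minimal sketch of the partial_fit loop with SGDClassifier. The `iter_chunks` helper, the chunk size of 200, and the synthetic in-memory CSV are assumptions for illustration; a real run would open the file on disk instead:

```python
import io
from itertools import islice

import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_chunks(handle, chunk_size):
    """Yield (X, y) arrays of at most chunk_size rows; last column is the label."""
    while True:
        rows = list(islice(handle, chunk_size))
        if not rows:
            return
        data = np.loadtxt(rows, delimiter=",", ndmin=2)
        yield data[:, :-1], data[:, -1].astype(int)


# Synthetic stand-in for a CSV too large to load: two features plus a label.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(2000, 2))
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(int)
csv_file = io.StringIO(
    "\n".join(f"{a},{b},{c}" for (a, b), c in zip(X_all, y_all)) + "\n"
)

model = SGDClassifier(random_state=0)
for X_chunk, y_chunk in iter_chunks(csv_file, chunk_size=200):
    # classes must list every label, since a chunk may not contain all of them.
    model.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))

accuracy = model.score(X_all, y_all)
```

Only one chunk is held in memory at a time, which is the point of the exercise. Linear models trained this way are sensitive to feature scaling, so scaling each chunk consistently is usually worth the effort.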
I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
This link may be useful...
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
I agree that h5py is useful but you may wish to use tools that are already in your quiver.
Another thing you can do is to randomly pick whether or not to keep each row in your csv file, and save the result to a .npy file so it loads quicker. That way you get a sample of your data that lets you start playing with it using any algorithm, and you can deal with the bigger data issue along the way (or not at all: sometimes a sample with a good approach is good enough, depending on what you want).
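A rough sketch of that sampling step, assuming a headerless, purely numeric CSV. The `sample_rows` helper, the 5% keep probability, and the synthetic file are illustrative assumptions:

```python
import io
import os
import random
import tempfile

import numpy as np


def sample_rows(handle, keep_prob, seed=0):
    """Stream a headerless numeric CSV and keep each row with probability keep_prob."""
    rng = random.Random(seed)
    kept = [
        [float(v) for v in line.split(",")]
        for line in handle
        if line.strip() and rng.random() < keep_prob
    ]
    return np.array(kept)


# Synthetic stand-in for the large CSV (10,000 rows, 3 numeric columns).
big_csv = io.StringIO("".join(f"{i},{i * 2},{i % 7}\n" for i in range(10_000)))
sample = sample_rows(big_csv, keep_prob=0.05)

# Save the sample in binary form so later sessions load it quickly.
path = os.path.join(tempfile.mkdtemp(), "sample.npy")
np.save(path, sample)
reloaded = np.load(path)
```

The file is read line by line, so memory use is driven by the size of the sample rather than the size of the file.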
You may want to take a look at Dask or Graphlab
http://dask.pydata.org/en/latest/
https://turi.com/products/create/
They are similar to pandas but working on large scale data (using out-of-core dataframes). The problem with pandas is all data has to fit into memory.
Both frameworks can be used with scikit learn. You can load 22 GB of data into Dask or SFrame, then use with sklearn.
I find it interesting that you have chosen to use Python for statistical analysis rather than R. However, I would start by putting my data into a format that can handle such large datasets. The Python h5py package is fantastic for this kind of storage, allowing very fast access to your data. You will need to chunk up your data into reasonable sizes, say 1 million element chunks (e.g. 20 columns x 50,000 rows), writing each chunk to the H5 file. Next you need to think about what kind of model you are running, which you haven't really specified.
The fact is that you will probably have to write the algorithm for the model and the machine learning cross validation yourself, because the data is large. Start by writing an algorithm to summarize the data, so that you know what you are looking at. Then, once you decide what model you want to run, you will need to think about what the cross validation will be. Put a "column" into each chunk of the data set that denotes which validation set each row belongs to. You may instead choose to assign each whole chunk to a particular validation set.
Next you will need to write a map reduce style algorithm to run your model on the validation subsets. The alternative is simply to run models on each chunk of each validation set and average the result (consider the theoretical validity of this approach).
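As a small illustration of the map-reduce style, here is a sketch that summarizes one column chunk by chunk. The chunk contents are hypothetical and held in memory for brevity; in practice each chunk would be read from the H5 file in turn:

```python
import math


def summarize_chunk(chunk):
    """Map step: reduce one chunk to (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def combine(a, b):
    """Reduce step: merge two partial summaries."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def finalize(summary):
    """Turn the merged summary into mean and population standard deviation."""
    n, total, total_sq = summary
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Hypothetical column split into chunks, as if read one at a time from disk.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

summary = (0, 0.0, 0.0)
for chunk in chunks:
    summary = combine(summary, summarize_chunk(chunk))

mean, std = finalize(summary)
```

The sum-of-squares formula is simple but can lose precision on data with a large mean relative to its spread, so a more careful running-variance update may be preferable for real data.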
Consider using spark, or R and rhdf5 or something similar. I haven't supplied any code because this is a project rather than just a simple coding question.
I am collecting a lot of really interesting data points as users come to my Python web service. For example, I have their current city, state, country, user-agent, etc. What I'd like to be able to do is run these through some type of machine learning system / algorithm (maybe a Bayesian classifier?), with the eventual goal of getting e-mail notifications when something out-of-the-ordinary occurs (anomaly detection). For example, Jane Doe has only ever logged in from USA on Chrome. So if she suddenly logs into my web service from the Ukraine on Firefox, I want to see that as a highly 'unusual' event and fire off a notification.
I am using CouchDB (specifically with Cloudant) already, and I see people often saying here and there online that Cloudant / CouchDB is perfect for this sort of thing (big data analysis). However I am at a complete loss for where to start. I have not found much in terms of documentation regarding relatively simple tracking of outlying events for a web service, let alone storing previously 'learned' data using CouchDB. I see several dedicated systems for doing this type of data crunching (PredictionIO comes to mind), but I can't help but feel that they are overkill given the nature of CouchDB in the first place.
Any insight would be much appreciated. Thanks!
You're correct in assuming that this is a problem ideally suited to machine learning, and scikit-learn.org is my preferred library for these types of problems. Don't worry about specifics (CouchDB, Cloudant) for now; let's get your problem into a state where it can be solved.
If we can assume that variations in log-in details (time, location, user-agent etc.) for a given user are low, then any large variation from this would trigger your alert. This is where the 'outlier' detection that @Robert McGibbon suggested comes into play.
For example, squeeze each log-in detail into one dimension, and then create a log-in detail vector for each user (there is significant room for improving this digest of log-in information):
log-in time (modulo 24 hrs)
location (maybe an array of integer locations, each integer representing a different country)
user-agent (a similar array of integer user-agents)
and so on. Every time a user logs in, create this detail array and store it. Once you have accumulated a large set of test data you can try running some ML routines.
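One possible way to build such a detail array is sketched below. The `CategoryEncoder` class, the `login_vector` helper, and the sample events are hypothetical names and data introduced only for illustration:

```python
from datetime import datetime


class CategoryEncoder:
    """Assign a stable integer code to each distinct category as it appears."""

    def __init__(self):
        self.codes = {}

    def encode(self, value):
        # New values get the next unused integer; known values keep their code.
        return self.codes.setdefault(value, len(self.codes))


countries = CategoryEncoder()
user_agents = CategoryEncoder()


def login_vector(timestamp, country, user_agent):
    """Build [hour of day, country code, user-agent code] for one log-in."""
    hour = timestamp.hour + timestamp.minute / 60.0
    return [hour, countries.encode(country), user_agents.encode(user_agent)]


# Hypothetical log-in events for one user.
history = [
    login_vector(datetime(2024, 1, 5, 11, 0), "US", "Chrome"),
    login_vector(datetime(2024, 1, 6, 11, 18), "US", "Chrome"),
    login_vector(datetime(2024, 1, 7, 23, 30), "UA", "Firefox"),
]
```

Note that integer codes impose an arbitrary ordering and distance on categories such as countries, which is one of the places where this digest has room for improvement.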
So, we have a user and a set of log-in data corresponding to successful log-ins (a training set). We can now train a Support Vector Machine to recognise this user's log-in pattern:
from sklearn import svm
# training data [[11.0, 2, 2], [11.3, 2, 2] ... etc]
train_data = my_training_data()
# create and fit the model
clf = svm.OneClassSVM()
clf.fit(train_data)
and then, every time a new log-in event occurs, create a single log-in detail array and pass it to the SVM:
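# predict expects a 2D array, so wrap the single log-in vector
# (assumes log_in_data is one 1D detail array); -1 means outlier
if clf.predict([log_in_data])[0] < 0:
    fire_alert_event()
else:
    # log-in is not dissimilar to previous attempts
    print('log in ok')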
If the SVM finds the new data point to be significantly different from its training set, then it will fire the alarm.
My Two Pence. Once you've got hold of a good training set, there are many more ML techniques that may be better suited to your task (they may be faster, more accurate etc) but creating your training sets and then training the routines would be the most significant challenge.
There are many exciting things to try! If you know you have bad log-in attempts, you can add these to the training sets by using a more complex SVM which you train with good and bad log-ins. Instead of using an array of disparate 'location' values, you could find the Euclidean distance between log-ins and use that! This sounds like great fun, good luck!
I also thought the approach using svm.OneClassSVM from sklearn was going to produce a good outlier detector. However, I put together some representative data based upon the example in the question and it simply could not detect an outlier. I swept the nu and gamma parameters from .01 to .99 and found no satisfactory SVM predictor.
My theory is that because the samples have categorical data (cities, states, countries, web browsers) the SVM algorithm is not the right approach. (I did, BTW, first convert the data into binary feature vectors with the DictVectorizer.fit_transform method).
I believe @sullivanmatt is on the right track when he suggests using a Bayesian classifier. Bayesian classifiers are used for supervised learning but, at least on the surface, this problem was cast as an unsupervised learning problem, i.e. we don't know a priori which observations are normal and which are outliers.
Because the outliers you want to detect are very rare in the stream of web site visits, I believe you could train the Bayesian classifier by labeling every observation in your training set as a positive/normal observation. The classifier should predict that true normal observations have higher probability simply because the majority of the observations really are normal. A true outlier should stand out as receiving a low predicted probability.
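A minimal sketch of the "low predicted probability means outlier" idea follows. It is a hand-rolled per-user frequency model with a naive independence assumption and Laplace smoothing, not an sklearn Bayesian classifier; the `CategoricalProfile` class and the sample history are hypothetical:

```python
import math
from collections import Counter


class CategoricalProfile:
    """Per-user profile: independent smoothed frequencies for each field."""

    def __init__(self, fields, alpha=1.0):
        self.fields = fields
        self.alpha = alpha  # Laplace smoothing so unseen values get non-zero mass
        self.counts = {f: Counter() for f in fields}
        self.total = 0

    def fit(self, observations):
        for obs in observations:
            for f in self.fields:
                self.counts[f][obs[f]] += 1
            self.total += 1
        return self

    def log_prob(self, obs):
        """Sum of per-field log-probabilities (naive independence assumption)."""
        score = 0.0
        for f in self.fields:
            n_values = len(self.counts[f]) + 1  # +1 reserves mass for unseen values
            p = (self.counts[f][obs[f]] + self.alpha) / (
                self.total + self.alpha * n_values
            )
            score += math.log(p)
        return score


# Hypothetical history: Jane has only ever logged in from the USA on Chrome.
history = [{"country": "US", "browser": "Chrome"}] * 50
profile = CategoricalProfile(["country", "browser"]).fit(history)

usual = profile.log_prob({"country": "US", "browser": "Chrome"})
unusual = profile.log_prob({"country": "UA", "browser": "Firefox"})
```

An alert would fire when the score falls below some threshold, which has to be chosen from the data; nothing in this sketch tells you what that threshold should be.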
If you're trying to investigate anomalies in user behaviour over time, I'd recommend looking at time-series anomaly detectors. With this approach you'll be able to statistically/automatically identify new, potentially suspicious, emerging patterns and abnormal events.
http://www.autonlab.org/tutorials/biosurv.html and http://web.engr.oregonstate.edu/~wong/workshops/icml2006/slides/agarwal.ppt
explain some techniques based on machine learning. In this case you can use scikit-learn.org, a very powerful Python library that contains tons of ML algos.