PySpark Machine Learning on Wide Data in Qubole - python

I have a large dataset, with roughly 250 features, that I would like to use in a gradient-boosted trees classifier. I have millions of observations, but I'm having trouble getting the model to work with even 1% of my data (~300k observations). Below is a snippet of my code. I am unable to share any data, but all features are numeric (either a numerical variable or a dummy variable for various factor levels). I use VectorAssembler to create a features column containing the vector of features for the corresponding observation.
When I reduce the number of features used by the model, say to 5, the model runs without issue. Only when I make the problem more complex by adding a large number of features does it begin to fail. The error I get is a TTransportException. The model will try to run for hours before it errors out. I am building my model on Qubole. I am new to both Qubole and PySpark, so I'm not sure whether my issue is a Spark memory issue, a Qubole memory issue (my cluster has 4+ TB of memory and the data is only a few GB), or something else.
Any thoughts or ideas for testing/debugging would be helpful. Thanks.
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# The classifier expects the target column to be named "label"
train = train.withColumnRenamed(target, "label")
test = test.withColumnRenamed(target, "label")

evaluator = BinaryClassificationEvaluator()
gbt = GBTClassifier(maxIter=10)

# Fit on the assembled "features" column and score the held-out set
gbtModel = gbt.fit(train)
gbtPredictions = gbtModel.transform(test)
gbtPredictions.select('label', 'rawPrediction', 'prediction', 'probability').show(10)
print("Test Area Under ROC: " + str(evaluator.evaluate(gbtPredictions, {evaluator.metricName: "areaUnderROC"})))

You may want to try the steps here: https://docs.qubole.com/en/latest/troubleshooting-guide/notebook-ts/troubleshoot-notebook.html#ttexception. If that still doesn't help, feel free to open a support ticket with us and we would be happy to investigate.

Related

Saving pomegranate Bayesian Network models

I am making some rather big Bayesian networks for generating synthetic data, and I find pomegranate to be a good alternative as it generates data quickly and easily allows for inputting evidence. I have one problem with it: saving the trained models. Pomegranate's built-in method stores models as JSON files so big that I run out of memory once I have 30 or so variables, even when using "lighter" algorithms. The models cannot be pickled due to the error
TypeError: self.distributions_ptr,self.parent_count,self.parent_idxs cannot be converted to a Python object for pickling
I am wondering if anyone has a good alternative for storing pomegranate models, or else knows of a Bayesian Network library that generates data quickly after training. I would be grateful for any tips.
If your model can be learned and stored in memory, it can be saved to a file, but maybe not by pickling. There are many different formats for Bayesian networks (bif, xmlbif, dsl, uai, etc.). I don't know pomegranate, but there is certainly a way to read/save using such a format. With pyAgrum (of which I am one of the authors), you just have to write gum.saveBN(model, "model.xxx") to save it, and then bn=gum.loadBN("model.xxx") to read it. You can choose xxx among all the supported formats, for now: bif|dsl|net|bifxml|o3prm|uai (https://pyagrum.readthedocs.io/en/1.3.1/functions.html#pyAgrum.loadBN).
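As a minimal sketch of that round trip (here "model" is assumed to be an already-learned pyAgrum BayesNet):
import pyAgrum as gum

# Assumption: model is an already-learned gum.BayesNet
gum.saveBN(model, "model.bif")   # save in BIF format
bn = gum.loadBN("model.bif")     # load it back later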
As far as I understand, evidence for sampling is just a way to filter the samples, keeping only those that respect the constraints (rejection sampling). There is no such direct method in pyAgrum, but it can be done as a post-processing step:
import pyAgrum as gum

# Create a BN with random CPTs
bn = gum.fastBN("A->B{yes|maybe|no}<-C->D->E<-F<-B")

# Generate a sample of size 100
g = gum.BNDatabaseGenerator(bn)
g.setRandomVarOrder()
g.drawSamples(100)
df = g.to_pandas()

# Filter the dataframe on the evidence (rejection sampling)
rslt_df = df[(df['B'] == "yes") & (df['E'] == "1")]
The same code can be run in a notebook to display the filtered dataframe inline.

How to determine most impactful input variables in a dataset?

I have a neural network program that is designed to take in input variables and output variables, and use forecasted data to predict what the output variables should be. After running this program, I end up with an output vector. Let's say, for example, that my input matrix is 100 rows and 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output, and ranked the variables by the strength of that correlation, but I'm wondering if there is a better way to go about this.
If what you want is model selection, it's not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into that matter myself.
Naive approaches to model selection:
There are a number of ways to do this.
The naïve way is to estimate every possible model, i.e. every combination of features; the number of combinations grows exponentially with the number of features, so this quickly becomes computationally infeasible.
Another way is to take a variable you think is a good predictor and train the model only on that variable. Compute the error on the training data. Take another variable at random, retrain the model, and recompute the error on the training data. If the error drops, keep the variable; otherwise discard it. Keep going for all features (a sketch of this greedy approach follows below).
A third approach is the opposite: start by training the model on all features and sequentially drop variables (a less naïve approach would be to drop the variables you intuitively think have little explanatory power), compute the error on the training data, and compare to decide whether to keep each feature.
There are a million ways of going about this. I've presented three of the simplest, but again, you can go really deep into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).
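As an illustration of the second (greedy forward) approach, here is a minimal sketch; the linear model and mean-squared-error metric are placeholders for whatever model and loss you actually use:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def greedy_forward_selection(X, y):
    """Greedily keep features that reduce the training error."""
    model = LinearRegression()
    selected, best_error = [], np.inf
    for j in np.random.permutation(X.shape[1]):   # candidate features in random order
        trial = selected + [j]
        model.fit(X[:, trial], y)
        error = mean_squared_error(y, model.predict(X[:, trial]))
        if error < best_error:                    # keep the feature only if the error drops
            selected, best_error = trial, error
    return selected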

Updating linear regression with many features

The problem I have is the following:
I have a csv file with roughly 10 million rows. With that, I want to run a linear regression with many interaction terms. In the end I will have 3000 such interactions, so creating them by hand would give a dataset of shape (10 million, 3000), which won't fit into memory anymore. Furthermore, I need to center these interaction terms prior to fitting.
Fixed effects are not possible, as the interactions contain continuous variables rather than true dummies (they will be mostly 0, some 1, and a few 0.5).
The plan I have for now is the following:
Use dask (http://docs.dask.org/en/latest/dataframe.html) to read in the csv file, then create the interactions and save them out of core so that I don't have memory problems (pandas would fail here). How can I create the interactions with dask efficiently?
Center the created interaction terms with sklearn's StandardScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler). Should I just loop over a reasonable number of columns (say, 100), center those 100, and store the centered variables on disk again? From the dask documentation I gather it is easy to combine dask with sklearn.
Fit the model using stochastic gradient descent (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) with the partial_fit() method, processing the data part by part in a loop (a rough sketch follows after this list). I am also aware of other updating approaches (https://dahtah.wordpress.com/2011/11/29/rank-one-updates-for-faster-matrix-inversion/) but have not found implementations for Python.
Predict part by part in a loop.
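For concreteness, a rough sketch of what I have in mind for the read / center / fit steps; the column names x1, x2 and y are placeholders, and the interaction creation is simplified to a single product:
import dask.dataframe as dd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# Read the large csv lazily; dask only materialises partitions as needed
ddf = dd.read_csv("data.csv")

# Placeholder example: one interaction term between two existing columns
ddf["x1_x2"] = ddf["x1"] * ddf["x2"]
feature_cols = ["x1", "x2", "x1_x2"]  # in practice, the ~3000 interaction columns

scaler = StandardScaler()
model = SGDRegressor()

# First pass: learn the centering/scaling statistics chunk by chunk
for part in ddf.to_delayed():
    chunk = part.compute()
    scaler.partial_fit(chunk[feature_cols])

# Second pass: incrementally fit the regression on centered chunks
for part in ddf.to_delayed():
    chunk = part.compute()
    model.partial_fit(scaler.transform(chunk[feature_cols]), chunk["y"])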
Do you think this is a reasonable plan?

Subsampling an unbalanced dataset in tensorflow

Tensorflow beginner here. This is my first project and I am working with pre-defined estimators.
I have an extremely unbalanced dataset where positive outcomes represent roughly 0.1% of the total data, and I suspect this imbalance considerably affects the performance of my model. As a first attempt to solve the issue, since I have tons of data, I would like to throw away most of my negatives in order to create a balanced dataset. I can see two ways of doing it: preprocess the data to keep only a thousandth of the negatives and save it to a new file before passing it to tensorflow (for example with pyspark), or ask tensorflow to use only one negative out of every thousand it finds.
I tried to code this last idea but didn't manage. I modified my input function to read like this:
import numpy as np
import tensorflow as tf

def train_input_fn(data_file="../data/train_input.csv", shuffle_size=100_000, batch_size=128):
    """Generate an input function for the Estimator."""
    dataset = tf.data.TextLineDataset(data_file)  # Extract lines from input files using the Dataset API.
    dataset = dataset.map(parse_csv, num_parallel_calls=3)  # parse_csv (defined elsewhere) yields (features, label)
    dataset = dataset.shuffle(shuffle_size).repeat().batch(batch_size)
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()

    # TRY TO IMPLEMENT THE SELECTION OF NEGATIVES
    thrown = 0
    flag = np.random.randint(1000)
    while labels == 0 and flag != 0:
        features, labels = iterator.get_next()
        thrown += 1
        flag = np.random.randint(1000)
    print("I've thrown away {} negative examples before going for label {}!".format(thrown, labels))

    return features, labels
This, of course, doesn't work, because iterators don't know what's inside them, so the labels == 0 condition is never satisfied. Also, there is only one print in stdout, meaning that this function is only called once (and that I still don't understand how tensorflow really works). Anyway, is there a way to implement what I want?
PS: I suspect that the previous code, even if it worked as intended, would return less than a thousandth of the initial negatives, because the count restarts every time it finds a positive. This is a minor issue, and so far I could even look for a magic number inside the flag that gives me the expected result, without worrying too much about the mathematical beauty of it.
You will probably get better results by oversampling your under-represented class rather than throwing away data in your over-represented class. This way you keep the variance in the over-represented class. You might as well use the data you have.
The easiest way to achieve this is probably to create two Datasets, one for each class. Then you can use Dataset.interleave to sample equally from both datasets.
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#interleave
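For example, the two per-class datasets could be built by filtering the parsed dataset; this assumes each element is a (features, label) pair and that pos_ds/neg_ds are the names used in the snippet below:
import tensorflow as tf

# One stream per class, each repeated so neither runs out while sampling
pos_ds = dataset.filter(lambda features, label: tf.equal(label, 1)).repeat()
neg_ds = dataset.filter(lambda features, label: tf.equal(label, 0)).repeat()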
Oversampling can easily be achieved with the following code:
resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.7, 0.3])
Tensorflow has a good guide on dealing with unbalanced data; you can find more ideas here:
https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#oversampling

sklearn and large datasets

I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.
I use sklearn a lot, but for much smaller datasets.
In this situation the classical approach should be something like:
read only part of the data -> partially train your estimator -> delete the data -> read another part of the data -> continue to train your estimator.
I have seen that some sklearn algorithms have a partial_fit method that should allow training the estimator on successive subsamples of the data.
Now I am wondering: is there an easy way to do that in sklearn?
I am looking for something like
r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # read the next 20 lines
    m.partial_fit(x)
m.predict(new_x)
Maybe sklearn is not the right tool for this kind of thing?
Let me know.
I've used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron and Passive-Aggressive, and also Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All these classifiers share the partial_fit method which you mention. Some behave better than others, though.
You can find the methodology, the case study and some good resources in this post:
http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
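A minimal sketch of that partial_fit pattern, assuming the csv is streamed in chunks with pandas and has a label column named "target":
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = [0, 1]  # partial_fit needs the full set of classes up front

# Stream the csv in chunks and update the model incrementally
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    X = chunk.drop(columns="target")
    y = chunk["target"]
    clf.partial_fit(X, y, classes=classes)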
I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
This link may be useful...
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
I agree that h5py is useful but you may wish to use tools that are already in your quiver.
Another thing you can do is to randomly pick whether or not to keep each row of your csv file, and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that will let you start playing with it using all algorithms, and you can deal with the bigger-data issue along the way (or not at all! sometimes a sample with a good approach is good enough, depending on what you want).
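A rough sketch of that subsampling idea; the keep probability, column handling, and file names are placeholders:
import numpy as np
import pandas as pd

keep_prob = 0.05  # keep roughly 5% of rows

sampled = []
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    mask = np.random.rand(len(chunk)) < keep_prob
    sampled.append(chunk[mask])

sample = pd.concat(sampled)
np.save("sample.npy", sample.to_numpy())  # a .npy file loads much faster than re-parsing the csv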
You may want to take a look at Dask or Graphlab
http://dask.pydata.org/en/latest/
https://turi.com/products/create/
They are similar to pandas but work on large-scale data (using out-of-core dataframes). The problem with pandas is that all data has to fit into memory.
Both frameworks can be used with scikit-learn. You can load the 22 GB of data into Dask or SFrame, then use it with sklearn.
I find it interesting that you have chosen to use Python for statistical analysis rather than R. However, I would start by putting my data into a format that can handle such large datasets. The Python h5py package is fantastic for this kind of storage, allowing very fast access to your data. You will need to chunk up your data in reasonable sizes, say 1-million-element chunks (e.g. 20 columns x 50,000 rows), writing each chunk to the H5 file. Next you need to think about what kind of model you are running, which you haven't really specified.
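For illustration, writing such chunks with h5py might look roughly like this; the dataset shape and the load_block reader are hypothetical:
import h5py

n_rows, n_cols, chunk_rows = 10_000_000, 20, 50_000

with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("data", shape=(n_rows, n_cols),
                            dtype="float64", chunks=(chunk_rows, n_cols))
    # Write 50,000 rows at a time so memory use stays bounded
    for start in range(0, n_rows, chunk_rows):
        block = load_block(start, chunk_rows)  # hypothetical reader for the source data
        dset[start:start + chunk_rows, :] = block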
The fact is that you will probably have to write the algorithm for the model and the machine-learning cross-validation yourself, because the data is large. Start by writing an algorithm to summarize the data, so that you know what you are looking at. Then, once you decide what model you want to run, you will need to think about what the cross-validation will be. Put a "column" into each chunk of the data set that denotes which validation set each row belongs to. You may choose to assign each chunk to a particular validation set.
Next you will need to write a map-reduce-style algorithm to run your model on the validation subsets. The alternative is simply to run models on each chunk of each validation set and average the results (consider the theoretical validity of this approach).
Consider using spark, or R and rhdf5 or something similar. I haven't supplied any code because this is a project rather than just a simple coding question.
