Import TensorFlow data from pyspark - python

I want to create a predictive model on several hundred GBs of data. The data needs some light preprocessing that I can do in pyspark but not in tensorflow. In my situation, it would be much more convenient to pass the result of the preprocessing directly to TF, ideally treating the pyspark data frame as a virtual input file to TF, instead of saving the preprocessed data to disk. However, I haven't the faintest idea how to do that, and I couldn't find an answer anywhere on the internet.
After some thought, it seems to me that I actually need an iterator (like the one defined by tf.data.Iterator) over Spark's data. However, I found comments online hinting that the distributed structure of Spark makes this very hard, if not impossible. Why so? Assuming I don't care about the order of the rows, why should it be impossible to iterate over the Spark data?

It sounds like you simply want to use tf.data.Dataset.from_generator(): you define a Python generator which reads samples out of Spark. Although I don't know Spark very well, I'm fairly sure you can do a reduce to the server that will be running the TensorFlow model. Better yet, if you're distributing your training, you can reduce to the set of servers that each need some shard of your final dataset.
The Importing Data programmer's guide covers the Dataset input pipeline in more detail. The TensorFlow Dataset will provide you with an iterator that's accessed directly by the graph, so there's no need for tf.placeholder or for marshaling data outside of the tf.data.Dataset.from_generator() code you write.
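To make the generator approach concrete, here is a minimal sketch, assuming the preprocessing already happened in pyspark and left a DataFrame named df with numeric feature columns; the names and shapes are illustrative, not from the original answer.

import tensorflow as tf

def spark_row_generator():
    # toLocalIterator() streams partitions to the driver one at a time,
    # so the whole DataFrame never has to fit in driver memory.
    for row in df.toLocalIterator():
        yield [float(x) for x in row]

dataset = tf.data.Dataset.from_generator(
    spark_row_generator,
    output_types=tf.float32,
    output_shapes=tf.TensorShape([None]),
)
dataset = dataset.batch(128).prefetch(1)

Note that toLocalIterator() funnels everything through the driver, which is the single-machine analogue of the reduce-to-one-server idea above.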

Related

How to use Spark for machine learning workflows with big models

Is there a memory efficient way to apply large (>4GB) models to Spark Dataframes without running into memory issues?
We recently ported a custom pipeline framework over to Spark (using Python and pyspark) and ran into problems when applying large models like Word2Vec and autoencoders to tokenized text inputs. First I very naively converted the transformation calls to UDFs (both pandas and Spark "native" ones), which worked fine as long as the models/utilities used were small enough to either be broadcast or instantiated repeatedly:
from nltk import tokenize
from pyspark.sql.functions import pandas_udf
import pandas

@pandas_udf("array<string>")
def tokenize_sentence(sentences: pandas.Series) -> pandas.Series:
    return sentences.map(lambda sentence: tokenize.word_tokenize(sentence))
Trying the same approach with large models (e.g. for embedding those tokens into vector space via word2vec) resulted in terrible performance, and I get why:
import gensim.downloader as api
from typing import List

@pandas_udf("array<array<double>>")
def rows_to_lists_of_vectors(rows: pandas.Series) -> pandas.Series:
    # Reloads the ~4 GB model in every Python worker that runs the UDF.
    model = api.load('word2vec-google-news-300')

    def words_to_vectors(words) -> List[List[float]]:
        vectors = []
        for word in words:
            if word in model:
                vec = model[word]
                vectors.append(vec.tolist())
        return vectors

    return rows.map(words_to_vectors)
The code above would instantiate the ~4 GB word2vec model repeatedly, loading it from disk into RAM, which is very slow. I could remedy this by using mapPartitions, which would at least only load it once per partition. But more importantly, this would crash with memory-related issues (at least on my dev machine) unless I heavily restricted the number of tasks, which in turn made the small UDFs very slow. For example, restricting the number of tasks to 2 would solve the memory crashes but make the tokenizing painfully slow.
I understand there is an entire pipeline framework in Spark, that would fit our needs, but before committing to that, I'd like to understand how the problems I ran into were solved there. Maybe there are some key practices we could use instead of having to rewrite our framework.
My actual question therefore is twofold:
Would using the Spark pipeline framework solve our issues regarding performance and memory, assuming we wrote custom Estimators and Transformers for the steps that are not covered by Spark out of the box (e.g. Tokenizers and Word2Vec)?
How does Spark solve those issues, if at all? Can I improve the current approach, or is this impossible using Python (where, to my understanding, processes don't share memory space)?
If any of the above makes you believe I missed a core principle with Spark, please point it out, after all I'm just getting started with Spark.
This will vary with many factors (models, cluster resources, pipeline), but trying to answer your main questions:
1) The Spark pipeline framework might solve your problem if it fits your needs in terms of the Tokenizers, Word2Vec, etc. However, those are not as powerful as the ones already available off the shelf and loaded with api.load. You might also want to take a look at Deeplearning4j, which brings these to Java/Apache Spark, and see how it does the same things: tokenize, word2vec, etc.
2) Following the current approach, I would load the model in a foreachPartition or mapPartitions call and ensure the model can fit into memory per partition (see the sketch below). You can shrink the partition size down to a more affordable number based on the cluster resources to avoid memory problems (it's the same pattern as creating one database connection per partition instead of one per row).
Typically, Spark UDFs are a good fit when you apply business logic that is Spark friendly, not when you mix in heavyweight third-party dependencies.
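As a concrete version of point 2 above, here is a minimal sketch of the per-partition loading pattern, reusing gensim's downloader API from the question; tokens_rdd is a hypothetical RDD of token lists.

import gensim.downloader as api

def embed_partition(rows):
    # Loaded once per partition instead of once per row/batch.
    model = api.load('word2vec-google-news-300')
    for words in rows:
        yield [model[word].tolist() for word in words if word in model]

vectors_rdd = tokens_rdd.mapPartitions(embed_partition)

Keeping partitions large (and limiting concurrent tasks per executor) is what bounds how many copies of the model are resident at once.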

TensorFlow: Is it possible to map a function to a dataset using a for-loop?

I have a tf.data.TFRecordDataset and a (computationally expensive) function which I want to map over it. I use TensorFlow 1.12 and eager execution, and the function uses NumPy ndarray interpretations of the tensors in my dataset, obtained via EagerTensor.numpy(). However, code inside functions passed to tf.data.Dataset.map() is not executed eagerly, which is why the .numpy() conversion doesn't work there and .map() is no longer an option. Is it possible to for-loop through a dataset and modify the examples in it? Simply assigning to them doesn't seem to work.
No, not exactly.
A Dataset is inherently lazily evaluated and cannot be assigned to in that way; conceptually, try to think of it as a pipeline rather than a variable: each value is read, passed through any map() operations, batch() ops, etc., and surfaced to the model as needed. To "assign" a value would be to write it to disk in the .tfrecord file, and that just isn't likely to ever be supported (these files are specifically designed for fast sequential reads, not random access).
You could, instead, use TensorFlow to do your preprocessing and use TFRecordWriter to write a NEW tfrecord file with the expensive preprocessing completed, then use this new dataset as the input to your model. If you have the disk space available, this might well be your best option.
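For illustration, here is a minimal sketch of that rewrite pattern under TF 1.12 with eager execution, matching the question's setup; expensive_fn, the file names, and the 'data' feature key are placeholders for your own processing and schema.

import tensorflow as tf
tf.enable_eager_execution()

dataset = tf.data.TFRecordDataset('input.tfrecord')
with tf.python_io.TFRecordWriter('preprocessed.tfrecord') as writer:
    for record in dataset:  # eager iteration, so .numpy() works here
        array = expensive_fn(record.numpy())
        example = tf.train.Example(features=tf.train.Features(feature={
            'data': tf.train.Feature(
                float_list=tf.train.FloatList(value=array.ravel().tolist())),
        }))
        writer.write(example.SerializeToString())

Training then reads 'preprocessed.tfrecord' with a plain TFRecordDataset, and the expensive work is paid only once.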

Large (80gb) dataset, Pandas, and xgboost

I'm quite comfortable using XGBoost to come up with predictive models; my concern is using it with a dataset that is (to me, at least) massive. I have four ~20 GB CSV files with training data that I am trying to clean up and get ready for model training. I am a little confused about how to start getting the data 'primed' for everything else; a few thoughts that I had (and I'm not certain they are the best) and some limitations I foresee:
pymysql or sqlalchemy: take the data and somehow pass it into a SQL database. QUESTION: do I process the data first, or process it once it is in the database?
Dask on a single computer (and not a cluster); again, just not sure how to interface it with XGBoost after one-hot encoding.
Using NumPy somehow; I remember reading about how that would work by representing arrays of each column, but I can't be damned to remember.
HDF5 file format; still don't think it would make it small enough to work with reasonably.
My system has 24 GB of RAM on 64-bit Ubuntu. Is there any way to use swap memory somehow to do all of the processing? It would be stupidly slow, certainly.
Effectively I'm wondering what one would recommend for cleaning, one-hot encoding, and training a machine learning algorithm with such a massive data set. Thank you!

Working with large (+15 gb) CSV datasets and Pandas/XGBoost

I am trying to find a means of starting to work with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL or some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect of it later on, and with loading chunks in at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas but also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
This blog post goes through an example of using XGBoost on a large CSV dataset. However, it did so by using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think that XGBoost training is likely to be one of them: XGBoost seems to operate best when all data is available all the time.
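For reference, the Dask route the blog post describes looks roughly like this sketch using the dask-xgboost package; the file pattern, column names, and parameters are illustrative, and the memory caveat above still applies.

import dask.dataframe as dd
from dask.distributed import Client
import dask_xgboost

client = Client()  # local cluster here; point this at a real one for big data
df = dd.read_csv('data-*.csv')
labels = df['target']
features = df.drop('target', axis=1)

params = {'objective': 'binary:logistic', 'max_depth': 6}
bst = dask_xgboost.train(client, params, features, labels, num_boost_round=100)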
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a NumPy array, so you are no longer constrained by memory for your dataset.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts numpy-like arrays for input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast.
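Putting those two suggestions together, a minimal sketch might look like this; 'train.h5' and the dataset names are hypothetical, and, as the answer says, whether XGBoost handles the h5py object gracefully is untried.

import h5py
import xgboost as xgb

f = h5py.File('train.h5', 'r')
X = f['features']   # on-disk dataset, indexed like a numpy array
y = f['labels'][:]  # labels are small, so load them fully

# Small subsample per the advice above to limit memory use.
model = xgb.XGBClassifier(subsample=0.1, n_estimators=100)
model.fit(X, y)     # relies on XGBoost treating X as array-like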

Representing time sequence input/output in tensorflow

I've been working through the TensorFlow documentation (still learning), and I can't figure out how to represent input/output sequence data. My inputs are sequences of 20 8-entry vectors, making an 8x20xN matrix, where N is the number of instances. I'd like to eventually pass these through an LSTM for sequence-to-sequence learning. I know I need a 3D tensor, but I'm unsure which dimensions are which.
RTFMs with pointers to the correct documentation greatly appreciated. I feel like this is obvious and I'm just missing it.
As described in the excellent blog post by WildML, the proper way is to save your examples in a TFRecord using the tf.train.SequenceExample format. Using TFRecords for this provides the following advantages:
You can split your data into many files and load each on a different GPU.
You can use TensorFlow utilities for loading the data (for example, using queues to load your data on demand).
Your model code will be separate from your dataset processing (this is a good habit to have).
You can bring new data to your model just by putting it into this format.
TFRecords use protocol buffers (protobuf) to format your data; the documentation can be found here. The basic idea is that you define a format for your data (in this case tf.train.SequenceExample), save it to a TFRecord, and load it using the same data definition. Code for this pattern can be found in this IPython notebook.
As my answer mostly summarizes the WildML blog post on this topic, I suggest you check it out, again found here.
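To make the format concrete, here is a minimal sketch of writing one instance as a tf.train.SequenceExample; the feature keys and the variables my_inputs (the 20 8-entry vectors from the question) and my_labels are placeholders.

import tensorflow as tf

def make_sequence_example(inputs, labels):
    # inputs: list of 20 8-entry float vectors; labels: list of ints.
    input_features = [
        tf.train.Feature(float_list=tf.train.FloatList(value=vec))
        for vec in inputs
    ]
    label_features = [
        tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
        for label in labels
    ]
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={
            'inputs': tf.train.FeatureList(feature=input_features),
            'labels': tf.train.FeatureList(feature=label_features),
        }))

with tf.python_io.TFRecordWriter('sequences.tfrecord') as writer:
    writer.write(make_sequence_example(my_inputs, my_labels).SerializeToString())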
