I am trying to find a way to start working with very large CSV files in Pandas, ultimately so that I can do some machine learning with XGBoost.
I am torn between using MySQL and some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect of it later on, and with loading chunks at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas, but also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
This blogpost goes through an example using XGBoost on a large CSV dataset. However, it did so by using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think that XGBoost training is likely to be one of them. XGBoost seems to operate best when all data is available all the time.
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a numpy array, so you are no longer constrained by memory for your dataset.
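For example, something like this (untested; the file name, chunk size, and numeric-only column layout are assumptions) would stream the CSV into an HDF5 dataset chunk by chunk without ever holding it all in memory:

import h5py
import pandas as pd

n_features = 100  # assumed number of numeric feature columns
with h5py.File('train.h5', 'w') as f:
    dset = f.create_dataset('X', shape=(0, n_features),
                            maxshape=(None, n_features),
                            dtype='float64', chunks=True)
    for chunk in pd.read_csv('train.csv', chunksize=100_000):
        values = chunk.to_numpy(dtype='float64')
        dset.resize(dset.shape[0] + values.shape[0], axis=0)
        dset[-values.shape[0]:] = values  # append the chunk at the end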
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts numpy-like arrays for input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast.
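A rough sketch of that idea (again, not something I've run; it assumes the HDF5 file above also has a 'y' dataset for the labels, and whether XGBoost really avoids materializing the whole array in memory depends on how it converts the input internally):

import h5py
from xgboost import XGBRegressor  # or XGBClassifier, depending on the task

f = h5py.File('train.h5', 'r')
X = f['X']       # numpy-like, but backed by disk
y = f['y'][:]    # assumes the labels fit comfortably in memory

model = XGBRegressor(subsample=0.1, n_estimators=100)  # small subsample as suggested
model.fit(X, y)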
Related
Is there a memory-efficient way to apply large (>4 GB) models to Spark DataFrames without running into memory issues?
We recently ported a custom pipeline framework over to Spark (using python and pyspark) and ran into problems when applying large models like Word2Vec and Autoencoders to tokenized text inputs. First I very naively converted the transformation calls to udfs (both pandas and spark "native" ones), which was fine, as long as the models/utilities used were small enough to either be broadcasted, or instantiated repeatedly:
@pandas_udf("array<string>")
def tokenize_sentence(sentences: pandas.Series):
    # nltk's word_tokenize applied row by row; fine while no large model is involved
    return sentences.map(lambda sentence: tokenize.word_tokenize(sentence))
Trying the same approach with large models (e.g. for embedding those tokens into vector space via word2vec) resulted in terrible performance, and I get why:
@pandas_udf("array<array<double>>")
def rows_to_lists_of_vectors(rows):
    model = api.load('word2vec-google-news-300')  # ~4 GB, reloaded on every call
    def words_to_vectors(words) -> List[List[float]]:
        vectors = []
        for word in words:
            if word in model:
                vec = model[word]
                vectors.append(vec.tolist())
        return vectors
    return rows.map(words_to_vectors)
The code from above would instantiate the ~4 GB word2vec model repeatedly, loading it from disk into RAM, which is very slow. I could remedy this by using mapPartitions, which would at least only load it once per partition. But more importantly, this would crash with memory-related issues (at least on my dev machine) if I didn't heavily restrict the number of tasks, which in turn made the small udfs very slow. For example, restricting the number of tasks to 2 would solve the memory crashes, but make the tokenizing painfully slow.
I understand there is an entire pipeline framework in Spark that would fit our needs, but before committing to that, I'd like to understand how the problems I ran into are solved there. Maybe there are some key practices we could use instead of having to rewrite our framework.
My actual question therefore is twofold:
Would using the Spark pipeline framework solve our issues regarding performance and memory, assuming we wrote custom Estimators and Transformers for the steps that are not covered by Spark out of the box (e.g. Tokenizers and Word2Vec)?
How does Spark solve those issues, if at all? Can I improve the current approach, or is this impossible using Python (where, to my understanding, processes don't share memory space)?
If any of the above makes you believe I missed a core principle with Spark, please point it out; after all, I'm just getting started with Spark.
This will vary based on several factors (models, cluster resources, pipeline), but to try to answer your main questions:
1) Spark pipelines might solve your problem if they fit your needs in terms of Tokenizers, Word2Vec, etc. However, those are not as powerful as the ones already available off the shelf and loaded with api.load. You might also want to take a look at Deeplearning4J, which brings those to Java/Apache Spark, and see how it can do the same things: tokenize, word2vec, etc.
2) Following the current approach, I would look at loading the model in a foreachPartition or mapPartitions call and ensuring the model can fit into memory per partition. You can shrink the partition count down to a more affordable number based on the cluster resources to avoid memory problems (it's the same idea as when, instead of creating a db connection for each row, you have one per partition).
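A minimal sketch of what I mean (the RDD name and partition count are placeholders; the word2vec calls follow the question's gensim usage):

import gensim.downloader as api

def embed_partition(rows):
    model = api.load('word2vec-google-news-300')  # loaded once per partition, not per row
    for words in rows:
        yield [model[word].tolist() for word in words if word in model]

# tokenized_rdd: an RDD of token lists; keep the number of concurrent tasks low
# enough that one ~4 GB model per running task still fits in executor memory
vectors_rdd = tokenized_rdd.repartition(8).mapPartitions(embed_partition)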
Typically, Spark UDFs are good when you apply the kind of business logic that is Spark friendly, without mixing in third-party external dependencies.
I want to create a predictive model on several hundred GBs of data. The data needs some non-intensive preprocessing that I can do in pyspark but not in TensorFlow. In my situation, it would be much more convenient to directly pass the result of the preprocessing to TF, ideally treating the pyspark DataFrame as a virtual input file to TF, instead of saving the pre-processed data to disk. However, I haven't the faintest idea how to do that, and I couldn't find anything about it anywhere on the internet.
After some thought, it seems to me that I actually need an iterator (like the one defined by tf.data.Iterator) over Spark's data. However, I found comments online hinting that the distributed structure of Spark makes this very hard, if not impossible. Why is that? Assuming I don't care about the order of the lines, why should it be impossible to iterate over the Spark data?
It sounds like you simply want to use tf.data.Dataset.from_generator(): you define a Python generator which reads samples out of Spark. Although I don't know Spark very well, I'm certain you can do a reduce to the server that will be running the TensorFlow model. Better yet, if you're distributing your training, you can reduce to the set of servers that need some shard of your final dataset.
The Importing Data programmer's guide covers the Dataset input pipeline in more detail. The TensorFlow Dataset will provide you with an iterator that's accessed directly by the graph, so there's no need for tf.placeholder or for marshaling data outside of the tf.data.Dataset.from_generator() code you write.
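For example, a rough sketch along those lines (the DataFrame name, column names, feature width, and batch size are assumptions, and it presumes the 'features' column holds a plain array of floats rather than a Spark vector type):

import tensorflow as tf

def spark_rows():
    # preprocessed_df is the pyspark DataFrame produced by your preprocessing;
    # toLocalIterator() streams rows to the driver one partition at a time
    for row in preprocessed_df.toLocalIterator():
        yield row['features'], row['label']

dataset = tf.data.Dataset.from_generator(
    spark_rows,
    output_types=(tf.float32, tf.float32),
    output_shapes=((100,), ()),   # assumed feature vector width
).batch(256).prefetch(1)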
I'm quite comfortable using XGBoost to come up with predictive models; my concern is using it with a dataset that is (to me, at least) massive. I have 4 ~20 GB CSV files with some training data that I am trying to clean up and get ready for model training. I am a little confused about how to start getting the data 'primed' for everything else; a few thoughts that I had (and I'm not certain they are the best) and some limitations I foresee:
pymysql or sqlalchemy: take the data and somehow pass it into a SQL database. QUESTION: Do I process the data first, or process it once it is in the database?
Dask on a single computer (and not a cluster); again, just not sure how to interface it with XGBoost after one-hot encoding (a rough sketch of what I have in mind is at the end of this post).
Using NumPy somehow; I remember reading about representing each column as an array somehow, but I can't for the life of me remember the details.
HDF5 file format; still don't think it would make it small enough to work with reasonably.
My system has 24 GB of RAM on 64-bit Ubuntu. Is there any way to use swap memory somehow to do all of the processing? It would be stupidly slow, certainly.
Effectively I'm wondering what one would recommend for cleaning, one-hot encoding, and training a machine learning algorithm with such a massive data set. Thank you!
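For reference, here's the kind of Dask sketch I have in mind for that option (completely untested; the file pattern, column names, and worker limits are made up, and I don't know if this is how you're actually supposed to wire it up):

import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client(n_workers=4, memory_limit='5GB')   # single machine, made-up limits

df = dd.read_csv('train_*.csv')                    # the four ~20 GB files
df = dd.get_dummies(df.categorize())               # one-hot encode the object columns
y = df['target']                                   # made-up label column name
X = df.drop(columns=['target'])

dtrain = xgb.dask.DaskDMatrix(client, X, y)
booster = xgb.dask.train(client, {'objective': 'reg:squarederror'},
                         dtrain, num_boost_round=100)['booster']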
I'm manipulating a huge DataFrame stored using an HDFStore object. The table is too big to be completely loaded into memory, so I have to extract the data chunk by chunk, which is fine for a lot of tasks.
Here comes my problem: I would like to apply a PCA to the table, which requires the whole DataFrame to be loaded, but I don't have enough memory to do that.
The PCA function takes a numpy array or a pandas DataFrame as input; is there another way to apply a PCA that directly uses an object stored on disk?
Thank you a lot in advance,
ClydeX
Seems like a perfect fit for the new IncrementalPCA in the 0.16 dev branch of scikit-learn.
Update: link to the latest stable version
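A minimal sketch of how that could look against an HDFStore (the store path, key, chunk size, and component count are placeholders):

import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=10)

with pd.HDFStore('data.h5', mode='r') as store:
    # requires the DataFrame to have been stored in table format
    for chunk in store.select('df', chunksize=50000):
        ipca.partial_fit(chunk.values)

# transforming can likewise be done chunk by chunk with ipca.transform(chunk.values)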
I'm classifying text data using TfidfVectorizer from scikit-learn.
I transform my dataset and it turns into a 75k by 172k sparse matrix with 865k stored elements. (I used an ngram range of 1–3.)
Fitting my data takes a long time, but it does fit.
However, when I try to predict on the test set I get memory issues. Why is this? I would think that the most memory-intensive part would be fitting, not predicting?
I've tried doing a few things to circumvent this but have had no luck. First I tried dumping the data locally with joblib.dump, quitting python and restarting. This unfortunately didn't work.
Then I tried switching over to a HashingVectorizer, but ironically the HashingVectorizer causes memory issues on the same data set. I was under the impression a HashingVectorizer would be more memory efficient?
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

hashing = HashingVectorizer(analyzer='word', ngram_range=(1, 3))
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3))

xhash = hashing.fit_transform(x)
xtfidf = tfidf.fit_transform(x)

# pac is the classifier being trained (mostly PassiveAggressiveClassifier, see below)
pac.fit(xhash, y)   # causes memory error
pac.fit(xtfidf, y)  # works fine
I am using scikit-learn 0.15 (bleeding edge) and Windows 8.
I have 8 GB of RAM and a hard drive with 100 GB of free space. I set my virtual RAM to 50 GB for the purposes of this project. I can set my virtual RAM even higher if needed, but I'm trying to understand the problem before just blunt-force trying solutions like I have been for the past couple of days... I've tried a few different classifiers: mostly PassiveAggressiveClassifier, Perceptron, MultinomialNB, and LinearSVC.
I should also note that at one point I was using a 350k by 472k sparse matrix with 12M stored elements. I was still able to fit the data, despite it taking some time. However, I had memory errors when predicting.
The scikit-learn library is strongly optimized (and uses NumPy and SciPy). TfidfVectorizer stores sparse matrices (relatively small in size compared with standard dense matrices).
If you think that it's an issue with memory, you can set the max_features parameter when you create the TfidfVectorizer. It may be useful for checking your assumptions
(for more detail about TfidfVectorizer, see the documentation).
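For example (the exact max_features value is just a guess to experiment with):

from sklearn.feature_extraction.text import TfidfVectorizer

# cap the vocabulary so the width of the tf-idf matrix stays bounded
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), max_features=50000)
xtfidf = tfidf.fit_transform(x)   # x: the training documents from the question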
Also, I can recommend that you reduce the training set and check again; that can also be useful for checking your assumptions.