I'm quite comfortable using XGBoost to come up with predictive models; my concern is using it with a dataset that is (to me, at least) massive. I have four ~20 GB CSV files with training data that I am trying to clean up and get ready for model training. I am a little confused about how to start getting the data 'primed' for everything else; here are a few thoughts I had (I'm not certain they are the best) and some limitations I foresee:
pymysql or sqlalchemy: take data, somehow pass it in to a SQL database. QUESTION: Do I process data first, or process it once it is in the database?
Dask on a single computer (and not a cluster); again, just not sure how to interface it with XGBoost after one-hot encoding.
Using NumPy somehow; I remember reading about representing each column as an array, but I can't for the life of me remember the details.
HDF5 file format; still don't think it would make it small enough to work with reasonably.
My system has 24 GB of RAM on 64-bit Ubuntu. Is there any way to use swap memory somehow to do all of the processing? It would be stupidly slow, certainly.
Effectively I'm wondering what one would recommend for cleaning, one-hot encoding, and training a machine learning algorithm with such a massive data set. Thank you!
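For the pymysql/sqlalchemy option above, a minimal sketch of streaming the CSVs into MySQL chunk by chunk, so nothing has to fit in RAM; the file path, table name, label column, and connection string are placeholders, not from the question:

```python
# Hedged sketch of the "load into SQL first" option: stream each ~20 GB CSV
# in chunks so it never has to fit in memory. Names below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/training_db")

for chunk in pd.read_csv("train_part1.csv", chunksize=500_000):
    # Light per-chunk cleaning before the rows ever touch the database.
    chunk = chunk.dropna(subset=["label"])
    chunk.to_sql("raw_training", engine, if_exists="append", index=False)
```

Whether you clean before or after loading, chunked reads like this keep the per-step memory footprint bounded.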
Related
I have a 500+ MB CSV data file. My question is: which would be faster for data manipulation (e.g., reading, processing) — the Python MySQL client, where all work is mapped into SQL queries and optimization is left to the optimizer, or Pandas, which deals with the file directly and should therefore be faster than communicating with a server?
I have already checked "Large data" work flows using pandas, Best practices for importing large CSV files, Fastest way to write large CSV with Python, and Most efficient way to parse a large .csv in python?. However, I haven't really found any comparison regarding Pandas and MySQL.
Use Case:
I am working on a text dataset that consists of 1,737,123 rows and 8 columns. I am feeding this dataset into an RNN/LSTM network. Prior to feeding, I do some preprocessing: encoding with a customized encoding algorithm.
More details
I have 250+ experiments to do and 12 architectures (different models design) to try.
I am confused; I feel I am missing something.
There's no comparison online because these two scenarios give different results:
With Pandas, you end up with a Dataframe in memory (as a NumPy ndarray under the hood), accessible as native Python objects
With MySQL client, you end up with data in a MySQL database on disk (unless you're using an in-memory database), accessible via IPC/sockets
So, the performance will depend on
how much data needs to be transferred by lower-speed channels (IPC, disk, network)
how fast transferring is compared to processing (i.e., which of them is the bottleneck)
which data format your processing facilities prefer (i.e. what additional conversions will be involved)
E.g.:
If your processing facility can reside in the same (Python) process that will be used to read the data, reading it directly into Python types is preferable, since you won't need to transfer it all to the MySQL process and then back again (converting formats each time).
OTOH, if your processing facility is implemented in some other process and/or language, or e.g. resides within a computing cluster, hooking it to MySQL directly may be faster by eliminating the comparatively slow Python from the equation, and because you'll need to be transferring the data again and converting it into the processing app's native objects anyway.
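To make the two paths concrete, here is a rough sketch of what is being compared; the file name, table name, and connection URL are assumptions for illustration only:

```python
# Rough sketch of the two paths being compared; "games.csv", the table
# name, and the connection URL are made up for illustration.
import pandas as pd
from sqlalchemy import create_engine

# Path 1: read the file straight into an in-process DataFrame.
df_file = pd.read_csv("games.csv")

# Path 2: the same rows, but round-tripped through the MySQL server
# (every row crosses a socket and is converted twice).
engine = create_engine("mysql+pymysql://user:password@localhost/textdb")
df_sql = pd.read_sql("SELECT * FROM games", engine)
```

If everything downstream is Python in the same process, Path 1 avoids the extra transfer and conversion entirely.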
I want to create a predictive model on several hundred GBs of data. The data needs some non-intensive preprocessing that I can do in PySpark but not in TensorFlow. In my situation, it would be much more convenient to pass the result of the preprocessing directly to TF, ideally treating the PySpark data frame as a virtual input file to TF, instead of saving the preprocessed data to disk. However, I haven't the faintest idea how to do that, and I couldn't find an answer anywhere on the internet.
After some thought, it seems to me that I actually need an iterator (like the one defined by tf.data.Iterator) over Spark's data. However, I found comments online hinting that the distributed structure of Spark makes this very hard, if not impossible. Why so? Imagine that I don't care about the order of the lines; why should it be impossible to iterate over the Spark data?
It sounds like you simply want to use tf.data.Dataset.from_generator(): you define a Python generator which reads samples out of Spark. Although I don't know Spark very well, I'm certain you can do a reduce to the server that will be running the TensorFlow model. Better yet, if you're distributing your training, you can reduce to the set of servers that need some shard of your final dataset.
The Importing Data programmer's guide covers the Dataset input pipeline in more detail. The TensorFlow Dataset will provide you with an iterator that's accessed directly by the graph, so there's no need for tf.placeholders or for marshaling data outside of the tf.data.Dataset.from_generator() code you write.
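An untested sketch of that idea, pulling rows off the Spark DataFrame on the driver and handing them to tf.data; spark_df, the column names, and the feature shape are assumptions:

```python
# Untested sketch: stream rows from a (preprocessed) Spark DataFrame into
# tf.data via a Python generator. Column names and shapes are assumed.
import tensorflow as tf

def spark_row_generator():
    # toLocalIterator() streams partitions to the driver one at a time,
    # so the whole DataFrame never has to be collected at once.
    for row in spark_df.toLocalIterator():
        yield row["features"], row["label"]

dataset = tf.data.Dataset.from_generator(
    spark_row_generator,
    output_types=(tf.float32, tf.int64),
    output_shapes=(tf.TensorShape([8]), tf.TensorShape([])),
).batch(256)
```

The main caveat is that the generator runs on a single driver process, so this gives you streaming, not distributed, input.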
I am trying to find a means of starting to work with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL and some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect later on, and with loading chunks in at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas, but also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
This blogpost goes through an example using XGBoost on a large CSV dataset. However, it did so by using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think that XGBoost training is likely to be one of them. XGBoost seems to operate best when all data is available all the time.
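For reference, one possible shape of that pipeline, using dask.dataframe together with XGBoost's own Dask interface (the blogpost used the separate dask-xgboost package; recent XGBoost releases ship xgboost.dask). File patterns, column names, and worker sizing are placeholders, and the caveat above still applies: the data has to fit across the workers' memory for training.

```python
# Hedged sketch: read the CSVs lazily, one-hot encode, and train with
# XGBoost's Dask interface on a local "cluster" of processes.
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client, LocalCluster

client = Client(LocalCluster(n_workers=4, memory_limit="5GB"))

df = dd.read_csv("train_part*.csv")
y = df["label"]
X = dd.get_dummies(df.drop(columns="label").categorize())  # one-hot encoding

dtrain = xgb.dask.DaskDMatrix(client, X, y)
out = xgb.dask.train(client, {"objective": "binary:logistic"},
                     dtrain, num_boost_round=100)
booster = out["booster"]
```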
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a NumPy array, so you are no longer constrained by memory for your dataset.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts NumPy-like arrays for input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast.
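An untested sketch of that suggestion; file and dataset names are placeholders, and depending on the XGBoost version the wrapper may still insist on a full NumPy copy of X (which loads everything into memory), so the out-of-core behaviour is not guaranteed:

```python
# Untested sketch of the h5py + sklearn-API idea above; names are placeholders.
import h5py
import xgboost as xgb

with h5py.File("train.h5", "r") as f:
    X = f["features"]        # on-disk dataset, sliceable like a NumPy array
    y = f["labels"][:]       # labels are small enough to pull in fully

    # Small subsample, as suggested above; some XGBoost versions may still
    # require X[:] (a full in-memory copy) here.
    model = xgb.XGBClassifier(n_estimators=200, subsample=0.1)
    model.fit(X, y)
```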
I am currently working on my thesis, which involves dealing with quite a sizable dataset: ~4 million observations and ~260 thousand features. It is a dataset of chess games, where most of the features are player dummies (130k for each colour).
As for the hardware and the software, I have around 12GB of RAM on this computer. I am doing all my work in Python 3.5 and use mainly pandas and scikit-learn packages.
My problem is that obviously I can't load this amount of data into my RAM. What I would love to do is to generate the dummy variables, then slice the dataset into a thousand or so chunks, apply a Random Forest to each, and aggregate the results again.
However, to do that I would need to be able to first create the dummy variables, which I am not able to do due to a memory error, even if I use sparse matrices. Theoretically, I could just slice up the database first and then create the dummy variables. However, the effect of that would be that I end up with different features for different slices, so I'm not sure how to aggregate such results.
My questions:
1. How would you guys approach this problem? Is there a way to "merge" the results of my estimation despite having different features in different "chunks" of data?
2. Perhaps it is possible to avoid this problem altogether by renting a server. Are there any trial versions of such services? I'm not sure exactly how much CPU/RAM I would need to complete this task.
Thanks for your help, any kind of tips will be appreciated :)
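Regarding question 1, one way the "different features per slice" problem is sometimes avoided is to learn the full set of player IDs once, then encode chunk by chunk against that fixed category set, so every slice produces identical sparse columns. A hedged sketch with made-up file and column names:

```python
# Hedged sketch: fix the category set once, then encode slice by slice so
# per-chunk models share the same feature space. Names are made up.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Pass 1: read only the two ID columns to learn every category once.
ids = pd.read_csv("games.csv", usecols=["white_id", "black_id"])
encoder = OneHotEncoder(handle_unknown="ignore", dtype=np.int8)
encoder.fit(ids)

# Pass 2: stream the data; every chunk now maps to the same sparse columns,
# so per-chunk forests can be aggregated without feature mismatches.
for chunk in pd.read_csv("games.csv", chunksize=100_000):
    X_players = encoder.transform(chunk[["white_id", "black_id"]])
    # ... train a forest on X_players (plus the remaining features) here
```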
I would suggest you give CloudxLab a try.
Though it is not free, it is quite affordable ($25 a month). It provides a complete environment to experiment with various tools such as HDFS, MapReduce, Hive, Pig, Kafka, Spark, Scala, Sqoop, Oozie, Mahout, MLlib, ZooKeeper, and R. Many popular trainers are using CloudxLab.
I do a lot of statistical work and use Python as my main language. Some of the data sets I work with though can take 20GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from hard disk as opposed to strictly in-memory processing. But, I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).
I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).
Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?
EDIT: Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB, etc.) needs to hold data in memory. SAS continues to have a strong selling point here because, while disk-based analytics may be slower, you can confidently deal with huge data sets. So, I am hoping that Stackoverflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.
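To make the "operate on data from disk" pattern concrete, here is a minimal sketch of accumulating statistics over an h5py dataset one slice at a time; the file and dataset names are placeholders, and mean/standard deviation stand in for whatever summary you actually need:

```python
# Minimal sketch: the dataset stays on disk, and only one slice at a time
# is materialized in RAM. File and dataset names are placeholders.
import h5py
import numpy as np

with h5py.File("measurements.h5", "r") as f:
    data = f["values"]                      # on-disk, indexed like a NumPy array
    n, total, total_sq = 0, 0.0, 0.0
    for start in range(0, data.shape[0], 1_000_000):
        block = data[start:start + 1_000_000]   # only this slice is in memory
        n += block.size
        total += block.sum()
        total_sq += np.square(block, dtype=np.float64).sum()

mean = total / n
std = np.sqrt(total_sq / n - mean ** 2)
```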
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs.
HDF5 advantages:
data can be inspected conveniently using the h5view application, h5py/ipython and the h5* commandline tools
APIs are available for different platforms and languages
structuring data using groups
annotating data using attributes
worry-free built-in data compression
I/O on single datasets is fast
HDF5 pitfalls:
Performance breaks down if an h5 file contains too many datasets/groups (> 1,000), because traversing them is very slow. On the other hand, I/O is fast for a few big datasets.
Advanced data queries (SQL like) are clumsy to implement and slow (consider SQLite in that case)
HDF5 is not thread-safe in all cases: one has to ensure that the library was compiled with the correct options
Changing h5 datasets (resize, delete, etc.) blows up the file size in the best case, or is impossible in the worst case; the whole h5 file has to be copied to flatten it again
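A small sketch illustrating the features listed in the advantages above (groups, attributes, built-in compression, fast partial reads of a single dataset); file, group, and attribute names are invented:

```python
# Hedged h5py sketch of the HDF5 features listed above; names are invented.
import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    run = f.create_group("run_001")                 # structure via groups
    run.attrs["instrument"] = "detector_A"          # annotate via attributes
    run.attrs["date"] = "2024-01-01"

    data = np.random.rand(1_000_000, 8)
    run.create_dataset("samples", data=data,
                       compression="gzip",          # built-in compression
                       chunks=True)

with h5py.File("experiment.h5", "r") as f:
    block = f["run_001/samples"][:10_000]           # fast partial read
```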
I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.
Three reasons:
you can mine the source code of any of those packages for ideas that might help you generally
you might find the package names useful in searching for Python equivalents; a lot of R users are Python users, too
under some circumstances, it might prove convenient to just link to R for a particular analysis using one of the above-linked packages and then draw the results back into Python
Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.
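To illustrate the third point above, a hedged sketch of calling into R from Python with rpy2 (one common bridge) and pulling the result back; R's quantile() is used purely as a stand-in for whichever out-of-memory R package you would actually link to:

```python
# Hedged sketch: run an R function from Python via rpy2 and bring the
# result back as plain Python floats. quantile() is illustrative only.
import rpy2.robjects as ro
from rpy2.robjects.packages import importr

stats = importr("stats")
x = ro.FloatVector([float(i) for i in range(100)])
q = stats.quantile(x, probs=ro.FloatVector([0.25, 0.5, 0.75]))
print(list(q))
```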