Process Pandas DataFrames which don't fit in memory - python

I'm manipulating a huge DataFrame stored in an HDFStore object. The table is too big to be loaded into memory all at once, so I have to extract the data chunk by chunk, which is fine for a lot of tasks.
Here is my problem: I would like to apply PCA to the table, which requires the whole DataFrame to be loaded, but I don't have enough memory to do that.
The PCA function takes a NumPy array or a pandas DataFrame as input. Is there another way to apply PCA that would work directly on an object stored on disk?
Thanks a lot in advance,
ClydeX

Seems like a perfect fit for the new IncrementalPCA in the 0.16 dev branch of scikit-learn.
Update: link to the latest stable version
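For illustration, a minimal sketch of feeding IncrementalPCA chunk by chunk from an HDFStore (the store path, table key, chunk size and number of components are hypothetical, and the table must have been written in 'table' format for chunked selects to work):

import pandas as pd
from sklearn.decomposition import IncrementalPCA

store = pd.HDFStore('data.h5', mode='r')
ipca = IncrementalPCA(n_components=10)

# First pass: learn the components chunk by chunk
for chunk in store.select('my_table', chunksize=50000):
    ipca.partial_fit(chunk.values)

# Second pass: project each chunk into the reduced space
reduced = [ipca.transform(chunk.values)
           for chunk in store.select('my_table', chunksize=50000)]
store.close()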

Related

Reading large database table into Dask dataframe

I have a 7 GB PostgreSQL table which I want to read into Python and analyse. I cannot use Pandas for it because the table is larger than the memory on my local machine. I therefore wanted to try reading the table into a Dask DataFrame first, perform some aggregations, and then switch back to Pandas for the subsequent analysis. I used the lines of code below for that.
df = dd.read_sql_table('table_xyz', uri = "postgresql+psycopg2://user:pwd@remotehost/dbname", index_col = 'column_xyz', schema = 'private')
The index_col, i.e. 'column_xyz', is indexed in the database. This works, but when I perform an action, for example an aggregation, it takes ages (like an hour) to return the result.
avg = df.groupby("col1").col2.mean().compute()
I understand that Dask is not as fast as Pandas, especially since I am working on a single machine and not a cluster. I am wondering whether I am using Dask the right way. If not, what is a faster alternative for analysing large tables that do not fit in memory using Python?
If your data fits into the RAM of your machine then you're better off using Pandas; in that case Dask will not outperform Pandas.
Alternatively, you can play around with the chunk size and see if things improve. The best way to figure this out is to open the Dask diagnostics dashboard and see what is taking Dask so long. That will help you make a much more informed decision.
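For illustration, a sketch of both suggestions (the exact keyword for controlling partition size, npartitions or bytes_per_chunk, depends on your Dask version, so treat this as an assumption rather than a definitive call):

import dask.dataframe as dd
from dask.distributed import Client

# A local scheduler exposes the diagnostics dashboard; the link is printed below
client = Client()
print(client.dashboard_link)

# Same read as in the question, but with an explicit number of partitions
df = dd.read_sql_table('table_xyz',
                       uri="postgresql+psycopg2://user:pwd@remotehost/dbname",
                       index_col='column_xyz',
                       schema='private',
                       npartitions=50)

avg = df.groupby("col1").col2.mean().compute()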

How can I create extra columns without going out of RAM, and then use it for ML algorithms?

I have a dataset with 3m+ (ordered) rows, and 100 columns, which I can load into my notebook using Pandas. I would like to append about 900 columns which are calculated using the 100 existing columns. The end goal is to train several machine learning models (NN's, Random Forests).
When I try to append the columns using Pandas, my machine runs out of RAM (I have 25 GB). I therefore tried Dask, which lets me compute the 900 columns without problems. However, Dask DataFrames cannot be used as input for Random Forests (scikit-learn) or Keras models, so I guess at some point you have to convert them back to a pandas DataFrame.
I am quite stuck at this point. Speed is quite important because I need to refit the models often. Does anyone have some good suggestions?
You should try to "shrink" the datatypes in pandas where possible, e.g. using uint8 if a column only contains integers no larger than 255.
You can find the max value of a datatype using NumPy: numpy.finfo() for float datatypes (float16/32/64) and numpy.iinfo() for integer datatypes ([u]int8/16/32/64).
pandas read_csv
You can also use the pandas read_csv chunksize option, if you can save the data as a CSV, and iterate over the file chunk by chunk.
You can also combine the two methods; see the sketch below.
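A minimal sketch combining both ideas (the file name, column names and dtype choices are hypothetical):

import numpy as np
import pandas as pd

# Check the limits of a candidate dtype before shrinking
print(np.iinfo(np.uint8).max)      # 255
print(np.finfo(np.float32).max)

dtype_map = {'col_a': np.uint8, 'col_b': np.float32}

chunks = []
for chunk in pd.read_csv('data.csv', dtype=dtype_map, chunksize=100_000):
    # Compute the extra columns on the small chunk, then keep only what is needed
    chunk['col_new'] = chunk['col_a'] * chunk['col_b']
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)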
If you want to do ML on a larger-than-RAM dataset then Dask-ML (https://ml.dask.org/) may give you what you need. It also integrates with many common third-party tools.
Note that lots of libraries that work with in-RAM data will not work well with out-of-RAM data; typically only a subset of ML libraries works well out of RAM. An alternative is to build many models, each on a subsample of the data (e.g. split your data into N DataFrames, each of which fits into RAM, build a model on each one at a time, then combine the predictions from each model as a later step); a sketch follows below.
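A minimal sketch of that split-and-combine idea, assuming a classification problem, hypothetical Parquet chunk files and a 'target' column:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

models = []
for path in ['chunk_0.parquet', 'chunk_1.parquet', 'chunk_2.parquet']:
    part = pd.read_parquet(path)          # each chunk fits into RAM on its own
    X, y = part.drop(columns=['target']), part['target']
    m = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    m.fit(X, y)
    models.append(m)

# Combine the sub-models by averaging their predicted probabilities
def predict_proba(X_new):
    return sum(m.predict_proba(X_new) for m in models) / len(models)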
Do you really need all 900 columns? One test might be to sub-sample a set of rows (e.g. 100k) with all 900 columns, build an RF sklearn model, then ask the model which columns are most useful. Maybe you only need a subset of the columns and can discard the rest. Dask can extract a subsample of rows and/or columns to a CSV or Parquet file which Pandas can read back.
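A sketch of that feature-importance test on a RAM-sized sample (the sample size, 'target' column and number of kept columns are hypothetical):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

sample = df.sample(n=100_000, random_state=0)   # assumes df holds the 900 candidate columns
X, y = sample.drop(columns=['target']), sample['target']

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)

# Rank the columns and keep, say, the 100 most useful ones
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_cols = importances.sort_values(ascending=False).head(100).index.tolist()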
Maybe you don't need a powerful model like RF or ANN? If a linear model will do then Dask's ML or a tool like https://vowpalwabbit.org/ might do the job.
Do also think about upgrading your RAM - the developer time spent learning a new tool (like Dask ML) is probably much more expensive than renting a temporary Amazon big-box that has more than enough RAM for your experiments.

Caching a data frame in joblib

Joblib has functionality for sharing NumPy arrays across processes by automatically memmapping the array. However, this makes use of NumPy-specific facilities. Pandas does use NumPy under the hood, but unless your columns all have the same data type, you can't really serialize a DataFrame to a single NumPy array.
What would be the "right" way to cache a DataFrame for reuse in Joblib?
My best guess would be to memmap each column separately, then reconstruct the DataFrame inside the loop (and pray that Pandas doesn't copy the data), something like the sketch below. But that seems like a pretty involved process.
I am aware of the standalone Memory class, but it's not clear if that can help.
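For what it's worth, a minimal sketch of the per-column memmap guess described above, using joblib's mmap_mode on load (the function and file names are hypothetical, and reconstructing the DataFrame may still copy data for mixed dtypes):

import joblib
import pandas as pd

def cache_frame(df, path):
    # Store each column's underlying array separately so joblib can memmap them
    joblib.dump({col: df[col].to_numpy() for col in df.columns}, path)

def load_frame(path):
    # mmap_mode='r' makes joblib return memory-mapped arrays instead of copies
    arrays = joblib.load(path, mmap_mode='r')
    return pd.DataFrame(arrays, copy=False)

# cache_frame(df, 'frame_cache.joblib')
# df_again = load_frame('frame_cache.joblib')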

Working with large (+15 gb) CSV datasets and Pandas/XGBoost

I am trying to find a means of starting to work with very large CSV files in Pandas, ultimately to be able to do some machine learning with XGBoost.
I am torn between using MySQL or some SQLite framework to manage chunks of my data; my issue is with the machine learning aspect later on, and with loading chunks in at a time to train the model.
My other thought was to use Dask, which is built on top of Pandas and also has XGBoost functionality.
I'm not sure what the best starting point is and was hoping to ask for an opinion! I am leaning towards Dask but I have not used it yet.
This blogpost goes through an example using XGBoost on a large CSV dataset. However, it did so using a distributed cluster with enough RAM to fit the entire dataset in memory at once. While many dask.dataframe operations can operate in small space, I don't think XGBoost training is likely to be one of them. XGBoost seems to work best when all of the data is available all the time.
I haven't tried this, but I would load your data into an HDF5 file using h5py. This library lets you store data on disk but access it like a NumPy array. That way you are no longer constrained by memory for your dataset.
For the XGBoost part, I would use the sklearn API and pass in the h5py object as the X value. I recommend the sklearn API since it accepts numpy-like arrays as input, which should let h5py objects work. Make sure to use a small value for subsample, otherwise you'll likely run out of memory fast.
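A rough sketch of that suggestion (the file and dataset names are hypothetical, and whether XGBoost can consume the h5py dataset without fully materialising it in RAM depends on the XGBoost version, so treat this as an assumption rather than a guarantee):

import h5py
import xgboost as xgb

# Hypothetical: features and labels were written to data.h5 beforehand
with h5py.File('data.h5', 'r') as f:
    X = f['features']     # h5py Dataset, read from disk on access
    y = f['labels'][:]    # labels are usually small enough to load fully

    model = xgb.XGBClassifier(n_estimators=100, subsample=0.1)
    model.fit(X, y)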

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')                    # RDD of raw text lines
parts = lines.map(lambda l: l.strip().split('\t'))  # RDD of lists of strings
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():
Does it store the Pandas object in local memory?
Is Pandas' low-level computation handled entirely by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes)
Can I convert it to Pandas and just be done with it, without touching the DataFrame API so much?
Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a Spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to, because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
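For example (the file in the question is tab-separated, so sep='\t' is assumed):

import pandas as pd

df = pd.read_csv('tail5.csv', sep='\t')   # dtypes are inferred per column
print(df.dtypes)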
Now to answer your questions:
Does it store the Pandas object in local memory:
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is Pandas' low-level computation handled entirely by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, simply some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many, many methods and functions in the pandas API that are not in the PySpark API.
Can I convert it to Pandas and just be done with it, without touching the DataFrame API so much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using some Spark context or Hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not in memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run the rdd.collect() method, you end up copying the contents of the RDD from all the worker nodes to the master node's memory. Thus you lose your distributed compute benefits (but can still run the RDD methods).
Similarly with pandas: when you run toPandas(), you copy the data frame from distributed (worker) memory to local (master) memory and lose most of your distributed compute capabilities. So one possible workflow (that I often use) might be to pre-munge your data down to a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set; a sketch follows below. Hope that helps.
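For illustration, a minimal sketch of that workflow, assuming a SparkSession named spark and hypothetical file and column names:

# Distributed part: read and aggregate on the workers
sdf = spark.read.csv('big_data.csv', sep='\t', header=True, inferSchema=True)
small_sdf = sdf.filter(sdf['col2'].isNotNull()).groupBy('col1').mean('col2')

# Only the (small) aggregated result is copied to the driver's local memory
pdf = small_sdf.toPandas()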
