Caching a data frame in joblib - python

Joblib has functionality for sharing NumPy arrays across processes by automatically memmapping them. However, this relies on NumPy-specific facilities. Pandas does use NumPy under the hood, but unless all your columns have the same data type, you can't really serialize a DataFrame to a single NumPy array.
What would be the "right" way to cache a DataFrame for reuse in Joblib?
My best guess would be to memmap each column separately, then reconstruct the dataframe inside the loop (and pray that Pandas doesn't copy the data). But that seems like a pretty intensive process.
I am aware of the standalone Memory class, but it's not clear if that can help.
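A minimal sketch of that per-column idea (not an accepted answer, just an illustration): pass the column arrays to joblib as plain NumPy arrays so its automatic memmapping can apply, and rebuild the frame inside the worker. The column names, sizes, and worker function below are placeholders, and whether the reconstruction avoids a copy depends on your pandas version and dtypes.
import numpy as np
import pandas as pd
from joblib import Parallel, delayed

def worker(arrays, columns):
    # Rebuild the DataFrame from the (possibly memmapped) column arrays.
    # copy=False is only a hint; some pandas versions/dtypes may still copy.
    df = pd.DataFrame(dict(zip(columns, arrays)), copy=False)
    return df["x"].sum()  # placeholder computation

df = pd.DataFrame({"x": np.arange(1_000_000), "y": np.random.rand(1_000_000)})
columns = list(df.columns)
arrays = [df[c].to_numpy() for c in columns]

# max_nbytes controls the threshold above which joblib memmaps the arrays.
results = Parallel(n_jobs=2, max_nbytes="1M")(
    delayed(worker)(arrays, columns) for _ in range(4)
)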

Related

How to import hdf5 data with dask in parallel and create a dask dataframe?

I am completely stuck, hence I am looking for kind advice.
My aim is to read many hdf5 files in parallel, extract the multi-dimensional arrays inside, and store each array in one row (more precisely, one cell) of a dask dataframe. I am not opting for a pandas df because I believe it will be too big.
Dask's read_hdf() cannot read HDF5 files created with h5py.
What could I do to import thousands of hdf5 files with dask in parallel and get access to the multi-dimensional arrays inside?
I would like to create a dask dataframe where each 2d array (extracted from the n-dimensional arrays inside the hdf5 files) is stored in one cell of the dataframe.
Consequently, the number of rows corresponds to the total number of arrays found in all files, here 9. I store the arrays in one column.
In the future I would like to append more columns with other data to this dask dataframe, and I would like to operate on the arrays with another Python library and store the results in further columns. The dataframe should contain all the information I extract and manipulate. I would also like to add data from other hdf5 files, like a mini database. Is this reasonable?
I can work in parallel because the arrays are independent of each other.
How would you realise this, please? xarray was suggested to me as well, but I don't know what the best way is.
Earlier I tried to collect all arrays in a multi-dimensional dask array, but the conversion to a dataframe is only possible for ndim=2.
Thank you for your advice. Have a good day.
import numpy as np
import h5py
import dask.dataframe as dd
import dask.array as da
import dask

print('This is dask version', dask.__version__)

ra = np.ones([10, 3199, 4000])
print(ra.shape)

file_list = []
for i in range(0, 4):
    fstr = 'data_{0}.h5'.format(str(i))
    hf = h5py.File('./' + fstr, 'w')
    hf.create_dataset('dataset_{0}'.format(str(i)), data=ra)
    hf.close()
    file_list.append(fstr)

# !ls  (notebook shell command to list the created files)
print(file_list)

for i, fn in enumerate(file_list):
    dd.read_hdf(fn, key='dataset_{0}'.format(str(i)))  # breaks here
You can pre-process the data into dataframes using dask.distributed and then convert the futures to a single dask.dataframe using dask.dataframe.from_delayed.
from dask.distributed import Client
import dask.dataframe as dd
import xarray as xr

client = Client()

def preprocess_hdf_file_to_dataframe(filepath):
    # process your data into a dataframe however you want, e.g.
    with xr.open_dataset(filepath) as ds:
        return ds.to_dataframe()

files = ['file1.hdf5', 'file2.hdf5']
futures = client.map(preprocess_hdf_file_to_dataframe, files)
df = dd.from_delayed(futures)
That said, this seems like a perfect use case for xarray, which can read HDF5 files and work with dask natively, e.g.
ds = xr.open_mfdataset(files)
This dataset is similar to a dask.dataframe, in that it contains references to dask.arrays which are read from the file. But xarray is built to handle N-dimensional arrays natively and can work much more naturally with the HDF5 format.
There are certainly areas where dataframes make more sense than a Dataset or DataArray, though, and converting between them can be tricky with larger-than-memory data, so the first approach is always an option if you want a dataframe.
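If you do end up wanting a dataframe from the xarray route, Dataset.to_dask_dataframe() exists. A minimal sketch, assuming the files are in a layout xarray can open (plain h5py files may need the h5netcdf engine or the first approach above), and noting that the conversion flattens the N-d arrays into long-format rows rather than one array per cell:
import xarray as xr

# Placeholder file pattern; chunks={} keeps the variables as lazy dask arrays.
ds = xr.open_mfdataset("data_*.h5", chunks={})
ddf = ds.to_dask_dataframe()  # lazy dask.dataframe, one row per array element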

Iterating Dask Dataframe

I'm trying to create a Keras Tokenizer out of a single column from hundreds of large CSV files. Dask seems like a good tool for this. My current approach eventually causes memory issues:
df = dd.read_csv('data/*.csv', usecols=['MyCol'])
# Process column and get underlying Numpy array.
# This greatly reduces memory consumption, but eventually materializes
# the entire dataset into memory
my_ids = df.MyCol.apply(process_my_col).compute().values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(my_ids)
How can I do this by parts? Something along the lines of:
df = pd.read_csv('a-single-file.csv', chunksize=1000)
for chunk in df:
    # Process a chunk at a time
A Dask DataFrame is essentially a collection of pandas dataframes, called partitions. When you pull out the underlying numpy array you destroy that partitioning structure and end up with one big array. I recommend using the map_partitions method of Dask DataFrames to apply regular pandas functions to each partition separately.
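A minimal sketch of the map_partitions approach, reusing the column name and process_my_col from the question (both assumed to be defined):
import dask.dataframe as dd

df = dd.read_csv("data/*.csv", usecols=["MyCol"])

def process_partition(pdf):
    # pdf is an ordinary pandas DataFrame holding one partition.
    return pdf["MyCol"].apply(process_my_col)

my_ids = df.map_partitions(process_partition)  # still lazy; nothing is loaded yet
# Calling my_ids.compute() would materialize everything at once; see the
# partition loop below for processing one partition at a time.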
I also recommend map_partitions when it suits your problem. However, if you really just want sequential access and an API similar to read_csv(chunksize=...), then you might be looking for the partitions attribute:
for part in df.partitions:
    process(model, part.compute())
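Applied to the Tokenizer use case from the question, and relying on the fact that fit_on_texts accumulates vocabulary counts across calls, a sketch might look like this (column name and file pattern are placeholders):
import dask.dataframe as dd
from tensorflow.keras.preprocessing.text import Tokenizer

df = dd.read_csv("data/*.csv", usecols=["MyCol"])
tokenizer = Tokenizer()

for part in df.partitions:
    # Only one partition is held in memory at a time.
    texts = part["MyCol"].compute()
    tokenizer.fit_on_texts(texts.astype(str).tolist())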

What is the Spark DataFrame method `toPandas` actually doing?

I'm a beginner with the Spark DataFrame API.
I use this code to load a tab-separated CSV into a Spark DataFrame:
from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)
Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():
Does it store the Pandas object in local memory?
Is the Pandas low-level computation all handled by Spark?
Does it expose all pandas DataFrame functionality? (I guess yes)
Can I convert it to pandas and just be done with it, without touching the DataFrame API much?
Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.
It seems like you might be misunderstanding the use cases of the technologies in play here.
Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.
In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
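A minimal pandas equivalent of the Spark snippet above, assuming the same tab-separated file and that fnames holds the column names:
import pandas as pd

df = pd.read_csv("tail5.csv", sep="\t", names=fnames)  # types inferred per column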
Now to answer your questions:
Does it store the Pandas object in local memory?
Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
Is the Pandas low-level computation all handled by Spark?
No. Pandas runs its own computations; there's no interplay between Spark and pandas, just some API compatibility.
Does it expose all pandas DataFrame functionality?
No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
Can I convert it to pandas and just be done with it, without touching the DataFrame API much?
Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.
Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.
Using some spark context or hive context method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run a rdd.collect() method, you end up copying the contents of the rdd from all the worker nodes to the master node memory. Thus you lose your distributed compute benefits (but can still run the rdd methods).
Similarly with pandas, when you run toPandas(), you copy the data frame from distributed (worker) memory to the local (master) memory and lose most of your distributed compute capabilities. So, one possible workflow (that I often use) might be to pre-munge your data into a reasonable size using distributed compute methods and then convert to a Pandas data frame for the rich feature set. Hope that helps.
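A rough sketch of that workflow, assuming an existing SparkSession named spark and hypothetical paths and column names: aggregate on the workers first, then pull only the small result to the driver.
sdf = spark.read.csv("hdfs:///data/*.csv", sep="\t", header=True, inferSchema=True)
summary = sdf.groupBy("user_id").count()  # runs distributed on the workers
pdf = summary.toPandas()                  # only the aggregated result reaches the driver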

Process Pandas DataFrames which don't fit in memory

I'm manipulating a huge DataFrame stored in an HDFStore. The table is too big to be loaded completely into memory, so I have to extract the data chunk by chunk, which is fine for most tasks.
Here comes my problem: I would like to apply PCA to the table, which requires the whole DataFrame to be loaded, but I don't have enough memory to do that.
The PCA function takes a numpy array or a pandas DataFrame as input. Is there another way to apply PCA that would work directly on an object stored on disk?
Thanks a lot in advance,
ClydeX
Seems like a perfect fit for the new IncrementalPCA in the 0.16 dev branch of scikit-learn.
Update: link to the latest stable version
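A minimal sketch of combining IncrementalPCA with chunked reads from an HDFStore; the store path, table key, chunk size, and component count are placeholders, and the data must be stored in the appendable "table" format for chunked select to work.
import pandas as pd
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=10)
with pd.HDFStore("data.h5", mode="r") as store:
    # First pass: fit the PCA one chunk at a time.
    for chunk in store.select("df", chunksize=50000):
        ipca.partial_fit(chunk.values)
    # Second pass: transform chunk by chunk, writing results wherever needed.
    for chunk in store.select("df", chunksize=50000):
        reduced = ipca.transform(chunk.values)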

How to save big array so that it will take less memory in python?

I am new to Python. I have a big array, a, with dimensions such as (43200, 4000), and I need to save it for future processing. When I try to save it with np.savetxt, the txt file is too large and my program runs into a memory error, as I need to process 5 files of the same size. Is there any way to save huge arrays so that they take less memory?
Thanks.
Saving your data to a text file is hugely inefficient. NumPy has built-in saving functions, np.save and np.savez/np.savez_compressed, which are much better suited to storing large arrays.
Depending on how you plan to use your data, you should also look into the HDF5 format (h5py or PyTables), which allows you to store large data sets without having to load them all into memory.
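A short illustration of the NumPy options (the array below is a placeholder; using float32 also halves the size compared to the default float64):
import numpy as np

a = np.zeros((43200, 4000), dtype=np.float32)  # placeholder data

np.save("a.npy", a)                # fast binary dump; reload with np.load("a.npy")
np.savez_compressed("a.npz", a=a)  # compressed archive; a = np.load("a.npz")["a"]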
You can use PyTables to create a Hierarchical Data Format (HDF) file to store the data. This provides some interesting in-memory options that link the object you're working with to the file it's saved in.
Here is another StackOverflow question that demonstrates how to do this: "How to store a NumPy multidimensional array in PyTables."
If you are willing to work with your array as a Pandas DataFrame object, you can also use the Pandas interface to PyTables / HDF5, e.g.:
import pandas
import numpy as np
a = np.ones((43200, 4000)) # Not recommended.
x = pandas.HDFStore("some_file.hdf")
x.append("a", pandas.DataFrame(a)) # <-- This will take a while.
x.close()
# Then later on...
my_data = pandas.HDFStore("some_file.hdf") # might also take a while
usable_a_copy = my_data["a"] # Be careful of the way changes to
# `usable_a_copy` affect the saved data.
copy_as_nparray = usable_a_copy.values
With files of this size, you might consider whether your application can be performed with a parallel algorithm and potentially applied to only subsets of the large arrays rather than needing to consume all of the array before proceeding.
