Creating a large pd.DataFrame - how? - python

I want to create a large pd.DataFrame out of seven 4GB .txt files, which I then want to work with and save to .csv
What I did:
I wrote a for loop that opened and concatenated the files one by one along axis=0, continuing my index (a timestamp) as it went.
However, I am running into memory problems, even though I am working on a server with 100GB of RAM. I read somewhere that pandas takes up 5-10x the on-disk size of the data.
What are my alternatives?
One is creating an empty csv, opening it together with each txt file, appending a new chunk and saving - something like the sketch below.
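A minimal sketch of that chunk-and-append approach, assuming the .txt files are tab-separated, share the same columns, and 'out.csv' is just a placeholder path:

import pandas as pd

txt_files = ["part1.txt", "part2.txt"]  # placeholder file names

first = True
for path in txt_files:
    # read each large .txt file in manageable pieces instead of all at once
    for chunk in pd.read_csv(path, sep="\t", chunksize=1_000_000):  # assumes tab-separated input
        # append to one growing csv; write the header only for the first chunk
        chunk.to_csv("out.csv", mode="a", header=first, index=True)
        first = False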
Other ideas?

Creating an HDF5 file with the h5py library will allow you to create one big dataset and access it without loading all the data into memory.
This answer provides an example of how to create and incrementally grow an hdf5 dataset: incremental writes to hdf5 with h5py
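In case the link rots, a rough sketch of that incremental pattern with h5py; the column count and the random chunks are placeholders for the real data read from the text files:

import h5py
import numpy as np

n_cols = 10  # placeholder: number of columns per chunk

with h5py.File("big.h5", "w") as f:
    # resizable dataset: unlimited rows, fixed column count
    dset = f.create_dataset("data", shape=(0, n_cols),
                            maxshape=(None, n_cols), chunks=True)
    for chunk in (np.random.rand(1000, n_cols) for _ in range(3)):  # stand-in for real chunks
        old = dset.shape[0]
        dset.resize(old + chunk.shape[0], axis=0)  # grow along the row axis
        dset[old:] = chunk                         # write only the new block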

Related

Problem exporting Pandas DataFrame with object containing documents as bytes

I am trying to export a DataFrame that contains documents as byte objects; however, I cannot find a suitable file format that does not make the relatively small DataFrame (memory usage: 254.3+ KB) expand into something in the range of hundreds of MB, or even 1GB+.
So far I have tried to export the DataFrame as CSV and HDF5.
The column causing this huge expansion contains either .pdf, .doc, .txt or .msg files in byte format:
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n256...
which were initially stored on a SQL Server as varbinary(max) and loaded with pandas' default settings.
I have simply tried exporting the DataFrame with pandas using:
df.to_csv('.csv')
and
data_stored = pd.HDFStore('documents.h5')
data_stored['document'] = df
I wanted to keep the output data compact, as I would simply like to be able to load it again at another time. The problem, however, is that the exports result in either a huge CSV or a huge .h5 file. Is there a file format that preserves the format and size of a pd.DataFrame?
I ended up exporting with df.to_pickle. I also discovered that the DataFrame was indeed much larger than I initially thought, since the pandas .info method does not include the enormous amount of overhead memory from the Python objects. To view the entire memory footprint I used df.memory_usage(deep=True).sum(), and the DataFrame did indeed take up around 1.1 GB.
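For reference, a small sketch of those two steps on a toy frame with a bytes column (names and paths are placeholders):

import pandas as pd

# toy frame standing in for the real documents DataFrame
df = pd.DataFrame({"document": [b"%PDF-1.7 ..." * 100, b"plain text" * 100]})

# .info() can understate object columns; deep=True also counts the Python objects themselves
print(df.memory_usage(deep=True).sum(), "bytes")

# pickle stores the bytes objects as-is instead of inflating them into text
df.to_pickle("documents.pkl")
df_back = pd.read_pickle("documents.pkl")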

Dimensions of data stored in HDF5

I have several .h5 files which contain pandas DataFrames created with the .to_hdf method. My question is quite simple: is it possible to retrieve the dimensions of the DataFrame stored in the .h5 file without loading all the data into RAM?
Motivation: the DataFrames stored in those HDF5 files are quite big (up to several GB), and loading all the data just to get the shape is really time consuming.
You are probably going to want to use PyTables directly.
The API reference is here, but basically:
import tables

h5file = tables.open_file("yourfile.h5", mode="r")
# <yourdataframe> is the node name the DataFrame was stored under
print(h5file.root.<yourdataframe>.table.shape)
print(len(h5file.root.<yourdataframe>.table.cols) - 1)  # first col is an index
Also, just for clarity, HDF5 does not read all the data when a dataset is opened. That would be a peculiarity of Pandas.
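If the DataFrame was saved in table format (format='table' in .to_hdf), pandas itself can report the row count from metadata alone; a small sketch, where 'yourdataframe' stands in for the actual key:

import pandas as pd

with pd.HDFStore("yourfile.h5", mode="r") as store:
    storer = store.get_storer("yourdataframe")  # placeholder key name
    print(storer.nrows)  # row count taken from the table's metadata, no data loaded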

Reading file with huge number of columns in python

I have a huge csv file with around 4 million columns and around 300 rows. The file size is about 4.3GB. I want to read this file and run some machine learning algorithms on the data.
I tried reading the file via pandas read_csv in python, but it is taking a long time to read even a single row (I suspect due to the large number of columns). I checked a few other options like numpy fromfile, but nothing seems to be working.
Can someone please suggest some way to load a file with this many columns in python?
Pandas/numpy should be able to handle that volume of data no problem. I hope you have at least 8GB of RAM on that machine. To import a CSV file with Numpy, try something like
import numpy as np
data = np.loadtxt('test.csv', dtype=np.uint8, delimiter=',')
If there is missing data, np.genfromtxt might work instead. If none of these meet your needs and you have enough RAM to hold a duplicate of the data temporarily, you could first build a Python list of lists, one per row, using readline and str.split. Then pass that to pandas or numpy, assuming that is how you intend to operate on the data. You could then save it to disk in a format that is easier to ingest later. hdf5 was already mentioned and is a good option. You can also save a numpy array to disk with numpy.savez or, my favorite, the speedy bloscpack.(un)pack_ndarray_file.
csv is very inefficient for storing large datasets. You should convert your csv file into a better suited format. Try hdf5 (h5py.org or pytables.org); it is very fast and allows you to read parts of the dataset without fully loading it into memory. A rough conversion sketch follows below.
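A rough sketch of such a conversion with h5py, assuming the values are numeric, comma-separated and free of missing entries (file names are placeholders):

import csv
import h5py
import numpy as np

with open("test.csv", newline="") as src, h5py.File("test.h5", "w") as dst:
    reader = csv.reader(src)
    first_row = np.asarray(next(reader), dtype=np.float32)
    n_cols = first_row.size
    # resizable dataset: rows unlimited, ~4 million columns fixed
    dset = dst.create_dataset("data", shape=(1, n_cols), maxshape=(None, n_cols),
                              dtype=np.float32, chunks=(1, min(n_cols, 100_000)))
    dset[0] = first_row
    for i, row in enumerate(reader, start=1):
        dset.resize(i + 1, axis=0)               # grow by one row at a time
        dset[i] = np.asarray(row, dtype=np.float32)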
According to this answer, pandas (which you already tried) is the fastest library available to read a CSV in Python, or at least was in 2014.

Loading bigger than memory hdf5 file in pyspark

I have a big file (say 20 GB) stored in HDF5 format. The file is basically a set of 3D coordinates that evolve over time (a molecular simulation trajectory). It is essentially an array of shape (8000 (frames), 50000 (particles), 3 (coordinates)).
In regular python I would simply load the hdf5 file using h5py or pytables and index it as if it were a numpy array (the library lazily loads whatever data it needs).
However, if I try to load this file in Spark using SparkContext.parallelize it obviously clogs the memory:
sc.parallelize(data, 10)
How can I handle this problem? Is there a preferred data format for huge arrays? Can I make the RDD be written to disk without passing through memory?
Spark (and Hadoop) does not have support for reading parts of HDF5 binary files. (I suspect that the reason for this is that HDF5 is a container format for storing documents and it allows specifying a tree-like hierarchy for them.)
But if you need to read the file from local disk, it is doable with Spark, especially if you know the internal structure of your HDF5 file.
Here is an example - it assumes that you will run a local Spark job, and that you know in advance that your HDF5 dataset '/mydata' consists of 100 chunks.
import h5py as h5

h5file_path = "/absolute/path/to/file"

def readchunk(v):
    # each task opens the file and reads one slice of the dataset
    f5 = h5.File(h5file_path, "r")
    return f5['/mydata'][v, :]

foo = sc.parallelize(range(0, 100)).map(lambda v: readchunk(v))
foo.count()
Going further you can modify the program to detect the number of chunks using f5['/mydata'].shape[0]
The next step would be to iterate over multiple datasets (you can list data sets with f5.keys()).
There is also another article, "From HDF5 Datasets to Apache Spark RDDs", that describes a similar approach.
The same approach would work on a distributed cluster, but it gets a little inefficient. h5py requires the file to be on a local file system, so this can be achieved in several ways: copy the file to all workers and keep it under the same location on each worker's disk, or put the file on HDFS and mount HDFS using fusefs so the workers can access the file. Both ways have some inefficiencies, but it should be good enough for ad-hoc tasks.
Here is an optimized version that opens the h5 file only once on every executor:
import h5py as h5

h5file_path = "/absolute/path/to/file"
_h5file = None

def readchunk(v):
    # the code below is executed on an executor - in another python process on a remote server
    # the original value of _h5file (None) is shipped from the driver,
    # and on the executor it is replaced with an h5.File object the first time readchunk is called
    global _h5file
    if _h5file is None:
        _h5file = h5.File(h5file_path, "r")
    return _h5file['/mydata'][v, :]

foo = sc.parallelize(range(0, 100)).map(lambda v: readchunk(v))
foo.count()
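As for keeping the result out of memory entirely, the RDD from the snippet above can be persisted or written straight to disk with standard PySpark calls; a small sketch (the output path is a placeholder):

from pyspark import StorageLevel

# keep computed partitions on executor disks instead of in RAM
foo.persist(StorageLevel.DISK_ONLY)
foo.count()

# or write the chunks out as pickled partitions without collecting them on the driver
foo.saveAsPickleFile("/absolute/path/to/output_dir")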

compressing data with HDFStore

I am a newbie to pytables and have a question about storing a compressed pandas DataFrame. My current code is:
import pandas
# HDF5 file name
H5name="C:\\MyDir\\MyHDF.h5"
# create HDF5 file
store=pandas.io.pytables.HDFStore(H5name)
# write a pandas DataFrame to the HDF5 file created
myDF.to_hdf(H5name,"myDFname",append=True)
# read the pandas DataFrame back from the HDF5 file created
myDF1=pandas.io.pytables.read_hdf(H5name,"myDFname")
# close the file
store.close()
When I checked the size of the HDF5 created, the size (212kb) was much larger than the original csv file (58kb) I used to create the pandas DataFrame.
So I tried out compression by deleting the HDF5 file and recreating it with
# create HDF5 file
store=pandas.io.pytables.HDFStore(H5name,complevel=1)
and the size of the file created did not change. I tried all complevels from 1 to 9 and the size still remained the same.
I tried to add
# create HDF5 file
store=pandas.io.pytables.HDFStore(H5name,complevel=1,complib="zlib")
but it made no difference to the compression.
What could be the problem?
Also, ideally I would like compression similar to what R does with its save function (e.g. in my case the 58kb file was saved at a size of 27kb in RData). Do I need to do any additional serialization in Python to reduce the size?
EDIT:
I am using Python 3.3.3 and Pandas 0.13.1
EDIT:
I tried with a larger 487MB csv file, whose RData size (via R's save function) is 169MB. For larger files I do see compression. Bzip2 gave the best compression of 202MB (level=9) and was the slowest to read/write. Blosc compression (level=9) gave the largest size of 276MB, but was much faster to write/read.
Not sure what R does differently in its save function, but it's both equally fast and much more compressed than any of these compression algos.
You have a really tiny file here. HDF5 basically chunks your data; usually 64KB is a minimum chunk size. Depending on what the data is, it might not even compress at that size.
You can try msgpack for a simple solution for data of this size. HDF5 is quite efficient for larger sizes and will compress quite nicely.
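For the larger files where compression does kick in, a hedged sketch of writing compressed HDF5 directly through to_hdf (the key, path and data are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1_000_000, 4), columns=list("abcd"))  # placeholder data

# table format plus blosc tends to compress well and stays readable in parts
df.to_hdf("compressed.h5", key="myDFname", format="table", complevel=9, complib="blosc")

df_back = pd.read_hdf("compressed.h5", "myDFname")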
