I have a big file (say 20 GB) stored in HDF5 format. The file is basically a set of 3D coordinates that evolve over time (a molecular simulation trajectory), i.e. an array of shape (8000 (frames), 50000 (particles), 3 (coordinates)).
In regular Python I would simply load the HDF5 file using h5py or PyTables and index it as if it were a numpy array (the library lazily loads whatever data it needs).
However, if I try to load this file in Spark using SparkContext.parallelize, it obviously clogs the memory:
sc.parallelize(data, 10)
How can I handle this problem? Is there a preferred data format for huge arrays? Can I have the RDD written to disk without it going through memory?
Spark (and Hadoop) has no support for reading parts of HDF5 binary files. (I suspect the reason is that HDF5 is a container format for storing documents, and it allows you to specify a tree-like hierarchy for those documents.)
But if you need to read the file from local disk, it is doable with Spark, especially if you know the internal structure of your HDF5 file.
Here is an example - it assumes that you'll run a local Spark job, and that you know in advance that your HDF5 dataset '/mydata' consists of 100 chunks.
import h5py as h5

h5file_path = "/absolute/path/to/file"

def readchunk(v):
    # open the file inside the task and read one slice along the first axis
    f5 = h5.File(h5file_path, "r")
    return f5['/mydata'][v, :]

foo = sc.parallelize(range(0, 100)).map(lambda v: readchunk(v))
foo.count()
Going further, you can modify the program to detect the number of chunks using f5['/mydata'].shape[0].
The next step would be to iterate over multiple datasets (you can list the datasets with f5.keys()).
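A small sketch of those two steps (reusing readchunk from the example above; the dataset name '/mydata' is still an assumption):

import h5py as h5

h5file_path = "/absolute/path/to/file"

# Inspect the file once on the driver to build the work list
with h5.File(h5file_path, "r") as f5:
    dataset_names = list(f5.keys())        # all top-level datasets in the file
    n_chunks = f5['/mydata'].shape[0]      # number of slices along the first axis

foo = sc.parallelize(range(n_chunks)).map(readchunk)
foo.count()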
There is also another article, "From HDF5 Datasets to Apache Spark RDDs", that describes a similar approach.
The same approach works on a distributed cluster, but it gets a little inefficient: h5py requires the file to be on a local file system. This can be achieved in several ways: copy the file to all workers and keep it under the same location on each worker's disk, or put the file on HDFS and mount HDFS with fusefs so the workers can access it. Both ways have some inefficiencies, but it should be good enough for ad-hoc tasks.
Here is an optimized version that opens the h5 file only once on every executor:
import h5py as h5

h5file_path = "/absolute/path/to/file"
_h5file = None

def readchunk(v):
    # The code below is executed on the executor - in another Python process on a remote server.
    # The original value of _h5file (None) is sent from the driver, and on the executor it is
    # replaced with an h5.File object the first time readchunk is called.
    global _h5file
    if _h5file is None:
        _h5file = h5.File(h5file_path, "r")
    return _h5file['/mydata'][v, :]

foo = sc.parallelize(range(0, 100)).map(lambda v: readchunk(v))
foo.count()
I am working with a heavily nested, non-splittable JSON input dataset housed in S3.
The files vary a lot in size - the smallest is 10 KB while others are 300 MB.
When reading the files using the code below, a simple repartition to the desired number of partitions leads to straggling tasks - most tasks finish within seconds, but one lasts for a couple of hours and then runs into memory issues (heartbeat missing / heap space etc.).
I repartition in an attempt to randomize the partition-to-file mapping, since Spark may be reading files in sequence, and files within the same directory tend to have the same nature - all large or all small, etc.
df = spark.read.json('s3://my/parent/directory')
df.repartition(396)
# Settings (few):
default parallelism = 396
total number of cores = 400
What I tried:
I figured that the input partition scheme (S3 partitions, not Spark) - that is, the folder hierarchy - might be leading to this skewed-partitions problem, where some S3 folders (technically 'prefixes') have just one file while others have thousands, so I transformed the input to a flattened directory structure using a hashcode, where each folder has just one file:
Earlier:
/parent1
/file1
/file2
.
.
/file1000
/parent2/
/file1
Now:
hashcode=FEFRE#$#$$#FE/parent1/file1
hashcode=#$#$#Cdvfvf##/parent1/file1
But it didn't have any effect.
I have tried with really large clusters too, thinking that even if there is input skew, that much memory should be able to handle the larger files. But I still get straggling tasks.
When I check the number of files assigned to each partition (each file becomes a row in the dataframe due to its nested, unsplittable nature), I see between 2 and 32 files per partition. Is this because Spark assigns files to partitions based on spark.sql.files.maxPartitionBytes - putting only two files into a partition when the files are huge, and many more when the files are small?
Any recommendations to make the job work properly and distribute the tasks uniformly, given that the size of the input files cannot be changed due to the nature of the input?
EDIT: Code added for latest trial per Abdennacer's comment
This is the entire code.
Results:
The job gets stuck in Running - even though in the worker logs I see the task finished. The driver log has the error 'can not increase buffer size' - I don't know what is causing the 2 GB buffer issue, since I am not issuing any 'collect' or similar statement.
The configuration for this job is as below, to ensure the executor has huge memory per task.
Driver/Executor Cores: 2.
Driver/Executor memory: 198 GB.
# sample s3 source path:
# s3://masked-bucket-urcx-qaxd-dccf/gp/hash=8e71405b-d3c8-473f-b3ec-64fa6515e135/group=34/db_name=db01/2022-12-15_db01.json.gz

from pyspark.sql import SparkSession

#####################
# Boilerplate
#####################
spark = SparkSession.builder.appName('query-analysis').getOrCreate()

#####################
# Queries Dataframe
#####################
# S3_STL_RAW and S3_DEST_PATH are defined elsewhere in the job
dfq = spark.read.json(S3_STL_RAW, primitivesAsString="true")
dfq.write.partitionBy('group').mode('overwrite').parquet(S3_DEST_PATH)
Great job flattening the files to increase read speed. Prefixes, as you seem to understand, are related to buckets, and bucket read speed is related to the number of files under each prefix and their size. The approach you took will make reads faster than your original strategy, but it will not help with skew in the data itself.
One thing you might consider is that your raw data and your working data do not need to be the same set of files. There is a strategy of landing data and then pre-processing it for performance.
That is to say, keep the raw data in the format you have now, then make a copy of the data in a more convenient format for regular queries. (Parquet is the best choice for working with S3.)
Land data in a 'landing zone'.
As needed, process the data stored in the landing zone into a convenient splittable format for querying (a 'pre-processed' folder).
Once your raw data is processed, move it to a 'processed' folder (use your existing flat folder structure). Keeping this processed raw data is important should you need to rebuild the table or make changes to the table format.
Create a view that is a union of the data in the 'landing zone' and the 'pre-processed' folder. This gives you a performant table with up-to-date data; a rough sketch follows below.
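As a rough sketch of steps 2 and 4 (the paths and partition column are made up, and it assumes Spark 3.1+ for unionByName(..., allowMissingColumns=True) plus roughly matching schemas on both sides):

# Hypothetical bucket layout - adjust to your own paths
LANDING = "s3://my-bucket/landing/"             # raw, unsplittable JSON as it arrives
PREPROCESSED = "s3://my-bucket/preprocessed/"   # splittable Parquet copies

# Pre-process: convert whatever is sitting in the landing zone to Parquet
landing_df = spark.read.json(LANDING, primitivesAsString="true")
landing_df.write.mode("append").partitionBy("group").parquet(PREPROCESSED)

# Query view: union the already-converted data with anything not yet converted,
# so queries always see up-to-date data
preprocessed_df = spark.read.parquet(PREPROCESSED)
all_data = preprocessed_df.unionByName(landing_df, allowMissingColumns=True)
all_data.createOrReplaceTempView("all_data")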
If you are using the latest S3 you should get consistent reads, which let you ensure you are querying all of the data. In the past, S3 was eventually consistent, meaning you might miss some data while it was in transit; this is supposedly fixed in recent versions of S3. Run this 'processing' as often as needed and you should have a performant table to run large queries on.
S3 was designed as cheap long-term storage. It's not made to perform quickly, but they've been trying to make it better over time.
It should be noted this will solve skew on read but won't solve skew on query. To solve the query portion you can enable Adaptive Query Execution in your config (it adaptively adds an extra shuffle to make queries run faster):
spark.sql.adaptive.enabled=true
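For example, with Spark 3.x this can be set when the session is built; the skew-join setting below is optional:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("query-analysis")
    .config("spark.sql.adaptive.enabled", "true")           # adaptive query execution
    .config("spark.sql.adaptive.skewJoin.enabled", "true")  # optionally split skewed shuffle partitions in joins
    .getOrCreate()
)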
I need to manage a large amount of physiological waveform data, like ECGs, and so far have found HDF5 to be the best for compatibility when doing machine learning/data science in Python (pandas, PyTorch, scikit-learn, etc.). The ability to slice/query/read only certain rows of a dataset is particularly appealing. However, I am struggling to develop a stable wrapper class that allows simple yet reliable parallel reads from many multiprocessing workers. I am fine with single-threaded writes, as I only have to ETL my source data into the HDF5 once, but the lack of parallel reads really hurts run times during data analysis, PyTorch model runs, etc. I have tried both PyTables via pandas and h5py, with the following scenarios:
One large HDF5 file
Could only get single-process reads stable; multiprocessing pretty quickly corrupted the entire file.
One HDF5 file per time-series interval
i.e. hundreds of millions of files. Still had collisions leading to corrupted files, even when opening read-only with a context manager. It also means backing up or transferring so many files is very slow due to the high IO cost.
Apologies for no code samples, but I've tried dozens of approaches over the last few months (each hundreds of lines) and eventually always ran into stability issues. Is there any good example code for how to build a dataset wrapper class backed by HDF5? AFAIK, h5py's MPI-based "Parallel HDF5" is not relevant since this is on one machine, and "Single Writer Multiple Reader" is not necessary as I am fine writing single-threaded before any reading. Ideally, whenever I need data in my project, I'd like to be able to do something like the following:
import dataset  # my HDF5 dataset wrapper class
import multiprocessing as mp

def dataloader(idxs):
    temp = []
    ds = dataset.Dataset()
    for _, idx in idxs.iterrows():
        df = ds.get_data(idx)
        # do some data wrangling in parallel
        temp.append(df)
    return temp

pool = mp.Pool(6)
data = pool.map(dataloader, indices)
I've recently been working with large matrices. My inputs are stored as 15 GB .npz files, which I am trying to read incrementally, in small batches.
I am familiar with memory mapping, and having seen that numpy also supports these kinds of operations, it seemed like a perfect solution. However, the problem I am facing is as follows:
I first load the matrix with:
foo = np.load('matrix.npz',mmap_mode="r+")
foo has a single key: data.
When I try to, for example do:
foo['data'][1][1]
numpy seems to endlessly consume the available RAM, almost as if there were no memory mapping at all. Am I doing anything wrong?
My goal would be, for example, to read 30 lines at a time:
for x in np.arange(0, matrix.shape[0], 30):
    batch = matrix[x:(x+30), :]
    do_something_with(batch)
Thank you!
My guess would be that mmap_mode="r+" is ignored when the file in question is a zipped numpy file. I haven't used numpy in this way, so some of what follows is my best guess. The documentation for load states:
If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.
There is no mention of what it does with mmap_mode. And indeed, in the code for loading .npz files, no use is made of the mmap_mode keyword:
if magic.startswith(_ZIP_PREFIX):
    # zip-file (assume .npz)
    # Transfer file ownership to NpzFile
    tmp = own_fid
    own_fid = False
    return NpzFile(fid, own_fid=tmp, allow_pickle=allow_pickle,
                   pickle_kwargs=pickle_kwargs)
So, your initial guess is indeed correct: numpy uses all of the RAM because no memmapping is occurring. This is a limitation of the implementation of load; since the npz format is an uncompressed zip archive, it should be possible to memmap the variables (unless, of course, your files were created with savez_compressed).
Implementing a load function that memmaps npz would be quite a bit of work, though, so you may want to take a look at structured arrays. They provide similar usage (access of fields by keys) and are already compatible with memmapping.
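For example, one workaround (a rough sketch, untested against your data) is a one-off conversion of the array inside the .npz to a plain .npy file, which np.load can memory-map; note the conversion step itself loads the full array into RAM once:

import numpy as np

# One-off conversion: pull the 'data' array out of the archive into a plain .npy file
with np.load('matrix.npz') as archive:
    np.save('matrix.npy', archive['data'])

# .npy files do support memory mapping, so nothing is read until you slice
matrix = np.load('matrix.npy', mmap_mode='r')

# Read 30 rows at a time, as in the question
for x in np.arange(0, matrix.shape[0], 30):
    batch = np.asarray(matrix[x:(x + 30), :])   # materialize only this slice
    # do_something_with(batch)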
I am trying to implement an out-of-core version of the k-means clustering algorithm in Python. I learned about Dask from this git project: K-Mean Parallel... Dask...
I use the same git project, but am trying to load my data, which is in the form of a binary file. The binary file contains data points with 1024 floating-point features each.
My problem is how to load the data when it is very huge, i.e. larger than the available memory itself. I tried to use numpy's fromfile function but my kernel somehow dies. Some of my questions are:
Q. Is it possible to load data into numpy from some other source (the file was not created by numpy but by a C script)?
Q. Is there a module for Dask that can load data directly from a binary file? I have seen CSV files used but nothing related to binary files.
I only dabble in Dask, but calling np.fromfile via dask.delayed (a sketch is at the end of this answer) should allow you to work with the data lazily. That said, I'm working on this myself, so this is currently a partial answer.
For your first question: I currently load .bin files created by a LabVIEW program with no issues, using code similar to this:
import os
import numpy as np

myfile = "data.bin"                # path to your .bin file
method = "rb"                      # binary read mode
chunkSize = 1_000_000              # number of float32 values per chunk; adjust as needed
bytesPerChunk = chunkSize * np.dtype(np.float32).itemsize
fileSize = os.path.getsize(myfile)

data = []
with open(myfile, method) as file:
    for chunk in range(0, fileSize, bytesPerChunk):
        data.append(np.fromfile(file, dtype=np.float32, count=chunkSize))
For the second question: I have not been able to find anything built into Dask for dealing with raw binary files, so I find that a conversion to something Dask can use is worth it.
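That said, to illustrate the dask.delayed idea mentioned at the top of this answer, here is a rough, untested sketch that wraps chunked np.fromfile reads into a lazy Dask array (the file name and chunk size are made up, the 1024-feature layout comes from the question, and the offset argument needs a reasonably recent numpy):

import os
import numpy as np
import dask
import dask.array as da

filename = "data.bin"            # hypothetical path to the binary file
n_features = 1024                # floats per data point, as described in the question
dtype = np.dtype(np.float32)
rows_per_chunk = 100_000         # tune so one chunk fits comfortably in memory

n_rows = os.path.getsize(filename) // (n_features * dtype.itemsize)

@dask.delayed
def load_chunk(row_start, n_rows_in_chunk):
    # Read only this chunk's bytes; runs lazily, when Dask actually needs the data
    offset = row_start * n_features * dtype.itemsize
    arr = np.fromfile(filename, dtype=dtype, count=n_rows_in_chunk * n_features, offset=offset)
    return arr.reshape(n_rows_in_chunk, n_features)

chunks = []
for start in range(0, n_rows, rows_per_chunk):
    n = min(rows_per_chunk, n_rows - start)
    chunks.append(da.from_delayed(load_chunk(start, n), shape=(n, n_features), dtype=dtype))

data = da.concatenate(chunks, axis=0)   # lazy (n_rows, 1024) array for downstream k-means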
I want to create a large pd.DataFrame out of 7 .txt files of 4 GB each, which I want to work with and then save to .csv.
What I did:
I created a for loop and opened and concatenated the files one by one on axis=0, continuing my index (a timestamp) as I went.
However, I am running into memory problems, even though I am working on a 100 GB RAM server. I read somewhere that pandas takes up 5-10x the data size.
What are my alternatives?
One is creating an empty csv, then opening it together with each txt file, appending a new chunk, and saving.
Other ideas?
Creating an HDF5 file with the h5py library will allow you to create one big dataset and access it without loading all the data into memory.
This answer provides an example of how to create and incrementally grow an HDF5 dataset: incremental writes to hdf5 with h5py.
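A rough sketch of that pattern (file names, column count, and chunk size are made up; it assumes the .txt files hold numeric columns that pandas can parse):

import h5py
import pandas as pd

txt_files = ["part1.txt", "part2.txt"]   # your 7 source files
n_cols = 10                              # assumed number of numeric columns

with h5py.File("combined.h5", "w") as f:
    # Resizable dataset: starts empty and grows along the first axis as chunks arrive
    dset = f.create_dataset("data", shape=(0, n_cols), maxshape=(None, n_cols),
                            dtype="float64", chunks=True)
    for path in txt_files:
        for chunk in pd.read_csv(path, chunksize=1_000_000):
            values = chunk.to_numpy()
            dset.resize(dset.shape[0] + values.shape[0], axis=0)
            dset[-values.shape[0]:] = values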