I need to manage a large amount of physiological waveform data, like ECGs, and so far I have found HDF5 to be the best format for compatibility when doing machine learning/data science in Python (pandas, PyTorch, scikit-learn, etc.). The ability to slice/query/read only certain rows of a dataset is particularly appealing. However, I am struggling to develop a stable wrapper class that allows simple yet reliable parallel reads from many multiprocessing workers. I am fine with single-threaded writes, as I only have to ETL my source data into the HDF5 file once, but the lack of parallel reads really hurts run times during data analysis, PyTorch model runs, etc. I have tried both PyTables via pandas and h5py with the following scenarios:
One large HDF5 file
I could only get single-process reads stable; multiprocessing pretty quickly corrupted the entire file.
One HDF5 file per time series interval
I.e., hundreds of millions of files. I still get collisions leading to corrupted files, even when opening read-only with a context manager. It also means backing up or transferring so many files is very slow due to the high I/O cost.
Apologies for no code samples, but I've tried dozens of approaches over the last few months (each was hundreds of lines) and eventually always ran into stability issues. Is there any good example code of how to build a dataset wrapper class backed by HDF5? AFAIK, h5py's MPI-based "Parallel HDF5" is not relevant since this is on one machine, and "Single Writer Multiple Reader" is not necessary as I am fine writing single-threaded before any reading. Ideally, I'd hope that whenever I need data in my project, I could just do something like the following:
import dataset  # my HDF5 dataset wrapper class
import multiprocessing as mp

def dataloader(idxs):
    temp = []
    ds = dataset.Dataset()
    for _, idx in idxs.iterrows():
        df = ds.get_data(idx)
        # do some data wrangling in parallel
        temp.append(df)
    return temp

pool = mp.Pool(6)
data = pool.map(dataloader, indices)
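For reference, here is a minimal sketch of the kind of wrapper this usage implies, assuming h5py, a single file, and one dataset per record id (the path and layout are hypothetical). Each worker process opens its own read-only handle lazily after the fork, which is one common way to avoid the corruption described above:

import h5py
import numpy as np

class Dataset:
    """Sketch only: each process gets its own lazily opened, read-only handle."""

    def __init__(self, path="waveforms.h5"):    # hypothetical file name
        self.path = path
        self._file = None                       # do not open here; open per process

    def _handle(self):
        if self._file is None:
            self._file = h5py.File(self.path, "r")   # read-only, opened inside the worker
        return self._file

    def get_data(self, idx):
        # assumes one dataset per record id, keyed by its string form (hypothetical layout)
        return np.asarray(self._handle()[str(idx)][...])

Because get_data only touches the file after Dataset() is constructed inside dataloader, each pool worker ends up with an independent handle rather than one inherited across the fork.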
I have a large set of Parquet files that I am trying to sort on a column. Uncompressed, the data is around 14 GB, so Dask seemed like the right tool for the job. All I'm doing with Dask is:
Reading the parquet files
Sorting on one of the columns (called "friend")
Writing as parquet files in a separate directory
I can't do this without the Dask process (there's just one; I'm using the synchronous scheduler) running out of memory and getting killed. This surprises me, because no single partition is more than ~300 MB uncompressed.
I've written a little script to profile Dask with progressively larger portions of my dataset, and I've noticed that Dask's memory consumption scales with the size of the input. Here's the script:
import os

import dask
import dask.dataframe as dd
from dask.diagnostics import ResourceProfiler, ProgressBar

def run(input_path, output_path, input_limit):
    dask.config.set(scheduler="synchronous")

    filenames = os.listdir(input_path)
    full_filenames = [os.path.join(input_path, f) for f in filenames]

    rprof = ResourceProfiler()
    with rprof, ProgressBar():
        df = dd.read_parquet(full_filenames[:input_limit])
        df = df.set_index("friend")
        df.to_parquet(output_path)

    rprof.visualize(file_path=f"profiles/input-limit-{input_limit}.html")
Here are the charts produced by the visualize() call:
(Charts omitted: one resource profile each for input limits 2, 4, 8, and 16.)
The full dataset is ~50 input files, so at this rate of growth I'm not surprised that the job eats up all of the memory on my 32 GB machine.
My understanding is that the whole point of Dask is to allow you to operate on larger-than-memory datasets. I get the impression that people are using Dask to process datasets far larger than my ~14 GB one. How do they avoid this issue of scaling memory consumption? What am I doing wrong here?
I'm not interested in using a different scheduler or in parallelism at this point. I'd just like to know why Dask is consuming so much more memory than I would have thought necessary.
This turns out to have been a performance regression in Dask that was fixed in the 2021.03.0 release.
See this GitHub issue for more info.
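If you hit this, one quick check (a sketch; adjust the version pin to your environment) is to confirm the installed version and upgrade past the fix:

import dask
print(dask.__version__)   # anything older than 2021.03.0 still contains the regression
# e.g. upgrade with:  pip install --upgrade "dask[dataframe]>=2021.03.0"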
I have a 500+ MB CSV data file. My question is: which would be faster for data manipulation (e.g., reading, processing), Pandas or a Python MySQL client? The MySQL client might be faster since all work is mapped into SQL queries and optimization is left to the optimizer. But, at the same time, Pandas is dealing with a local file, which should be faster than communicating with a server.
I have already checked "Large data" work flows using pandas, Best practices for importing large CSV files, Fastest way to write large CSV with Python, and Most efficient way to parse a large .csv in python?. However, I haven't really found any comparison regarding Pandas and MySQL.
Use Case:
I am working on a text dataset that consists of 1,737,123 rows and 8 columns. I am feeding this dataset into an RNN/LSTM network. Prior to feeding it in, I do some preprocessing, namely encoding with a customized encoding algorithm.
More details
I have 250+ experiments to do and 12 architectures (different model designs) to try.
I am confused; I feel I am missing something.
There's no comparison online because these two scenarios give different results:
With Pandas, you end up with a DataFrame in memory (as a NumPy ndarray under the hood), accessible as native Python objects
With MySQL client, you end up with data in a MySQL database on disk (unless you're using an in-memory database), accessible via IPC/sockets
So, the performance will depend on:
how much data needs to be transferred over lower-speed channels (IPC, disk, network)
how fast transferring is compared with processing (which of them is the bottleneck)
which data format your processing facilities prefer (i.e. what additional conversions will be involved)
E.g.:
If your processing facility can reside in the same (Python) process that will be used to read the data, reading it directly into Python types is preferable, since you won't need to transfer it all to the MySQL process and then back again (converting formats each time).
OTOH, if your processing facility is implemented in some other process and/or language, or e.g. resides within a computing cluster, hooking it up to MySQL directly may be faster, since it eliminates the comparatively slow Python from the equation, and because you'll need to transfer the data again and convert it into the processing app's native objects anyway.
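As a rough illustration of the two paths (a sketch only: the file, table, and column names are hypothetical, and it assumes SQLAlchemy plus a MySQL driver such as PyMySQL):

import pandas as pd
from sqlalchemy import create_engine

# Path 1: parse the CSV straight into an in-process DataFrame (NumPy-backed)
df_local = pd.read_csv("data.csv")

# Path 2: load the data into MySQL once, then pull rows back over a socket each time
engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
df_local.to_sql("mytable", engine, if_exists="replace", index=False)
df_remote = pd.read_sql("SELECT * FROM mytable WHERE some_col > 0", engine)

For the RNN/LSTM use case in the question, where the encoding happens in the same Python process, the first path avoids the extra IPC and format conversions described above.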
I'm trying out Dask on a simple embarrassingly parallel read of 24 scientific data files, each ~250 MB, so ~6 GB in total. The data is in a 2D array format. It's stored on a parallel file system and read in from a cluster, though I'm only reading from a single node right now. The data is in a format similar to HDF5 (called Adios) and is read similarly to the h5py package. Each file takes about 4 seconds to read. I'm following the skimage reading example here (http://docs.dask.org/en/latest/array-creation.html). However, I never get a speed-up, no matter how many workers I use. I thought perhaps I was using it wrong and was still only using 1 worker, but when I profile it there do appear to be 24 workers. How can I get a speed-up when reading this data?
import adios as ad
import numpy as np
import dask.array as da
import dask

bpread = dask.delayed(lambda f: ad.file(f)['data'][...], pure=True)
lazy_datas = [bpread(path) for path in paths]
sample = lazy_datas[0].compute()

# read in data
arrays = [da.from_delayed(lazy_data, dtype=sample.dtype, shape=sample.shape) for lazy_data in lazy_datas]
datas = da.stack(arrays, axis=0)
datas2 = datas.compute(scheduler='processes', num_workers=24)
I recommend looking at the /profile tab of the scheduler's dashboard. This will tell you what lines of code are taking up the most time.
My first guess is that you are already maxing out your disk's ability to serve data to you. You aren't CPU bound, so adding more cores won't help. That's just a guess, though; as always, you'll have to profile and investigate your situation further to know for sure.
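If you want the dashboard mentioned above, one option (this assumes dask.distributed is installed, which the question's code does not require) is to run the compute through a local distributed Client, which serves the dashboard with the /profile tab:

from dask.distributed import Client

client = Client(n_workers=24)    # local cluster on this single node
print(client.dashboard_link)     # open this URL; the /profile tab shows per-line timings
datas2 = datas.compute()         # `datas` is the stacked dask array from the question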
I'm trying to apply machine learning (Python with scikit-learn) to a large dataset stored in a CSV file which is about 2.2 gigabytes.
As this is a partially empirical process, I need to run the script numerous times, which results in the pandas.read_csv() function being called over and over again, and it takes a lot of time.
Obviously, this is very time consuming, so I guess there must be a way to make reading the data faster - like storing it in a different format or caching it in some way.
A code example in the solution would be great!
I would store already parsed DFs in one of the following formats:
HDF5 (fast, supports conditional reading / querying, supports various compression methods, supported by different tools/languages)
Feather (extremely fast - makes sense to use on SSD drives)
Pickle (fast)
All of them are very fast
P.S. It's important to know what kind of data (what dtypes) you are going to store, because it can affect the speed dramatically.
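A minimal caching pattern along those lines (a sketch: the file and column names are placeholders, Feather needs pyarrow, and the HDF5 alternative needs PyTables):

import os
import pandas as pd

CACHE = "data.feather"                      # or "data.h5" / "data.pkl"

if os.path.exists(CACHE):
    df = pd.read_feather(CACHE)             # fast binary load on every later run
else:
    df = pd.read_csv("data.csv", dtype={"label": "int32"})   # pin dtypes up front
    df.to_feather(CACHE)                    # alternatives: df.to_hdf("data.h5", key="df") or df.to_pickle("data.pkl")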
I have a big file (say 20 GB) stored in HDF5 format. The file is basically a set of 3D coordinates that evolve over time (a molecular simulation trajectory): essentially an array of shape (8000 (frames), 50000 (particles), 3 (coordinates)).
In regular Python I would simply load the HDF5 data file using h5py or PyTables and index it as if it were a NumPy array (the library lazily loads whatever data it needs).
However, if I try to load this file in Spark using SparkContext.parallelize, it obviously clogs the memory:
sc.parallelize(data, 10)
How can I handle this problem? Is there a preferred data format for huge arrays? Can I make the RDD be written to disk without passing through memory?
Spark (and Hadoop) doesn't have support for reading parts of HDF5 binary files. (I suspect the reason is that HDF5 is a container format for storing documents, and it allows you to specify a tree-like hierarchy for the documents.)
But if you need to read the file from local disk, it is doable with Spark, especially if you know the internal structure of your HDF5 file.
Here is an example - it assumes that you'll run a local Spark job, and that you know in advance that your HDF5 dataset '/mydata' consists of 100 chunks.
import h5py as h5

h5file_path = "/absolute/path/to/file"

def readchunk(v):
    f = h5.File(h5file_path, 'r')
    return f['/mydata'][v, :]

foo = sc.parallelize(range(0, 100)).map(lambda v: readchunk(v))
foo.count()
Going further, you can modify the program to detect the number of chunks using f5['/mydata'].shape[0].
The next step would be to iterate over multiple datasets (you can list the datasets with f5.keys()).
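For example, here is a sketch of that generalization (it assumes the same local `sc` SparkContext as above and that every top-level key is a 2D dataset; the layout is hypothetical):

import h5py as h5

h5file_path = "/absolute/path/to/file"

# on the driver: discover the datasets and how many chunks each one has
with h5.File(h5file_path, "r") as f5:
    names = list(f5.keys())
    nchunks = {name: f5[name].shape[0] for name in names}

def readchunk(name, v):
    f = h5.File(h5file_path, "r")
    return f[name][v, :]

# one RDD per dataset; bind `name` as a default argument to avoid late binding in the lambda
rdds = {
    name: sc.parallelize(range(nchunks[name])).map(lambda v, n=name: readchunk(n, v))
    for name in names
}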
There is also another article, "From HDF5 Datasets to Apache Spark RDDs", that describes a similar approach.
The same approach would work on a distributed cluster, but it gets a little inefficient. h5py requires the file to be on a local file system, so this can be achieved in several ways: copy the file to all workers and keep it in the same location on each worker's disk, or put the file on HDFS and mount HDFS with FUSE so the workers can access it. Both ways have some inefficiencies, but it should be good enough for ad-hoc tasks.
Here is an optimized version that opens the HDF5 file only once on every executor:
import h5py as h5

h5file_path = "/absolute/path/to/file"
_h5file = None

def readchunk(v):
    # The code below runs on the executor - in another Python process, possibly on a remote server.
    # The original value of _h5file (None) is sent from the driver, and on the executor it is
    # replaced with an h5.File object the first time `readchunk` is called.
    global _h5file
    if _h5file is None:
        _h5file = h5.File(h5file_path, 'r')
    return _h5file['/mydata'][v, :]

foo = sc.parallelize(range(0, 100)).map(lambda v: readchunk(v))
foo.count()