I have several parallel processes working on the same data. The data consists of 100,000+ arrays (all stored in an HDF5 file, with 90,000 values per array).
For now, each process accesses the data individually, and it works well since the HDF5 file supports concurrent reading... but only up to a certain number of parallel processes. Above 14-16 processes accessing the data, I see a drop in efficiency. I was expecting this (too many I/O operations on the same file, I reckon), but I don't know how to fix the problem properly.
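For reference, each worker currently does something like this (a minimal sketch, not my real code; the file name, dataset names and process() are placeholders):

import h5py

def process(arr):
    pass  # placeholder for the actual computation on one array

def worker(indices):
    # every worker opens the same HDF5 file read-only and pulls its own arrays
    with h5py.File("data.h5", "r") as f:
        for i in indices:
            arr = f[f"array_{i}"][:]   # one array of ~90,000 values
            process(arr)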
Since the processes all use the same data, the best approach would be for the main process to read the file, load an array (or a batch of arrays), and feed it to the running parallel processes without stopping them. A kind of dynamic shared memory, if you will.
Is there any way to do it properly and solve my scalability issue?
I use Python's native "multiprocessing" library.
Thanks,
Victor
Related
Is it possible to read and run a data cleaning and transformation pipeline (all with Pandas) on 4 different CSV files in parallel (one per thread)? RAM would not be a problem because I would read and transform the data in chunks, then persist it to a DB. There would not be any shared state between the threads, and the threads would not block on the DB I/O of persisting DataFrame chunks because the inserts will be async using aiopg. I know how to write the code, but I am not very experienced with Python, and I just want to know whether the GIL would be relevant/problematic in this case and whether it wouldn't be faster to just do it sequentially. Thanks!
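Edit: for concreteness, this is roughly the structure I have in mind (file names and the clean() step are placeholders; the real persistence would go through aiopg):

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

FILES = ["a.csv", "b.csv", "c.csv", "d.csv"]   # placeholder file names

def clean(chunk: pd.DataFrame) -> pd.DataFrame:
    return chunk.dropna()   # placeholder for the real cleaning/transformation

def process_file(path: str) -> None:
    # read and transform in chunks so memory stays bounded
    for chunk in pd.read_csv(path, chunksize=100_000):
        transformed = clean(chunk)
        # placeholder: persist `transformed` to the DB (async via aiopg in practice)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(process_file, FILES))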
Is there some way to lock a zarr store when using append?
I have already found out the hard way that using append with multiple processes is a bad idea (the batches to append aren't aligned with the chunk size of the store). The reason I'd like to use multiple processes is that I need to transform the original arrays before appending them to the zarr store. It would be nice to be able to block other processes from writing concurrently but still perform the transformations in parallel, then append their data in series.
Edit:
Thanks to jdehesa's suggestion, I became aware of the synchronization part of the documentation. I passed a ProcessSynchronizer pointing to a folder on disk to my array at creation time in the main process, then spawned a bunch of worker processes with concurrent.futures and passed the array to all the workers so they could append their results. I could see that the ProcessSynchronizer did something, as the folder I pointed it to filled with files, but the array my workers wrote to ended up missing rows (compared to when it is written from a single process).
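Simplified, what I tried looks roughly like this (paths, shapes and the transform are placeholders; in my real code the array is created once in the main process and handed to the workers):

import zarr
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def transform_and_append(batch):
    sync = zarr.ProcessSynchronizer("results.sync")        # file-based locks on disk
    z = zarr.open_array("results.zarr", mode="a", synchronizer=sync)
    out = batch * 2.0            # placeholder for the real transformation
    z.append(out)                # rows end up missing when many processes do this

if __name__ == "__main__":
    sync = zarr.ProcessSynchronizer("results.sync")
    zarr.open_array("results.zarr", mode="w", shape=(0, 128),
                    chunks=(1024, 128), dtype="f4", synchronizer=sync)
    batches = [np.random.rand(500, 128).astype("f4") for _ in range(8)]
    with ProcessPoolExecutor() as pool:
        list(pool.map(transform_and_append, batches))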
I need to read a large dataset (about 25GB of images) into memory and read it from multiple processes. None of the processes has to write, only read. All the processes are started using Python's multiprocessing module, so they have the same parent process. They train different models on the data and run independently of each other. The reason why I want to read it only one time rather than in each process is that the memory on the machine is limited.
I have tried using Redis, but unfortunately it is extremely slow when many processes read from it. Is there another option to do this?
Is it maybe somehow possible to have another process that only serves as a "get the image with ID x" function? What Python module would be suited for this? Otherwise, I was thinking about implementign a small webserver using werkzeug or Flask, but I am not sure if that would become my new bottleneck then...
Another possibility that came to mind was to use threads instead of processes, but since Python does not do "real" multithreading (because of the GIL), this would probably become my new bottleneck.
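The kind of "image server" process I have in mind would look something like this (all names here are made up, and I don't know whether a Manager is the right tool):

from multiprocessing.managers import BaseManager
import numpy as np

class ImageStore:
    def __init__(self):
        # placeholder: load the ~25GB of images once, keyed by ID
        self.images = {i: np.zeros((256, 256, 3), dtype=np.uint8) for i in range(100)}

    def get_image(self, image_id):
        return self.images[image_id]

class StoreManager(BaseManager):
    pass

StoreManager.register("ImageStore", ImageStore)

if __name__ == "__main__":
    manager = StoreManager()
    manager.start()                  # the store lives in the manager's process
    store = manager.ImageStore()     # proxy; can be handed to worker processes
    img = store.get_image(42)        # every call is pickled across a pipe/socket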
If you are on Linux and the content is read-only, you can use the Linux fork inheritance mechanism.
From the multiprocessing documentation:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from
multiprocessing need to be picklable so that child processes can use
them. However, one should generally avoid sending shared objects to
other processes using pipes or queues. Instead you should arrange the
program so that a process which needs access to a shared resource
created elsewhere can inherit it from an ancestor process.
which means:
Before you fork your child processes, prepare your big data in a module-level variable (global to all the functions).
Then, in the same module, run your children with multiprocessing in 'fork' mode (set_start_method('fork')).
With this, the subprocesses will see the variable without copying it. This happens because the Linux fork mechanism creates child processes with the same memory mapping as the parent (see "copy-on-write").
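A minimal sketch of that idea (Linux only; the data file, its shape and the worker logic are made-up placeholders):

import multiprocessing as mp
import numpy as np

# module-level "global": loaded once in the parent before forking
DATASET = np.load("images.npy")       # e.g. one big (N, H, W, C) uint8 array

def train_model(model_id):
    # the child sees DATASET via copy-on-write; nothing is pickled or copied
    subset = DATASET[model_id::4]
    return float(subset.mean())       # placeholder for the real training

if __name__ == "__main__":
    mp.set_start_method("fork")       # the default on Linux, made explicit here
    with mp.Pool(processes=4) as pool:
        print(pool.map(train_model, range(4)))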
I'd suggest mmapping the files; that way they can be shared across multiple processes as well as swapped in/out as appropriate.
The details of this would depend on what you mean by "25GB of images" and how these models want to access the images.
The basic idea would be to preprocess the images into an appropriate format (e.g. one big 4D uint8 numpy array, or maybe several smaller ones, with indices (image, row, column, channel)) and save them in a format where they can be efficiently used by the models. See numpy.memmap for some examples of this.
I'd suggest preprocessing the files into a useful format "offline", i.e. not as part of the model training but as a separate program that is run first, as it would probably take a while and you'd probably not want to do it every time.
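A rough sketch of that memmap idea (shapes, file names and the decode step are illustrative only):

import numpy as np

N_IMAGES, H, W, C = 50_000, 224, 224, 3

def load_and_decode(i):
    # placeholder: in reality, decode image i from disk into an (H, W, C) array
    return np.zeros((H, W, C), dtype=np.uint8)

def build_memmap():
    # offline preprocessing step, run once before training
    out = np.memmap("images.dat", dtype=np.uint8, mode="w+",
                    shape=(N_IMAGES, H, W, C))
    for i in range(N_IMAGES):
        out[i] = load_and_decode(i)
    out.flush()

def open_dataset():
    # inside each training process: read-only mapping, pages are shared by the OS
    return np.memmap("images.dat", dtype=np.uint8, mode="r",
                     shape=(N_IMAGES, H, W, C))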
I'm trying to load a ~67 GB dataframe (6,000,000 features by 2,300 rows) into Dask for machine learning. I'm using a 96-core machine on AWS that I wish to utilize for the actual machine learning bit. However, Dask loads CSVs in a single thread. It has already been running for a full 24 hours and it still hasn't loaded.
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# I tried to display a progress bar, but it is not implemented for dask's read_csv
pbar = ProgressBar()
pbar.register()

df = dd.read_csv('../Larger_than_the_average_CSV.csv')
Is there a faster way to load this into Dask and make it persistent? Should I switch to a different technology (Spark on Scala, or PySpark)?
Dask is probably still loading it, as I can see a steady 100% CPU utilization in top.
The code you show in the question probably takes no time at all, because you are not actually loading anything, just setting up the job prescription. How long the actual loading takes will depend on the blocksize you specify.
There are two main bottlenecks to consider for the actual loading:
getting the raw data from disk into memory (data transfer over a single disk interface), and
parsing that data into in-memory objects.
There is not much you can do about the former if you are on a local disk, and you would expect it to be a small fraction of the total time.
The latter may suffer from the GIL, even though Dask will execute in multiple threads by default (which is why it may appear that only one thread is being used). You would do well to read the Dask documentation about the different schedulers, and should try the distributed scheduler, even though you are on a single machine, with a mix of processes and threads.
Finally, you probably don't want to "load" the data at all, but process it. Yes, you can persist it into memory with Dask if you wish (dask.persist, funnily enough), but please do not use many workers to load the data only to then turn it into a Pandas dataframe in your client process's memory.
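For example, a rough sketch of that setup (the worker counts and blocksize are just numbers to tune, not recommendations):

import dask.dataframe as dd
from dask.distributed import Client

if __name__ == "__main__":
    # local "cluster": several worker processes side-step the GIL during parsing
    client = Client(n_workers=8, threads_per_worker=2)

    df = dd.read_csv('../Larger_than_the_average_CSV.csv', blocksize="256MB")
    df = df.persist()                 # materialise the partitions in worker memory
    print(df.shape[0].compute())      # further computations now reuse the loaded data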
Does hdf5 support parallel writes to the same file, from different threads or from different processes? Alternatively, does hdf5 support non-blocking writes?
If so then is this also supported by NetCDF4, and by the python bindings for either?
I am writing an application where I want different CPU cores to concurrently compute output intended for non-overlapping tiles of a very large output array. (Later I will want to read sections from it as a single array, without needing my own driver to manage indexing many separate files, and ideally without the additional IO task of rearranging it on disk.)
Not trivially, but there are various potential workarounds.
The ordinary HDF5 library apparently does not even support concurrent reading of different files by multiple threads. Consequently NetCDF4, and the Python bindings for either, will not support parallel writing.
If the output file is pre-initialised and has chunking and compression disabled, to avoid having a chunk index, then (in principle) concurrent non-overlapping writes to the same file by separate processes might work(?).
In more recent versions of HDF5, there should be support for virtual datasets. Each process would write its output to a different file, and afterwards a new container file would be created, consisting of references to the individual data files (but otherwise readable like a normal HDF5 file); see the sketch further below.
There exists a "Parallel HDF5" library for MPI. Although MPI might otherwise seem like overkill, it would have advantages if scaling up later to multiple machines.
If writing output is not a performance bottleneck, a multithreaded application could probably implement one output thread (utilising some form of queue data-structure).
[Edit:] Another option is to use the zarr format instead, which places each chunk in a separate file (an approach that future versions of HDF5 currently seem likely to adopt).
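To illustrate the virtual-dataset option, here is a rough h5py sketch (the tile layout and file names are made up, and it assumes each worker has already written its tile to its own file):

import h5py

TILE_SHAPE = (1000, 1000)
N_TILES = 4

# combined layout covering all tiles stacked along the first axis
layout = h5py.VirtualLayout(shape=(N_TILES * TILE_SHAPE[0], TILE_SHAPE[1]),
                            dtype="f8")
for i in range(N_TILES):
    vsource = h5py.VirtualSource(f"tile_{i}.h5", "data", shape=TILE_SHAPE)
    layout[i * TILE_SHAPE[0]:(i + 1) * TILE_SHAPE[0], :] = vsource

with h5py.File("combined.h5", "w") as f:
    # readers see one big dataset; the data itself stays in the per-process files
    f.create_virtual_dataset("data", layout)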
If you are running in AWS, check out HDF Cloud: https://www.hdfgroup.org/solutions/hdf-cloud.
This is a service that enables multiple reader/multiple writer workflows and is largely feature compatible with the HDF5 library.
The client SDK doesn't support non-blocking writes, but of course if you are using the REST API directly you could do non-blocking I/O just like you would with any http-based service.