How can one write lock a zarr store during append? - python

Is there some way to lock a zarr store when using append?
I have already found out the hard way that using append with multiple processes is a bad idea (the batches to append aren't aligned with the batch size of the store). The reason I'd like to use multiple processes is because I need to transform the original arrays before appending them to the zarr store. It would be nice to be able to block other processes from writing concurrently but still perform the transformations in parallel, then append their data in series.
Edit:
Thanks to jdehesa's suggestion, I became aware of the synchronization part of the documentation. I passed a ProcessSynchronizer pointing to a folder on disk to my array at creation in the main thread, then spawned a bunch of worker processes with concurrent.futures and passed the array to all the workers for them to append their results. I could see that the ProcessSynchronizer did something, as the folder I pointed it to filled with files, but the array that my workers write to ended up missing rows (compared to when written from a single process).
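For reference, a minimal sketch of the setup described in the edit, assuming the zarr v2 API; the store and lock-folder paths (example.zarr, example.sync) are placeholders:

import numpy as np
import zarr

# File-based locks live in the given folder; share the same path across processes.
synchronizer = zarr.ProcessSynchronizer('example.sync')

# Open (or create) the array with the synchronizer attached.
z = zarr.open(
    'example.zarr', mode='a',
    shape=(0, 10), chunks=(1000, 10), dtype='f8',
    synchronizer=synchronizer,
)

# Each worker appends its transformed batch.
z.append(np.random.rand(500, 10))

As noted above, this alone did not prevent lost rows in practice, so serializing the appends in a single process (or behind one explicit inter-process lock) remains the safer route.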

Related

How to dynamically share input data to parallel processes in Python

I have several parallel processes working on the same data. The data consists of 100,000+ arrays (all stored in an HDF5 file, 90,000 values per array).
For now, each process accesses the data individually, and it works well since the HDF5 file supports concurrent reading... but only up to a certain number of parallel processes. Above 14-16 processes accessing the data, I see a drop in efficiency. I was expecting this (too many I/O operations on the same file, I reckon), but I don't know how to correct the problem properly.
Since the processes all use the same data, the best would be for the main process to read the file, load the array (or a batch of arrays), and feed it to the running parallel processes, without needing to stop them. A kind of dynamic shared memory if you will.
Is there any way to do it properly and solve my scalability issue?
I use the native "multiprocessing" Python library.
Thanks,
Victor
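One common way to get the "single reader feeding the workers" layout described above is a bounded queue; a rough sketch, where the file name, dataset name and batch size are all made up:

import h5py
import multiprocessing as mp

def reader(queue, n_workers, path='data.h5', batch_size=64):
    # The only process that touches the HDF5 file.
    with h5py.File(path, 'r') as f:
        dset = f['arrays']                      # e.g. shape (100000+, 90000)
        for start in range(0, dset.shape[0], batch_size):
            queue.put(dset[start:start + batch_size])
    for _ in range(n_workers):
        queue.put(None)                         # sentinel: no more data

def worker(queue):
    while True:
        batch = queue.get()
        if batch is None:
            break
        # ... process the batch here ...

if __name__ == '__main__':
    n_workers = 8
    q = mp.Queue(maxsize=4)                     # bounded so the reader cannot run far ahead
    procs = [mp.Process(target=worker, args=(q,)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    reader(q, n_workers)
    for p in procs:
        p.join()

The trade-off is that every batch is pickled through the queue, so this addresses the "too many readers on one file" problem rather than copy overhead.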

Fastest way to share a very large dict between multiple processes without copying

TL;DR: How can I share a large (200MB), read-only dict that is accessed VERY heavily between multiple processes in a performant way, without each process having a full copy in memory?
EDIT: It looks like if I just pass the dictionary as the argument for the multiprocessing.Pool/Process, it won't actually create a copy unless a worker modifies the dictionary. I just assumed it would copy. This behavior seems to be Unix only where fork is available and even then not always. But if so, it should solve my problem until this is converted to an ETL job.
What I'm trying to do:
I have a task to improve a script that replicates data from one store to another, normalizing and transforming the data on the way. This task works at the scale of around 100 million documents coming from the source document store that get rolled up and pushed to a destination document store.
Each document has an ID, and there is another document store that is essentially a key-value store mapping those IDs to some additional information needed for this task. This store is a lot smaller, and querying it while documents from the main store come through is not really an option without heavy caching, and that heavy cache very quickly ends up being a copy of the whole thing. So I just build the whole dictionary from that entire store at the beginning, before starting anything, and use that. That dictionary is around ~200MB in size. Note that this dictionary is only ever read from.
For this I have set up multiprocessing with around 30 concurrent processes. I've divided the work so that each process hits different indices, and the whole thing can be done in around 4 hours.
I have noticed that I am extremely CPU bound when doing the following 2 things:
Using a thread pool/threads (what I'm currently doing) so each thread can access the dict without issue. The GIL is killing me: I have one process maxed out at 100% all the time while the other CPUs sit idle. Switching to PyPy helped a lot, but I'm still not happy with this approach.
Creating a multiprocessing.Manager().dict() for the large dict and having the child processes access it through that. The server process that this approach creates is constantly at 100% CPU. I don't know why, as I only ever read from this dictionary, so I doubt it's a locking issue. I don't know how the Manager works internally, but I'm guessing that the child processes connect via pipes/sockets for each fetch and the overhead of this is massive. If true, it also suggests that using Redis/Memcached would have the same problem. Maybe it can be configured better?
I am memory bound when doing the following:
Using a SharedMemory view. You can't seem to do this for dicts like I need to. I can serialize the dict to get it into the shared view, but for it to be usable in a child process you need to deserialize it back into an actual dict, which creates the copy in that process.
I strongly suspect that unless I've missed something I'm just going to have to "download more ram" or rewrite from Python into something without a GIL (or use ETL like it should be done in...).
In the case of RAM, what is the most efficient way to store a dict like this to make it sting less? It's currently a standard dict mapping IDs to a tuple of the extra information, consisting of three longs/floats.
doc_to_docinfo = {
    "ID1": (5.2, 3.0, 455),
}
Are there any more efficient hashmap implementations for this use case than what I'm doing?
You seem to have a similar problem to the one I have. It is possible to use my source here to create a partitioning of those dictionary keys per thread. My suggestion: split the document IDs into partitions of length 3 or 4, keep the partition table in sync across all processes/threads, and then just route the parts of your documents to each process/thread; as an entry point, the process does a dictionary lookup and finds out which process can handle that part of the dictionary. If you are clever with balancing the partitions, you could also keep an equal number of documents managed per thread.
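A rough sketch of the partitioning idea suggested above; the prefix length and worker count are arbitrary assumptions, and crc32 is used only to get a mapping that is stable across processes:

import zlib
from collections import defaultdict

N_WORKERS = 30        # assumption: matches the ~30 processes mentioned above
PREFIX_LEN = 3        # assumption: partition on the first 3 characters of the ID

def partition_of(doc_id):
    # Deterministic mapping from an ID prefix to a worker index.
    return zlib.crc32(doc_id[:PREFIX_LEN].encode()) % N_WORKERS

def split_dict(doc_to_docinfo):
    # One sub-dict per worker, so each process only holds its share of the ~200MB.
    parts = defaultdict(dict)
    for doc_id, info in doc_to_docinfo.items():
        parts[partition_of(doc_id)][doc_id] = info
    return parts

Documents would then be routed with the same partition_of function, so each worker only ever needs its own slice of the lookup table.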

Python - How to share large dataset among multiple processes?

I need to read a large dataset (about 25GB of images) into memory and read it from multiple processes. None of the processes has to write, only read. All the processes are started using Python's multiprocessing module, so they have the same parent process. They train different models on the data and run independently of each other. The reason why I want to read it only one time rather than in each process is that the memory on the machine is limited.
I have tried using Redis, but unfortunately it is extremely slow when many processes read from it. Is there another option to do this?
Is it maybe somehow possible to have another process that only serves as a "get the image with ID x" function? What Python module would be suited for this? Otherwise, I was thinking about implementing a small webserver using werkzeug or Flask, but I am not sure if that would then become my new bottleneck...
Another possibility that came to my mind was to use threads instead of processes, but since Python is not really doing "real" multithreading, this would probably become my new bottleneck.
If you are on Linux and the content is read-only, you can use the Linux fork inheritance mechanism.
From the multiprocessing documentation:
Better to inherit than pickle/unpickle
When using the spawn or forkserver start methods many types from multiprocessing need to be picklable so that child processes can use them. However, one should generally avoid sending shared objects to other processes using pipes or queues. Instead you should arrange the program so that a process which needs access to a shared resource created elsewhere can inherit it from an ancestor process.
which means:
Before you fork your child processes, prepare your big data in a module-level variable (global to all the functions).
Then, in the same module, run your children with multiprocessing using the 'fork' start method (set_start_method('fork')).
Using this, the sub-processes will see this variable without copying it. This happens because the Linux forking mechanism creates child processes with the same memory mapping as the parent (see "copy-on-write").
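A minimal sketch of that inherit-instead-of-pickle pattern; the module-level array is a small stand-in for the real dataset:

import multiprocessing as mp
import numpy as np

# Module-level ("global") data, prepared before the workers are forked.
BIG_DATA = np.zeros((100, 256, 256, 3), dtype=np.uint8)   # stand-in for the 25GB of images

def train(model_id):
    # Children see BIG_DATA through copy-on-write: no pickling, and no copy
    # as long as the array is only read.
    return model_id, float(BIG_DATA.mean())

if __name__ == '__main__':
    mp.set_start_method('fork')      # requires a Unix fork; not available on Windows
    with mp.Pool(processes=4) as pool:
        print(pool.map(train, range(4)))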
I'd suggest mmapping the files; that way they can be shared across multiple processes as well as swapped in and out as appropriate.
The details of this would depend on what you mean by "25GB of images" and how the models want to access the images.
The basic idea would be to preprocess the images into an appropriate format (e.g. one big 4D uint8 numpy array, or maybe several smaller ones, with indices like (image, row, column, channel)) and save them in a format where they can be used efficiently by the models. See numpy.memmap for some examples of this.
I'd suggest preprocessing the files into a useful format "offline", i.e. not as part of the model training but in a separate program that is run first, as this would probably take a while and you'd probably not want to do it every time.
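A minimal sketch of the memmap approach, assuming the images have already been decoded into one big uint8 array; the path and shape are placeholders:

import numpy as np

shape = (1000, 224, 224, 3)    # (image, row, column, channel), hypothetical

# Offline preprocessing step: write the images into one memory-mapped file.
images = np.memmap('images.dat', dtype=np.uint8, mode='w+', shape=shape)
# ... fill images[i] from the decoded image files ...
images.flush()

# In each training process: map the same file read-only. The OS page cache is
# shared between processes, so the data is not duplicated in RAM, and pages
# can be evicted and re-read as needed.
images_ro = np.memmap('images.dat', dtype=np.uint8, mode='r', shape=shape)
batch = images_ro[0:32]        # touches only the pages backing these rows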

Multiprocessing and numpy, online processing while avoiding copying

There are lots of questions about using multiprocessing with numpy and sharing arrays. But it seems to me like they all have the luxury of having the data available when the application starts which means it can be memory mapped without too much difficulty.
In contrast, I am trying to build a framework in which data is generated and then processed.
Basically I have a pipeline that looks something like this:
Source --> Filter --> Destination
   |          |            |
   |          |            |
   ------------------------------> Controller / GUI
The source emits new data, which in my case are images from e.g. a video stream (stored as numpy.ndarray instances). The filter does calculations on the data and the destination does further calculation.
The Controller/GUI is just to be able to show previews, current progress, etc.
My current design is to make Source, Filter and Destination multiprocessing.Process instances, and then I have multiprocessing.Queue instances that connect the processes.
But using Queues (or Pipes) for sharing data means the data is copied at each step. If possible, it would be nice to avoid these copies since I am quite sure (not measured yet though) that this lowers the performance.
Is there any reasonable way to avoid this?
Edit: random thoughts on a possible solution
I guess what I really want is some kind of shared memory pool where I can store images and then just pass references to the processes.
Example:
Source produces an image and stores it in the shared memory pool at position k.
Source sends "There is a new image at location k" to Filter
One of two things:
Filter decides that the image is bad and instructs the shared memory pool to remove the image at position k.
or Filter decides that the image is OK and sends "There is a filtered image at location k" to Destination.
I am not sure how difficult this would be to implement though, if anyone already has, or if it is indeed the best answer.
I'd like to hear your opinions.
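A rough sketch of the shared-memory-pool idea, using multiprocessing.shared_memory (Python 3.8+); the frame shape, the naming scheme and the clean-up protocol are assumptions, not a finished design:

import numpy as np
from multiprocessing import shared_memory

FRAME_SHAPE = (480, 640, 3)    # hypothetical frame size
FRAME_DTYPE = np.uint8

def put_frame(frame, k):
    # Source: store frame k in its own named block and return the name,
    # which is the lightweight "reference" that goes through the Queue.
    shm = shared_memory.SharedMemory(create=True, size=frame.nbytes, name=f'frame_{k}')
    view = np.ndarray(FRAME_SHAPE, dtype=FRAME_DTYPE, buffer=shm.buf)
    view[:] = frame
    del view                   # release the buffer before closing this handle
    shm.close()                # the block itself lives on until unlink()
    return f'frame_{k}'

def get_frame(name):
    # Filter/Destination: attach by name; the pixel data is not copied.
    shm = shared_memory.SharedMemory(name=name)
    view = np.ndarray(FRAME_SHAPE, dtype=FRAME_DTYPE, buffer=shm.buf)
    return view, shm           # keep shm alive for as long as the view is used

def drop_frame(name):
    # Remove a rejected or finished frame from the pool.
    shm = shared_memory.SharedMemory(name=name)
    shm.close()
    shm.unlink()

Only the frame names travel through the Queues; the pixel buffers stay in shared memory, which is essentially the pool described above.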
You might have success storing your numpy data in one or more multiprocessing.Managers. You'd still have to deal with the performance implications of serializing communication back and forth with the manager, but if you were careful, you could at least likely avoid having to pump the entire data structure across pickle/unpickle.
Edit:
Maybe you're looking at the problem the wrong way. Instead of trying to pass data from one processing entity to the next, create a worker for each data entity, then perform all the calculations for that entity in that one worker. Could you structure your code so that Source could just spawn a worker that would be responsible for actually generating and filtering the images, then have the worker notify Destination when it is done that new data is ready for presentation? You'd still have to serialize some sort of token from Source to the workers, and the final data from the workers to the Destination, but you'd be able to get rid of at least 1 or 2 handoffs, and might be able to figure out a more efficient way to serialize what's left, since you'd only have to serialize the part of the data relevant to the Destination.

updating a shelve dictionary in python parallely

I have a program that takes a huge input file and makes a dict out of it. Since there is no way this is going to fit in memory, I decided to use shelve to write it to disk. Now I need to take advantage of the multiple cores available in my system (8 of them) so that I can speed up the parsing. The most obvious way to do this, I thought, was to split my input file into 8 parts and run the code on all 8 parts concurrently. The problem is that I need only 1 dictionary in the end, not 8 of them. So how do I use shelve to update one single dictionary in parallel?
I gave a pretty detailed answer here on Processing single file from multiple processes in python
Don't try to figure out how you can have many processes write to a shelve at once. Think about how you can have a single process deliver results to the shelve.
The idea is that you have a single process producing the input to a queue. Then you have as many workers as you want receiving queued items and doing the work. When they are done, they place the result into a result queue for the sink to read. The benefit is that you do not have to manually split up your work ahead of time: just produce the "input" and let whatever worker is ready take it and work on it.
With this pattern, you can scale up or down the workers based on the system capabilities.
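A minimal sketch of that producer/worker/sink pattern where only one process ever touches the shelf; the input file name and the parsing step are placeholders:

import multiprocessing as mp
import shelve

def worker(in_q, out_q):
    for line in iter(in_q.get, None):          # None is the "no more work" sentinel
        key, value = line.split('\t', 1)       # placeholder parsing step
        out_q.put((key, value.strip()))
    out_q.put(None)                            # tell the sink this worker is done

def sink(out_q, n_workers, shelf_path='big_dict.shelf'):
    finished = 0
    with shelve.open(shelf_path) as shelf:     # single writer: no concurrent access
        while finished < n_workers:
            item = out_q.get()
            if item is None:
                finished += 1
            else:
                key, value = item
                shelf[key] = value

if __name__ == '__main__':
    n_workers = 8
    in_q, out_q = mp.Queue(maxsize=1000), mp.Queue()
    workers = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
    sink_proc = mp.Process(target=sink, args=(out_q, n_workers))
    for p in workers + [sink_proc]:
        p.start()
    with open('huge_input.txt') as f:          # the producer: the main process
        for line in f:
            in_q.put(line)
    for _ in range(n_workers):
        in_q.put(None)
    for p in workers + [sink_proc]:
        p.join()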
shelve doesn't support concurrent access. There are a few options for accomplishing what you want:
Make one shelf per process and then merge at the end.
Have worker processes send their results back to the master process over e.g. a multiprocessing.Pipe; the master then stores them in the shelf.
I think you can get bsddb to work with concurrent access in a shelve-like API, but I've never had the need to do so.
