There are lots of questions about using multiprocessing with numpy and sharing arrays. But it seems to me like they all have the luxury of having the data available when the application starts which means it can be memory mapped without too much difficulty.
In contrast, I am trying to build a framework in which data is generated and then processed.
Basically I have a pipeline that looks something like this
Source --> Filter --> Destination
  |          |             |
  |          |             |
  ------------------------------> Controller / GUI
The source emits new data, which in my case are images from e.g. a video stream (stored as numpy.ndarray instances). The filter does calculations on the data and the destination does further calculation.
The Controller/GUI is just to be able to show previews, current progress, etc.
My current design is to make Source, Filter and Destination multiprocessing.Process instances, connected by multiprocessing.Queue instances.
But using Queues (or Pipes) for sharing data means the data is copied at each step. If possible, it would be nice to avoid these copies since I am quite sure (not measured yet though) that this lowers the performance.
Is there any reasonable way to avoid this?
Edit, random thoughts on possible solution
I guess what I really want is some kind of shared memory pool where I can store images and then just pass references to the processes.
Example:
Source produces an image and stores it in the shared memory pool at position k.
Source sends "There is a new image at location k" to Filter
One of two things:
Filter decides that the image is bad and instructs the shared memory pool to remove the image at position k.
or Filter decides that the image is ok and sends the "There is a filtered image at location k" to Destination.
I am not sure how difficult this would be to implement, whether anyone has already done it, or whether it is indeed the best approach.
I'd like to hear your opinions.
You might have success storing your numpy data in one or more multiprocessing.Managers. You'd still have to deal with the performance implications of serializing communication back and forth with the manager, but if you were careful, you could at least likely avoid having to pump the entire data structure across pickle/unpickle.
Edit:
Maybe you're looking at the problem the wrong way. Instead of trying to pass data from one processing entity to the next, create a worker for each data entity, then perform all the calculations for that entity in that one worker. Could you structure your code so that Source could just spawn a worker that would be responsible for actually generating and filtering the images, then have the worker notify Destination when it is done that new data is ready for presentation? You'd still have to serialize some sort of token from Source to the workers, and the final data from the workers to the Destination, but you'd be able to get rid of at least 1 or 2 handoffs, and might be able to figure out a more efficient way to serialize what's left, since you'd only have to serialize the part of the data relevant to the Destination.
Related
TL;DR: How to share a large (200MB) read only dict between multiple processes in a performant way, that is accessed VERY heavily without each process having a full copy in memory.
EDIT: It looks like if I just pass the dictionary as the argument for the multiprocessing.Pool/Process, it won't actually create a copy unless a worker modifies the dictionary. I just assumed it would copy. This behavior seems to be Unix only where fork is available and even then not always. But if so, it should solve my problem until this is converted to an ETL job.
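A minimal sketch of the fork-based sharing described in the edit above, assuming a Unix platform where the fork start method is available. The names DOC_TO_DOCINFO, lookup and lookup_many are hypothetical. Keep in mind that CPython updates reference counts even when objects are only read, which writes to the memory pages holding them, so sharing may be partial in practice.

```python
import multiprocessing as mp

# Built once in the parent before any worker starts; workers only read it.
DOC_TO_DOCINFO = {"ID1": (5.2, 3.0, 455)}


def lookup(doc_id):
    """Read from the module-level dict; returns None for unknown IDs."""
    return DOC_TO_DOCINFO.get(doc_id)


def lookup_many(doc_ids, processes=4):
    """Fan lookups out to forked workers that inherit DOC_TO_DOCINFO.

    Assumes the 'fork' start method (Unix only): children start with the
    parent's memory pages mapped copy-on-write, so the dict is not pickled.
    """
    ctx = mp.get_context("fork")
    with ctx.Pool(processes) as pool:
        return pool.map(lookup, doc_ids)
```

Only the document IDs and the looked-up tuples cross the process boundary here; the dict itself is never sent to the workers.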
What I'm trying to do:
I have a task to improve a script that replicates data from one store to another. Normalizing and transforming the data on the way. This task works on the scale of around 100 million documents coming from the source document store that get rolled up and pushed to another destination document store.
Each document has an ID, and there is another document store that is essentially a key-value store mapping those IDs to some additional information needed for this task. This store is a lot smaller, and querying it while documents from the main store come through is not really an option without heavy caching, and that heavy cache ends up being a copy of the whole thing very quickly. So I just build the whole dictionary from that entire store at the beginning, before starting anything, and use that. The dictionary is around 200MB in size. Note that this dictionary is only ever read from.
For this I have set up multiprocessing and have around 30 concurrent processes. I've divided the work so that each process hits different indices, and the whole thing finishes in around 4 hours.
I have noticed that I am extremely CPU bound when doing the following 2 things:
Using a thread pool/threads (what I'm currently doing) so each thread can access the dict without issue. The GIL is killing me: one process is maxed out at 100% all the time while the other CPUs sit idle. Switching to PyPy helped a lot, but I'm still not happy with this approach.
Creating a multiprocessing.Manager().dict() for the large dict and having the child processes access it through that. The server process that this approach creates is constantly at 100% CPU. I don't know why, as I only ever read from this dictionary, so I doubt it's a locking thing. I don't know how the Manager works internally, but I'm guessing that the child processes connect via pipes/sockets for each fetch and that the overhead of this is massive. If true, it also suggests that using Redis/Memcached would have the same problem. Maybe it can be configured better?
I am Memory bound when doing these things:
Using a SharedMemory view. You can't seem to do this for dicts the way I need to. I can serialize the dict to get it into the shared view, but for it to be usable in the child process you need to deserialize the data back into an actual dict, which creates a copy in that process.
I strongly suspect that unless I've missed something I'm just going to have to "download more ram" or rewrite from Python into something without a GIL (or use ETL like it should be done in...).
In the case of ram, what is the most efficient way to store a dict like this to make it sting less? It's currently a standard dict mapped to a tuple of the extra information consisting of 3 long/float.
doc_to_docinfo = {
    "ID1": (5.2, 3.0, 455),
}
Are there any more efficient hashmap implementations for this use case than what I'm doing?
You seem to have a problem similar to mine. It is possible to use my source here to create a partitioning of the dictionary keys per thread. My suggestion: split the document IDs into partitions of length 3 or 4, keep the partition table in sync across all processes/threads, and then just move the relevant parts of your documents to each process/thread. As an entry point, a process does a lookup in the partition table to find out which process handles that part of the dictionary. If you are clever with balancing the partitions, each thread could also end up managing an equal number of documents.
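One way the partitioning idea could look in code. This is only a sketch: hash-based partitioning with CRC32 is my assumption, since the answer does not spell out the scheme, and partition_for and split_table are hypothetical names.

```python
import zlib


def partition_for(doc_id, num_partitions):
    """Map a document ID to a partition index that is stable across processes."""
    # zlib.crc32 is used instead of hash() because str hashes are randomized
    # per interpreter process unless PYTHONHASHSEED is fixed.
    return zlib.crc32(doc_id.encode("utf-8")) % num_partitions


def split_table(doc_to_docinfo, num_partitions):
    """Split one big dict into num_partitions smaller dicts, one per worker."""
    parts = [{} for _ in range(num_partitions)]
    for doc_id, info in doc_to_docinfo.items():
        parts[partition_for(doc_id, num_partitions)][doc_id] = info
    return parts
```

Each worker would then hold only its own slice of the table, and whoever dispatches documents calls partition_for to decide which worker receives a given ID.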
I intend to build a program structure like the one below.
PS1 is a Python program that runs persistently. PC1, PC2, PC3 are client Python programs. PS1 holds a hashtable variable; whenever PC1, PC2, ... ask for the hashtable, PS1 passes it to them.
The intention is to keep the table in memory, since it is a huge variable (it takes 10G of memory) and it is expensive to calculate every time. It is not feasible to store it on the hard disk (using pickle or json) and read it every time it is needed. The read just takes too long.
So I was wondering if there is a way to keep a python variable persistently in the memory, so it can be used very fast whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance concerns do not allow simply reading the full set from permanent storage
IMHO, this is exactly what databases were created for. For common use cases, having many processes all using their own copy of a 10G object is a waste of memory, and the common way is that one single process has the data and the others send it requests for the data. You did not describe your problem in enough detail, so I cannot say whether the best solution will be:
a SQL database like PostgreSQL or MariaDB - as they can cache, if you have enough memory, all will be held automatically in memory
a NoSQL database (MongoDB, etc.) if your only (or main) need is single-key access - very nice when dealing with a lot of data requiring fast but simple access
a dedicated server using a dedicated query language if your needs are very specific and none of the above solutions meets them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database and the number of client process and the external load of the whole system never increase to a level where you fall in the swapping problem above
TL/DR: My advice is to first measure the performance of a good-quality database, optionally with a dedicated cache. Those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work, carefully analyze the memory requirements, be sure to document the limits on the number of client processes and the database size for future maintenance, and use shared memory; read-only data is a hint that shared memory can be a good solution.
In short, to accomplish what you are asking about, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class that provides the hashtable interface for accessing individual entries; an instance can be passed separately to each of the PC# processes, which then read from the shared RawArray.
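A rough sketch of such a wrapper class, under simplifying assumptions that are mine rather than the answer's: keys are 64-bit integers, values are floats, and lookups use binary search over sorted keys instead of real hashing. SharedTable is a hypothetical name.

```python
import bisect
import ctypes
from multiprocessing.sharedctypes import RawArray


class SharedTable:
    """Read-only int -> float table stored in two shared RawArrays."""

    def __init__(self, mapping):
        items = sorted(mapping.items())
        # RawArray allocates from shared memory and has no lock, which is
        # fine here because the table is never written after construction.
        self._keys = RawArray(ctypes.c_int64, [k for k, _ in items])
        self._values = RawArray(ctypes.c_double, [v for _, v in items])

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, key):
        # Binary search over the sorted keys, O(log n) per lookup.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        raise KeyError(key)
```

An instance would be built once in PS1 and handed to each client as a multiprocessing.Process argument at launch, so the clients read the same underlying memory rather than receiving a copy of the data.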
I have some objects that I need to share between two Python processes on the same machine. I use python-memcached to store the objects in one process and use them in the other. However, even though they can be retrieved in the process that wants to use them, calling their methods returns nothing (not the expected result).
Could someone provide some input on what happens in the scenario described:
- is the object passed in its entirety, with both processes sharing its state, or is only a copy passed, so that changes made by one process are not visible in the other?
- what other techniques / libraries could be employed to share objects between Python processes?
Thanks
Memcached stores string values and not rich objects as you describe.
What you describe sounds like a publisher-subscriber application - one process produces data and the other process acts on it.
Memcached is not the right tool for pub/sub. You need a list that you can atomically add to and pop from. While you can store a string representation of a list in Memcached, you cannot add and remove elements from this list atomically amongst processes.
Have a look at Redis. It has rich data structures that are ideal for pub/sub and mature Python clients to use.
For a simple implementation you could have your one process add (LPUSH) data to a list. The other process could pop (RPOP) data off this list and run a method on it. This would give you a good IPC queue that scales well.
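A small sketch of that LPUSH/RPOP queue. The client argument is assumed to be a redis-py client such as redis.Redis(); it is passed in rather than created here so the functions stay independent of any server. Payloads are plain data encoded as JSON (my choice), which also reflects the point that only values, not live objects with methods, travel between the processes.

```python
import json


def enqueue(client, queue_name, payload):
    """Producer: serialize a plain-data payload and LPUSH it onto the list."""
    client.lpush(queue_name, json.dumps(payload))


def dequeue(client, queue_name):
    """Consumer: RPOP the oldest item, or return None if the list is empty."""
    raw = client.rpop(queue_name)
    return None if raw is None else json.loads(raw)
```

Pushing on the left and popping on the right gives first-in, first-out order. The consumer rebuilds whatever object it needs from the decoded data and calls methods on its own local instance.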
If you need more sophistication then look at the Redis pub/sub docs - it has a messaging system that eliminates the need to poll a queue for new values.
I'm trying to find a reasonable approach in Python for a real-time application, multiprocessing and large files.
A parent process spawns 2 or more children. The first child reads data and keeps it in memory, and the others process it in a pipeline fashion. The data should be organized into an object, sent to the next process, processed, sent on, processed, and so on.
Available mechanisms such as Pipe, Queue and Managers do not seem adequate due to their overheads (serialization, etc.).
Is there an adequate approach for this?
I've used Celery and Redis for real-time multiprocessing in high memory applications, but it really depends on what you're trying to accomplish.
The biggest benefits I've found in Celery over built-in multiprocessing tools (Pipe/Queue) are:
Low overhead. You call a function directly, no need to serialize data.
Scaling. Need to ramp up worker processes? Just add more workers.
Transparency. Easy to inspect tasks/workers and find bottlenecks.
For really squeezing out performance, ZMQ is my go to. A lot more work to set up and fine-tune, but it's as close to bare sockets as you can safely get.
Disclaimer: This is all anecdotal. It really comes down to what your specific needs are. I'd benchmark different options with sample data before you go down any path.
First, a suspicion that message-passing may be inadequate because of all the overhead is not a good reason to overcomplicate your program. It's a good reason to build a proof of concept and come up with some sample data and start testing. If you're spending 80% of your time pickling things or pushing stuff through queues, then yes, that's probably going to be a problem in your real life code—assuming the amount of work your proof of concept does is reasonably comparable to your real code. But if you're spending 98% of your time doing the real work, then there is no problem to solve. Message passing will be simpler, so just use it.
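A proof-of-concept measurement along those lines might start like this. It is a sketch only: the function names are hypothetical, and it times just the pickle round trip, which is a lower bound on what a multiprocessing.Queue costs per item because the pipe transfer is not included.

```python
import pickle
import time


def pickle_roundtrip_seconds(obj, repeats=10):
    """Average time to pickle and unpickle obj once."""
    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(pickle.dumps(obj))
    return (time.perf_counter() - start) / repeats


def serialization_share(obj, work, repeats=10):
    """Fraction of per-item time spent on the pickle round trip versus work(obj)."""
    ser = pickle_roundtrip_seconds(obj, repeats)
    start = time.perf_counter()
    for _ in range(repeats):
        work(obj)
    busy = (time.perf_counter() - start) / repeats
    total = ser + busy
    return ser / total if total else 0.0
```

Calling serialization_share(sample_item, real_processing_function) with representative data gives a first estimate of whether the pickling share is closer to the 80% case or the 2% case described above.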
Also, even if you do identify a problem here, that doesn't mean that you have to abandon message passing; it may just be a problem with what's built in to multiprocessing. Technologies like 0MQ and Celery may have lower overhead than a simple queue. Even being more careful about what you send over the queue can make a huge difference.
But if message passing is out, the obvious alternative is data sharing. This is explained pretty well in the multiprocessing docs, along with the pros and cons of each.
The "Sharing state between processes" section describes the basics of how to do it. There are other alternatives, like using mmapped files or platform-specific shared memory APIs, but there's not much reason to do that over multiprocessing unless you need, e.g., persistent storage between runs.
There are two big problems to deal with, but both can be dealt with.
First, you can't share Python objects, only simple values. Python objects have internal references to each other all over the place, the garbage collector can't see references to objects in other processes' heaps, and so on. So multiprocessing.Value can only hold the same basic kinds of native values as array.array, and multiprocessing.Array can hold (as you'd guess by the name) 1D arrays of the same values, and that's it. For anything more complicated, if you can define it in terms of a ctypes.Structure, you can use multiprocessing.sharedctypes (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.sharedctypes), but this still means that any references between objects have to be indirect. (For example, you often have to store indices into an array.) (Of course none of this is bad news if you're using NumPy, because you're probably already storing most of your data in NumPy arrays of simple values, which are sharable.)
Second, shared data are of course subject to race conditions. And, unlike multithreading within a single process, you can't rely on the GIL to help protect you here; there are multiple interpreters that can all be trying to modify the same data at the same time. So you have to use locks or conditions to protect things.
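To illustrate the locking point, here is a minimal sketch using multiprocessing.Value and the lock that comes with it. The function names are hypothetical.

```python
import multiprocessing as mp


def add_many(counter, n):
    """Increment a shared counter n times, holding its lock for each update."""
    for _ in range(n):
        # counter.value += 1 is a read-modify-write and is not atomic on its
        # own, so two processes could otherwise overwrite each other's update.
        with counter.get_lock():
            counter.value += 1


def count_in_parallel(workers=4, n=1000):
    """Run add_many in several processes and return the final count."""
    counter = mp.Value("i", 0)  # shared C int, created with an associated lock
    procs = [mp.Process(target=add_many, args=(counter, n)) for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return counter.value
```

With the lock in place, count_in_parallel should return workers * n. Without the with block, updates can be lost when several processes interleave.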
For multiprocessing pipeline check out MPipe.
For shared memory (specifically NumPy arrays) check out numpy-sharedmem.
I've used these to do high-performance realtime, parallel image processing (average accumulation and face detection using OpenCV) while squeezing out all available resources from a multi-core CPU system. Check out Sherlock if interested. Hope this helps.
One option is to use something like brain-plasma that maintains a shared-memory object namespace that is independent of the Python process or thread. Kind of like Redis but can be used with big objects and has a simple API, built on top of Apache Arrow.
$ pip install brain-plasma
# process 1
from brain_plasma import Brain
brain = Brain()
brain['myvar'] = 657
# process 2
from brain_plasma import Brain
brain = Brain()
brain['myvar']
# >>> 657
Python 3.8 and later offer shared memory access between processes via multiprocessing.shared_memory. All you hand off between processes is a string naming the shared memory block. In the consuming process you get a memoryview object, which, unlike bytes or bytearray objects, can be sliced without copying the data. If you are using numpy, an array can be built on top of the memory block in an O(1) operation, allowing fast transfers of large blocks of numeric data. As far as I understand, generic objects still need to be deserialized, since the consuming process receives only a raw byte buffer.
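A standard-library-only sketch of that hand-off, assuming Python 3.8+ and a non-empty list of floats. publish and read_total are hypothetical names; in a real pipeline only the block name and the element count would be sent through a Queue.

```python
from multiprocessing import shared_memory

DOUBLE_SIZE = 8  # bytes per C double (memoryview format code "d")


def publish(samples):
    """Producer: copy samples into a new shared block and return its handle.

    The producer must keep the handle alive and eventually call close()
    and unlink() on it.
    """
    nbytes = DOUBLE_SIZE * len(samples)
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    # The block may be larger than requested, so slice to the exact size first.
    with shm.buf[:nbytes].cast("d") as view:
        for i, x in enumerate(samples):
            view[i] = x
    return shm


def read_total(name, count):
    """Consumer: attach by name and sum the doubles without copying the block."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        # The view must be released before close(), hence the with block.
        with shm.buf[: DOUBLE_SIZE * count].cast("d") as view:
            return sum(view)
    finally:
        shm.close()  # detach only; the producer owns unlink()
```

With numpy, the cast would typically be replaced by np.ndarray(shape, dtype=dtype, buffer=shm.buf), which wraps the same memory without copying it.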
I'm looking to convert a large directory of high resolution images (several million) into thumbnails using Python. I have a DynamoDB table that stores the location of each image in S3.
Instead of processing all these images on one EC2 instance (would take weeks) I'd like to write a distributed application using a bunch of instances.
What techniques could I use to write a queue that would allow a node to "check out" an image from the database, resize it, and update the database with the new dimensions of the generated thumbnails?
Specifically I'm worried about atomicity and concurrency -- how can I prevent two nodes from checking out the same job at the same time with DynamoDB?
One approach you could take would be to use Amazon's Simple Queue Service (SQS) in conjunction with DynamoDB. So what you could do is write messages to the queue that contain something like the hash key of the image entry in DynamoDB. Each instance would periodically check the queue and grab messages off it. When an instance grabs a message off the queue, the message becomes invisible to other instances for a given amount of time. You can then look up and process the image and delete the message from the queue. If for some reason something goes wrong with processing the image, the message will not be deleted and it will become visible again for other instances to grab.
Another, probably more complicated, approach would be to use DynamoDB's conditional update mechanism to implement a locking scheme. For example, you could add something like a 'beingProcessed' attribute to your data model that is either 0 or 1. The first thing an instance could do is perform a conditional update on this attribute, changing the value to 1 iff the initial value is 0. There is probably more to do here around making it a proper/robust locking mechanism....
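The receive, process, delete cycle could be sketched as follows. The sqs argument is assumed to be a boto3 SQS client (for example boto3.client("sqs")), and queue_url and handle are hypothetical placeholders supplied by the caller; the timeout values are illustrative.

```python
def process_one_message(sqs, queue_url, handle, visibility_timeout=300):
    """Receive at most one message, process it, and delete it on success.

    Returns True if a message was handled, False if the queue was empty.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=10,  # long polling instead of busy checking
        VisibilityTimeout=visibility_timeout,
    )
    messages = response.get("Messages", [])
    if not messages:
        return False
    message = messages[0]
    # If handle() raises, the delete below is skipped and the message becomes
    # visible to other instances again once the visibility timeout expires.
    handle(message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```

The visibility timeout should comfortably exceed the time needed to resize one image, otherwise a second instance may pick up the same job while the first is still working on it.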
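A sketch of that conditional update, assuming a boto3 low-level DynamoDB client is passed in as dynamodb. The key attribute name imageId is hypothetical, and treating a missing beingProcessed attribute as "free" is my assumption.

```python
def try_checkout(dynamodb, table, image_id):
    """Atomically flip beingProcessed from 0 (or absent) to 1.

    Returns True if this caller won the job, False if another node holds it.
    """
    try:
        dynamodb.update_item(
            TableName=table,
            Key={"imageId": {"S": image_id}},  # hypothetical key attribute
            UpdateExpression="SET beingProcessed = :one",
            ConditionExpression=(
                "attribute_not_exists(beingProcessed) OR beingProcessed = :zero"
            ),
            ExpressionAttributeValues={":one": {"N": "1"}, ":zero": {"N": "0"}},
        )
        return True
    except Exception as exc:
        # botocore's ClientError carries the service error code in .response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise
```

This only covers acquiring the lock. A robust scheme would also need a way to release or expire the flag when a node crashes mid-job.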
Using DynamoDB's optimistic locking with versioning would allow a node to "check out" a job by updating a status field to "InProgress". If a different node tried checking out the same job by updating the status field, it would receive an error and would know to retrieve a different job.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/JavaVersionSupportHLAPI.html
I know this is an old question, so this answer is more for the community than the original poster.
A good/cool approach is to use EMR for this. There is an inter-connection layer in EMR to connect Hive to DynamoDB. You can then walk through your table almost as you would with a SQL one and perform your operations.
There is a pretty good guide for it here: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html
It is for import/export but can be easily adapted.
Recently, DynamoDB released parallel scan:
http://aws.typepad.com/aws/2013/05/amazon-dynamodb-parallel-scans-and-other-good-news.html
Now, 10 hosts can read from the same table at the same time, each scanning its own segment, and DynamoDB guarantees they won't see the same items.
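A sketch of how one host could scan its own segment. The dynamodb argument is assumed to be a boto3 low-level DynamoDB client, and scan_segment is a hypothetical helper; each host would call it with a different segment number and the same total.

```python
def scan_segment(dynamodb, table, segment, total_segments):
    """Yield every item in one segment of a parallel scan, following pagination."""
    kwargs = {"TableName": table, "Segment": segment, "TotalSegments": total_segments}
    while True:
        page = dynamodb.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return  # no more pages in this segment
        kwargs["ExclusiveStartKey"] = last_key
```

With 10 hosts, host number i would iterate over scan_segment(client, table_name, i, 10) for i from 0 to 9.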