Python3 shelve items iterator for multiprocessing - python

I am trying to analyse a large shelve with multiprocessing.Pool. Since it is opened in read-only mode it should be safe to share, but it seems that a large object is read first and then slowly dispatched through the pool. Can this be done more efficiently?
Here is a minimal example of what I'm doing. Assume that test_file.shelf already exists and is large (14GB+). I can see this snippet hogging 20GB of RAM, even though only a small part of the shelve ever needs to be read at once (there are many more items than processors).
from multiprocessing import Pool
import shelve

def func(key_val):
    print(key_val)

with shelve.open('test_file.shelf', flag='r') as shelf, \
        Pool(4) as pool:
    list(pool.imap_unordered(func, iter(shelf.items())))

Shelves are inherently not fast to open: they behave like a dictionary-backed object, and for large shelves opening takes a while because of how the backend works. To function as a dictionary-like object, each time you fetch a new item, that item is loaded into a separate dictionary in memory (see the shelve library reference).
Also from the docs:
(imap) For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
You are using the default chunksize of 1, which is causing it to take a long time to get through your very large shelve file. The docs suggest passing a larger chunksize instead of sending one item at a time to speed it up.
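For illustration, here is a minimal sketch of the same loop with an explicit chunksize; the value 256 is an arbitrary starting point for tuning, not a recommendation, and func is just a placeholder:
from multiprocessing import Pool
import shelve

def func(key_val):
    key, val = key_val
    return key  # placeholder for real per-item work

if __name__ == '__main__':
    with shelve.open('test_file.shelf', flag='r') as shelf, Pool(4) as pool:
        # chunksize is an illustrative value; tune it against your own data
        for _ in pool.imap_unordered(func, shelf.items(), chunksize=256):
            pass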
The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing. Unix file locking can be used to solve this, but this differs across Unix versions and requires knowledge about the database implementation used.
Last, just as a note, I'm not sure that your assumption about it being safe for multiprocessing is correct out of the box, depending on the implementation.
Edited:
As pointed out by juanpa.arrivillaga, this answer describes what is happening in the backend: your entire iterable may be consumed up front, which causes the large memory usage.
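If memory stays a problem, one hedged workaround (a sketch of my own, not something the answer above prescribes) is to send only the keys through the pool and let each worker open the shelf read-only for itself, which the docs quoted above say is safe for simultaneous reads:
from multiprocessing import Pool
import shelve

SHELF_PATH = 'test_file.shelf'  # path taken from the question

_worker_shelf = None

def _init_worker():
    # Each worker opens its own read-only handle once; the OS closes it on exit.
    global _worker_shelf
    _worker_shelf = shelve.open(SHELF_PATH, flag='r')

def func(key):
    value = _worker_shelf[key]
    return key  # placeholder for real processing of (key, value)

if __name__ == '__main__':
    with shelve.open(SHELF_PATH, flag='r') as shelf:
        keys = list(shelf.keys())  # keys are small compared to the values
    with Pool(4, initializer=_init_worker) as pool:
        for _ in pool.imap_unordered(func, keys, chunksize=256):
            pass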

Related

Python - alternatives for internal memory

I'm coding a program that requires high memory usage.
I use python 3.7.10.
During the program I create about 3GB of Python objects and modify them.
Some objects I create contain pointer to other objects.
Also, sometimes I need to deepcopy one object to create another.
My problem is that creating and modifying these objects takes a lot of time and causes some performance issues.
I wish I could do some of the creation and modification in parallel. However, there are some limitations:
the program is very CPU-bound and there is almost no IO/network usage - so the threading library will not help due to the GIL
the system I work with has no copy-on-write feature - so using the multiprocessing library spends a lot of time forking the process
the objects do not contain numbers and most of the work in the program is not mathematical - so I cannot benefit from numpy and ctypes
What can be a good alternative for this kind of memory to allow me to parallelize better my code?
Deepcopy is extremely slow in python. A possible solution is to serialize and load the objects from the disk. See this answer for viable options – perhaps ujson and cPickle. Furthermore, you can serialize and deserialize objects asynchronously using aiofiles.
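As a rough sketch of that serialize-and-reload idea, using only the standard pickle module (in Python 3, cPickle is folded into pickle); whether this actually beats deepcopy depends on the object graph, so benchmark it against your own data:
import pickle

def disk_copy(obj, path):
    # Write the object out and read it back, producing an independent copy.
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, 'rb') as f:
        return pickle.load(f)

original = {'nodes': [{'id': i, 'links': []} for i in range(3)]}  # illustrative data
clone = disk_copy(original, 'obj.pkl')  # hypothetical file name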
Can't you use your GPU RAM and use CUDA?
https://developer.nvidia.com/how-to-cuda-python
If it doesn't need to be realtime I'd use PySpark (see streaming section https://spark.apache.org/docs/latest/api/python/) and work with remote machines.
Can you tell me a bit about the application? Perhaps you're searching for something like the PyTorch framework (https://pytorch.org/).
You may also like to try using Transparent Huge Pages and a hugepage-aware allocator, such as tcmalloc. That may speed up your application by 5-15% without having to change a line of code.
See thp-usage for more information.

What are the different ways to access really large csv files?

I had been working on a project where I had to read and process very large csv files with millions of rows as fast as possible.
I came across the link: https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author has benchmarked different ways of accessing csv and the time taken for each step.
He has used a catdevnull process with the code as shown:
def catDevNull():
    os.system('cat %s > /dev/null' % fn)
The time taken in this case is the least. I believe it is independent of the python version as the time taken to read the file remains the same. Then he uses the warm cache method as shown:
def wc():
    os.system('wc -l %s > /dev/null' % fn)
The above two methods are the fastest. Using pandas.read_csv for the task takes less time than the other methods, but it is still slower than the above two.
Putting x = os.system('cat %s > /dev/null' % fn) and checking the data type, it is a string.
How does os.system read the file such that the time taken is so much less? Also, is there a way to access the files after they are read by os.system for further processing?
I was also curious as to why reading the file is so much faster in pandas compared to the other methods shown in the above link.
os.system completely relinquishes the control you have in Python. There is no way to access anything which happened in the subprocess after it has finished.
A better way to have some (but not sufficient) control over a subprocess is to use the Python subprocess module. This allows you to interact with the running process using signals and I/O, but still, there is no way to affect the internals of a process unless it has a specific API for allowing you to do that. (Linux exposes some process internals in the /proc filesystem if you want to explore that.)
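For example, a small sketch using subprocess.run so that the child's output stays accessible in Python, which os.system never gives you; the file name is a placeholder:
import subprocess

fn = 'large_file.csv'  # placeholder path
# Unlike os.system, subprocess.run can capture the child's output
# so it is available for further processing in Python.
result = subprocess.run(['wc', '-l', fn], capture_output=True, text=True, check=True)
line_count = int(result.stdout.split()[0])
print(line_count)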
I don't think you understand what the benchmark means. The cat >/dev/null is a baseline which simply measures how quickly the system is able to read the file off the disk; your process cannot possibly be faster than the I/O channel permits, so this is the time that the system takes for doing nothing at all. You would basically subtract this time from the subsequent results before you compare their relative performance.
Conventionally, the absolutely fastest way to read a large file is to index it, then use the index in memory to seek to the position inside the file you want to access. Building the index causes some overhead, but if you access the file more than once, the benefits soon cancel out the overhead. Importing your file to a database is a convenient and friendly way to do this; the database encapsulates the I/O completely and lets you query the data as if you could ignore that it is somehow serialized into bytes on a disk behind the scenes.
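A hedged sketch of the import-then-index idea using the standard csv and sqlite3 modules; the file name and the assumption of a two-column CSV with a header row are purely illustrative:
import csv
import sqlite3

fn = 'large_file.csv'  # placeholder path; assumes a header row and two columns (id, value)

conn = sqlite3.connect('large_file.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (id TEXT, value TEXT)')
with open(fn, newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    conn.executemany('INSERT INTO rows VALUES (?, ?)', reader)
conn.commit()

# Subsequent lookups use the database's index instead of rescanning the CSV.
conn.execute('CREATE INDEX IF NOT EXISTS idx_id ON rows (id)')
for row in conn.execute('SELECT value FROM rows WHERE id = ?', ('42',)):
    print(row)
conn.close()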
Based on my testing, I found that it is a lot faster to query a pandas dataframe than to query the database (tested with sqlite3).
Thus, the fastest way is to load the csv as a pandas dataframe and then query the dataframe as required. Also, if I need to save the data, I can pickle the dataframe and reuse it later. The time to pickle and unpickle the file and run queries is much less than storing the data in SQL and then querying for the results.
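A short sketch of that workflow; the paths and the column used in the query are placeholders:
import pandas as pd

df = pd.read_csv('large_file.csv')      # placeholder path
df.to_pickle('large_file.pkl')          # save the parsed dataframe once

df = pd.read_pickle('large_file.pkl')   # later runs skip CSV parsing entirely
subset = df[df['value'] > 100]          # query in memory; column name is illustrative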

Passing variables between two python processes

I intend to make a program structure like the one below.
PS1 is a Python program that runs persistently. PC1, PC2, PC3 are client Python programs. PS1 holds a hashtable variable; whenever PC1, PC2, ... asks for the hashtable, PS1 passes it to them.
The intention is to keep the table in memory, since it is a huge variable (it takes 10G of memory) and it is expensive to calculate every time. It is not feasible to store it on the hard disk (using pickle or json) and read it every time it is needed; the read just takes too long.
So I was wondering if there is a way to keep a python variable persistently in the memory, so it can be used very fast whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance constraints do not allow simply reading the full set from permanent storage every time
IMHO, we are exactly facing what databases were created for. For common use cases, having many processes all using their own copy of a 10G object is a memory waste, and the common way is that one single process have the data, and the others send requests for the data. You did not describe your problem enough, so I cannot say if the best solution will be:
a SQL database like PostgreSQL or MariaDB - as they can cache, if you have enough memory, all will be held automatically in memory
a NOSQL database (MongoDB, etc.) if your only (or main) need is single key access - very nice when dealing with lot of data requiring fast but simple access
a dedicated server using a dedicated query language if your needs are very specific and none of the above solutions meet them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database and the number of client process and the external load of the whole system never increase to a level where you fall in the swapping problem above
TL/DR: My advice is to first experiment with the performance of a good quality database and optionally a dedicated cache. Those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work, carefully analyze the memory requirements, document the limits on the number of client processes and database size for future maintenance, and use shared memory - read-only data being a hint that shared memory can be a nice solution.
In short, to accomplish what you are asking about, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class of object that provides the hashtable interface, through which the individual variables in the table are accessed, and pass it separately to each of the PC# processes so that they read from the shared RawArray.
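As a minimal sketch of that idea (not the full hashtable interface described above), the parent could pickle the table once into a shared RawArray of bytes that the child processes inherit and deserialize; the example data is illustrative:
import ctypes
import pickle
from multiprocessing import Process
from multiprocessing.sharedctypes import RawArray

def make_shared_table(table):
    # Serialize the table once and copy it into a shared-memory byte array.
    payload = pickle.dumps(table, protocol=pickle.HIGHEST_PROTOCOL)
    buf = RawArray(ctypes.c_char, len(payload))
    buf.raw = payload
    return buf

def client(buf):
    # Child processes inherit the RawArray and deserialize from it on demand.
    table = pickle.loads(buf.raw)
    print(table['answer'])

if __name__ == '__main__':
    shared = make_shared_table({'answer': 42})  # illustrative data
    workers = [Process(target=client, args=(shared,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()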

How to share objects and data between python processes in real-time?

I'm trying to find a reasonable approach in Python for a real-time application, multiprocessing and large files.
A parent process spawns 2 or more children. The first child reads data and keeps it in memory, and the others process it in a pipeline fashion. The data should be organized into an object, sent to the following process, processed, sent, processed, and so on.
Available mechanisms such as Pipe, Queue, and Managers seem inadequate due to overheads (serialization, etc.).
Is there an adequate approach for this?
I've used Celery and Redis for real-time multiprocessing in high memory applications, but it really depends on what you're trying to accomplish.
The biggest benefits I've found in Celery over built-in multiprocessing tools (Pipe/Queue) are:
Low overhead. You call a function directly, no need to serialize data.
Scaling. Need to ramp up worker processes? Just add more workers.
Transparency. Easy to inspect tasks/workers and find bottlenecks.
For really squeezing out performance, ZMQ is my go to. A lot more work to set up and fine-tune, but it's as close to bare sockets as you can safely get.
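For a feel of what a minimal ZMQ pipeline stage looks like, here is a sketch using pyzmq's PUSH/PULL sockets; both ends are shown in one process purely for brevity, and the port number is arbitrary:
import zmq

ctx = zmq.Context()

# Producer end of a pipeline stage.
push = ctx.socket(zmq.PUSH)
push.bind('tcp://127.0.0.1:5557')

# Consumer end; in a real pipeline this would live in another process.
pull = ctx.socket(zmq.PULL)
pull.connect('tcp://127.0.0.1:5557')

push.send_pyobj({'frame': 1, 'data': [0.1, 0.2]})  # pickles the object for transport
print(pull.recv_pyobj())

push.close()
pull.close()
ctx.term()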
Disclaimer: This is all anecdotal. It really comes down to what your specific needs are. I'd benchmark different options with sample data before you go down any path.
First, a suspicion that message-passing may be inadequate because of all the overhead is not a good reason to overcomplicate your program. It's a good reason to build a proof of concept and come up with some sample data and start testing. If you're spending 80% of your time pickling things or pushing stuff through queues, then yes, that's probably going to be a problem in your real life code—assuming the amount of work your proof of concept does is reasonably comparable to your real code. But if you're spending 98% of your time doing the real work, then there is no problem to solve. Message passing will be simpler, so just use it.
Also, even if you do identify a problem here, that doesn't mean that you have to abandon message passing; it may just be a problem with what's built in to multiprocessing. Technologies like 0MQ and Celery may have lower overhead than a simple queue. Even being more careful about what you send over the queue can make a huge difference.
But if message passing is out, the obvious alternative is data sharing. This is explained pretty well in the multiprocessing docs, along with the pros and cons of each.
Sharing state between processes describes the basics of how to do it. There are other alternatives, like using mmapped files or platform-specific shared memory APIs, but there's not much reason to do that over multiprocessing unless you need, e.g., persistent storage between runs.
There are two big problems to deal with, but both can be dealt with.
First, you can't share Python objects, only simple values. Python objects have internal references to each other all over the place, the garbage collector can't see references to objects in other processes' heaps, and so on. So multiprocessing.Value can only hold the same basic kinds of native values as array.array, and multiprocessing.Array can hold (as you'd guess by the name) 1D arrays of the same values, and that's it. For anything more complicated, if you can define it in terms of a ctypes.Structure, you can use multiprocessing.sharedctypes (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.sharedctypes), but this still means that any references between objects have to be indirect. (For example, you often have to store indices into an array.) (Of course none of this is bad news if you're using NumPy, because you're probably already storing most of your data in NumPy arrays of simple values, which are sharable.)
Second, shared data are of course subject to race conditions. And, unlike multithreading within a single process, you can't rely on the GIL to help protect you here; there are multiple interpreters that can all be trying to modify the same data at the same time. So you have to use locks or conditions to protect things.
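A small sketch of shared state guarded by a lock, along the lines of the multiprocessing docs; the counter is purely illustrative:
from multiprocessing import Process, Value, Lock

def worker(counter, lock):
    for _ in range(1000):
        # The lock guards the read-modify-write; the GIL does not help here
        # because each process has its own interpreter.
        with lock:
            counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)   # a shared C int
    lock = Lock()
    procs = [Process(target=worker, args=(counter, lock)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)      # 4000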
For multiprocessing pipeline check out MPipe.
For shared memory (specifically NumPy arrays) check out numpy-sharedmem.
I've used these to do high-performance realtime, parallel image processing (average accumulation and face detection using OpenCV) while squeezing out all available resources from a multi-core CPU system. Check out Sherlock if interested. Hope this helps.
One option is to use something like brain-plasma that maintains a shared-memory object namespace that is independent of the Python process or thread. Kind of like Redis but can be used with big objects and has a simple API, built on top of Apache Arrow.
$ pip install brain-plasma
# process 1
from brain_plasma import Brain
brain = Brain()
brain['myvar'] = 657
# process 2
from brain_plasma import Brain
brain = Brain()
brain['myvar']
# >>> 657
Python 3.8 now offers shared memory access between processes using multiprocessing.shared_memory. All you hand off between processes is a string that references the shared memory block. In the consuming process you get a memoryview object which supports slicing without copying the data like byte arrays do. If you are using numpy it can reference the memory block in an O(1) operation, allowing fast transfers of large blocks of numeric data. As far as I understand generic objects still need to be deserialized since a raw byte array is what's received by the consuming process.
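A brief sketch of that pattern with multiprocessing.shared_memory and a NumPy array; the array size is illustrative:
import numpy as np
from multiprocessing import Process, shared_memory

def consumer(name, shape, dtype):
    # Attach to the existing block by name; no data is copied.
    shm = shared_memory.SharedMemory(name=name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    print(arr.sum())
    shm.close()

if __name__ == '__main__':
    data = np.arange(1_000_000, dtype=np.float64)   # illustrative payload
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data                                # one copy into shared memory

    p = Process(target=consumer, args=(shm.name, data.shape, data.dtype))
    p.start()
    p.join()

    shm.close()
    shm.unlink()    # free the block once all processes are done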

Why does python multiprocessing pickle objects to pass objects between processes?

Why does the multiprocessing package for python pickle objects to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the operation necessary to represent the object as a string?
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux, and other *NIX operating systems, provide a built-in mechanism for sharing data via serialization: "domain sockets". This should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle that works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
EDIT: @Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory
