I'm working on a project that does a ton of writes using the same data, and I have been using Ray to scale this up in a cluster setting. However, the files are too large to send back and forth or keep in the Ray object store all the time. Is there a way to keep the Python objects on the local nodes between calls to the remote functions?
Writing to files always tends to be tricky in distributed systems since regular file systems aren't shared between machines. Ray generally doesn't interfere with the file system, but I think you have a few options here.
Expand the object store size: You can increase the plasma store size and change where it is backed on disk by setting the --object-store-memory and --plasma-directory flags (a short sketch follows the downsides below).
Use a distributed file system: Distributed filesystems like NFS allow you to share part of your filesystem across machines. If you manually set up an NFS share, you can direct Ray to write to a file within NFS.
Don't use a filesystem: While this is technically a non-answer, this is arguably the most typical approach to distributed systems. Instead of writing to your filesystem, consider writing to S3 or a similar key-value or blob store.
Downsides of these approaches:
The biggest downside of (1) is that if you aren't careful, you could badly affect your performance.
The biggest downside of (2) is that it can be slow, in particular if you need to read and write data from multiple nodes. A secondary downside is that you will have to set up NFS yourself.
The biggest downside to (3) is that you're now relying on an external service, and it arguably isn't a direct solution to your problem.
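For option (1), here is a minimal sketch of what the configuration looks like in code, assuming a Ray version where ray.init accepts an object_store_memory argument in bytes (the size and directory below are only illustrative):

    import ray

    # Single node: ask Ray for a larger plasma store up front (here ~50 GB).
    ray.init(object_store_memory=50 * 1024**3)

    # On a cluster you would instead pass the equivalent flags to `ray start`
    # on each node, e.g.:
    #   ray start --head --object-store-memory=50000000000 \
    #             --plasma-directory=/mnt/big_disk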
I want to run two Python processes in parallel, with each being able to send data to and receive data from the other at any time. Python's multiprocessing package seems to have multiple solutions to this, such as Queue, Pipe, and SharedMemory. What are the pros and cons of using each of these, and which one would be best for accomplishing this specific goal?
It comes down to what you want to share, who you want to share it with, how often you want to share it, what your latency requirements are, your skill-set, your maintainability needs, and your preferences. Then there are the usual tradeoffs to be made between performance, legibility, upgradeability and so on.
If you are sharing native Python objects, they will generally be most simply shared via a "multiprocessing queue" because they will be packaged up before transmission and unpackaged on receipt.
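A minimal sketch of that queue approach (the dict is pickled on put() and unpickled on get() behind the scenes):

    from multiprocessing import Process, Queue

    def worker(q):
        # any picklable Python object can go through the queue
        q.put({"status": "done", "values": [1, 2, 3]})

    if __name__ == "__main__":
        q = Queue()
        p = Process(target=worker, args=(q,))
        p.start()
        print(q.get())   # {'status': 'done', 'values': [1, 2, 3]}
        p.join()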
If you are sharing large arrays, such as images, you will likely find that "multiprocessing shared memory" has least overhead because there is no pickling involved. However, if you want to share such arrays with other machines across a network, shared memory will not work, so you may need to resort to Redis or some other technology. Generally, "multiprocessing shared memory" takes more setting up, and requires you to do more to synchronise access, but is more performant for larger data-sets.
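A hedged sketch of the shared-memory route, using the standard library's multiprocessing.shared_memory (Python 3.8+) together with NumPy; only the block's name, shape and dtype cross the process boundary, not the array bytes:

    import numpy as np
    from multiprocessing import Process, shared_memory

    def consumer(name, shape, dtype):
        shm = shared_memory.SharedMemory(name=name)
        view = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        print(view.sum())          # reads directly from the shared block
        shm.close()

    if __name__ == "__main__":
        frame = np.ones((1080, 1920, 3), dtype=np.uint8)   # e.g. one image
        shm = shared_memory.SharedMemory(create=True, size=frame.nbytes)
        np.ndarray(frame.shape, dtype=frame.dtype, buffer=shm.buf)[:] = frame
        p = Process(target=consumer, args=(shm.name, frame.shape, frame.dtype))
        p.start()
        p.join()
        shm.close()
        shm.unlink()               # the creating process is responsible for cleanup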
If you are sharing between Python and C/C++ or another language, you may elect to use protocol buffers and pipes, or again Redis.
As I said, there are many tradeoffs and opinions - far more than I have addressed here. The first thing though is to determine your needs in terms of bandwidth, latency, flexibility and then think about the most appropriate technology.
Does hdf5 support parallel writes to the same file, from different threads or from different processes? Alternatively, does hdf5 support non-blocking writes?
If so, is this also supported by NetCDF4, and by the Python bindings for either?
I am writing an application where I want different CPU cores to concurrently compute output intended for non-overlapping tiles of a very large output array. (Later I will want to read sections from it as a single array, without needing my own driver to manage indexing many separate files, and ideally without the additional IO task of rearranging it on disk.)
Not trivially, but there are various potential workarounds.
The ordinary HDF5 library apparently does not even support concurrent reading of different files by multiple threads. Consequently NetCDF4, and the Python bindings for either library, will not support parallel writing.
If the output file is pre-initialised and has chunking and compression disabled, to avoid having a chunk index, then (in principle) concurrent non-overlapping writes to the same file by separate processes might work(?).
In more recent versions of HDF5, there should be support for virtual datasets. Each process would write output to a different file, and afterward a new container file would be created, consisting of references to the individual data files (but otherwise able to be read like a normal HDF5 file).
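A hedged sketch of that virtual-dataset approach using h5py (requires HDF5 >= 1.10); the tile file names and the "data" dataset name are made up for illustration:

    import h5py

    n_tiles, tile_len = 4, 1000

    # tile_0.h5 ... tile_3.h5 are assumed to have been written already by the
    # worker processes, each holding a 1-D dataset "data" of length tile_len.
    layout = h5py.VirtualLayout(shape=(n_tiles, tile_len), dtype="f8")
    for i in range(n_tiles):
        layout[i] = h5py.VirtualSource(f"tile_{i}.h5", "data", shape=(tile_len,))

    with h5py.File("container.h5", "w", libver="latest") as f:
        f.create_virtual_dataset("data", layout, fillvalue=0)
    # container.h5 can now be read like an ordinary (n_tiles, tile_len) dataset.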
There exists a "Parallel HDF5" library for MPI. Although MPI might otherwise seem like overkill, it would have advantages if scaling up later to multiple machines.
If writing output is not a performance bottleneck, a multithreaded application could probably implement one output thread (utilising some form of queue data-structure).
[Edit:] Another option is to use the zarr format instead, which places each chunk in a separate file (an approach which future versions of HDF currently seem likely to adopt).
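A minimal sketch of the zarr alternative; because every chunk is a separate file, non-overlapping tiles can be written from different processes without coordinating on a single HDF5 writer (the shapes and store path are illustrative):

    import numpy as np
    import zarr

    # create the output array once, chunked so each tile maps onto whole chunks
    zarr.open("output.zarr", mode="w",
              shape=(8000, 8000), chunks=(1000, 1000), dtype="f8")

    def write_tile(row, col):
        # called from separate worker processes; each writes whole chunks only
        z = zarr.open("output.zarr", mode="r+")
        z[row:row + 1000, col:col + 1000] = np.random.rand(1000, 1000)

    write_tile(0, 0)   # in practice, dispatch these calls across processes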
If you are running in AWS, check out HDF Cloud: https://www.hdfgroup.org/solutions/hdf-cloud.
This is a service that enables multiple reader/multiple writer workflows and is largely feature compatible with the HDF5 library.
The client SDK doesn't support non-blocking writes, but of course if you are using the REST API directly you could do non-blocking I/O just like you would with any http-based service.
I intend to make a program structured like the one described below.
PS1 is a Python program that runs persistently. PC1, PC2, PC3 are client Python programs. PS1 holds a hashtable variable; whenever PC1, PC2, ... ask for the hashtable, PS1 passes it to them.
The intention is to keep the table in memory, since it is a huge variable (it takes 10 GB of memory) and it is expensive to calculate every time. It is not feasible to store it on the hard disk (using pickle or json) and read it every time it is needed; the read just takes too long.
So I was wondering if there is a way to keep a python variable persistently in the memory, so it can be used very fast whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set that is expensive to build
different processes need to use the dataset
performance constraints do not allow simply reading the full set from permanent storage
IMHO, this is exactly what databases were created for. For common use cases, having many processes each keep their own copy of a 10 GB object is a waste of memory; the usual approach is that one single process holds the data and the others send requests for it. You did not describe your problem in enough detail, so I cannot say whether the best solution will be:
a SQL database like PostgreSQL or MariaDB - since they can cache, everything will automatically be held in memory if you have enough of it
a NOSQL database (MongoDB, etc.) if your only (or main) need is single key access - very nice when dealing with lot of data requiring fast but simple access
a dedicated server using a dedicated query language, if your needs are very specific and none of the above solutions meet them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database, the number of client processes, and the external load on the whole system never grow to a level where you run into the swapping problem above
TL/DR: My advice is to first measure the performance you get with a good quality database and optionally a dedicated cache; those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work should you turn to shared memory: carefully analyze the memory requirements, and be sure to document the limits on the number of client processes and the database size for future maintenance. Read-only data is a hint that shared memory can be a nice solution.
In short, to accomplish what you are asking about, you need to create a byte array as a RawArray (from the multiprocessing.sharedctypes module) in the PS1 server that is large enough to hold your entire hashtable, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which then inherit access to the RawArray. You can write your own class that provides the hashtable interface, reading individual entries from the shared RawArray, and pass an instance of it to each of the PC# processes.
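A hedged sketch of that idea, simplified so it fits here: the parent (PS1) pickles the table into one RawArray, and each child (PC1, PC2, ...) inherits the buffer instead of receiving a copy over a pipe. Note that for brevity this sketch unpickles the whole table in each client, which makes a private copy; the custom hashtable-interface class described above would instead look entries up in the shared buffer in place:

    import pickle
    from multiprocessing import Process
    from multiprocessing.sharedctypes import RawArray

    def client(shared, nbytes):
        table = pickle.loads(bytes(shared[:nbytes]))   # simplified: makes a local copy
        print(len(table))

    if __name__ == "__main__":
        hashtable = {i: i * i for i in range(1_000_000)}   # stand-in for the 10 GB table
        blob = pickle.dumps(hashtable, protocol=pickle.HIGHEST_PROTOCOL)
        shared = RawArray('B', len(blob))                  # one shared, lock-free buffer
        shared[:] = blob
        workers = [Process(target=client, args=(shared, len(blob))) for _ in range(3)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()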
Why does the multiprocessing package for Python pickle objects in order to pass them between processes, i.e. to return results from different processes to the main interpreter process? This may be an incredibly naive question, but why can't process A say to process B "object x is at point y in memory, it's yours now" without having to perform the work needed to represent the object as a string?
multiprocessing runs jobs in different processes. Processes have their own independent memory spaces, and in general cannot share data through memory.
To make processes communicate, you need some sort of channel. One possible channel would be a "shared memory segment", which pretty much is what it sounds like. But it's more common to use "serialization". I haven't studied this issue extensively but my guess is that the shared memory solution is too tightly coupled; serialization lets processes communicate without letting one process cause a fault in the other.
When data sets are really large, and speed is critical, shared memory segments may be the best way to go. The main example I can think of is video frame buffer image data (for example, passed from a user-mode driver to the kernel or vice versa).
http://en.wikipedia.org/wiki/Shared_memory
http://en.wikipedia.org/wiki/Serialization
Linux and other *NIX operating systems provide a built-in mechanism for sharing data via serialization: "domain sockets". These should be quite fast.
http://en.wikipedia.org/wiki/Unix_domain_socket
Since Python has pickle, which works well for serialization, multiprocessing uses that. pickle is a fast, binary format; it should be more efficient in general than a serialization format like XML or JSON. There are other binary serialization formats such as Google Protocol Buffers.
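As a small illustration of what that serialization step looks like, independent of multiprocessing itself:

    import pickle

    payload = {"task_id": 7, "result": [3.14, 2.71]}
    wire_bytes = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
    print(type(wire_bytes), len(wire_bytes))    # a compact binary blob
    print(pickle.loads(wire_bytes) == payload)  # True: an equal copy, not the same object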
One good thing about using serialization: it's about the same to share the work within one computer (to use additional cores) or to share the work between multiple computers (to use multiple computers in a cluster). The serialization work is identical, and network sockets work about like domain sockets.
EDIT: @Mike McKerns said, in a comment below, that multiprocessing can use shared memory sometimes. I did a Google search and found this great discussion of it: Python multiprocessing shared memory
Briefly, octopy and mincemeatpy are lightweight Python implementations of map-reduce, and clients can join the cluster in an ad-hoc manner without requiring any installation (except, of course, Python). Here are the project details: OCTOPY and Mincemeatpy.
The problem with these is that they need to hold all the data in memory (including the intermediate key-value pairs), so even for moderately sized data they throw out-of-memory exceptions.
The key-reasons I'm using them are:
Python.
No cluster installation required.
I just prototype, and I can directly port the algorithm once I'm ready.
So my question is: is there any package which handles the same stuff, but not just in memory (i.e. which can handle moderately sized data)?
Try PyMapReduce. It runs on your own machine, but across several processes, so you don't need to set up a master-node architecture, and it has plenty of runners, for example DiskBasedRunner, which seems to store map data to temp files and reduce from them afterwards.