Briefly, Octopy and mincemeatpy are lightweight Python implementations of MapReduce, and clients can join the cluster in an ad-hoc manner without requiring any installation (except, of course, Python). Here are the project details: Octopy and Mincemeatpy.
The problem with these is that they need to hold the entire data set in memory (including the intermediate key-value pairs), so even for moderately sized data they throw out-of-memory exceptions.
The key-reasons I'm using them are:
Python.
No cluster installation required.
I'm just prototyping, and I can directly port the algorithm once I'm ready.
So my question is: is there any package which handles the same stuff, but not just in memory, i.e. one that can handle moderately sized data?
Try PyMapReduce. It runs on your own machine, but across several processes, so you don't need to set up a master-node architecture, and it has plenty of runners, for example DiskBasedRunner, which appears to store map data in temp files and then reduce from them.
Related
I want to run two Python processes in parallel, with each being able to send data to and receive data from the other at any time. Python's multiprocessing package seems to have multiple solutions to this, such as Queue, Pipe, and SharedMemory. What are the pros and cons of using each of these, and which one would be best for accomplishing this specific goal?
It comes down to what you want to share, who you want to share it with, how often you want to share it, what your latency requirements are, your skill-set, your maintainability needs, and your preferences. Then there are the usual tradeoffs to be made between performance, legibility, upgradeability and so on.
If you are sharing native Python objects, they will generally be most simply shared via a "multiprocessing queue", because they will be pickled before transmission and unpickled on receipt.
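As a minimal sketch of the queue approach (the worker and the payload shape here are illustrative, not part of any particular API):

```python
import multiprocessing as mp

def worker(q_in, q_out):
    # Any picklable Python object can travel through a multiprocessing queue;
    # it is pickled on put() and unpickled on get().
    task = q_in.get()
    q_out.put({"doubled": [x * 2 for x in task["values"]]})

if __name__ == "__main__":
    q_in, q_out = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(q_in, q_out))
    p.start()
    q_in.put({"values": [1, 2, 3]})
    print(q_out.get())  # {'doubled': [2, 4, 6]}
    p.join()
```

Nothing special is needed on either side: any dict, list, or custom picklable class goes through unchanged.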
If you are sharing large arrays, such as images, you will likely find that "multiprocessing shared memory" has least overhead because there is no pickling involved. However, if you want to share such arrays with other machines across a network, shared memory will not work, so you may need to resort to Redis or some other technology. Generally, "multiprocessing shared memory" takes more setting up, and requires you to do more to synchronise access, but is more performant for larger data-sets.
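A minimal sketch of the shared-memory route using the standard library's multiprocessing.shared_memory (Python 3.8+). Here the second handle is opened in the same process for brevity; a real client process would attach by the same name:

```python
from multiprocessing import shared_memory

# Create a 1 KiB shared block; the OS assigns it a name other processes can use.
shm = shared_memory.SharedMemory(create=True, size=1024)
payload = b"large binary payload"
shm.buf[:len(payload)] = payload  # no pickling: bytes are written in place

# Another process would attach with shared_memory.SharedMemory(name=shm.name)
# and see the same memory.
other = shared_memory.SharedMemory(name=shm.name)
print(bytes(other.buf[:len(payload)]))  # b'large binary payload'

other.close()
shm.close()
shm.unlink()  # free the block once all attachments are closed
```

Note that you are responsible for synchronising readers and writers yourself (e.g. with a multiprocessing.Lock), which is part of the extra setup mentioned above.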
If you are sharing between Python and C/C++ or another language, you may elect to use protocol buffers and pipes, or again Redis.
As I said, there are many tradeoffs and opinions - far more than I have addressed here. The first thing though is to determine your needs in terms of bandwidth, latency, flexibility and then think about the most appropriate technology.
I'm writing a project which writes using the same data a ton of times, and I have been using ray to scale this up in a cluster setting, however the files are too large to send back and forth/save on the ray object store all the time. Is there a way to save the python objects on the local nodes between the calls of the remote functions?
Writing to files always tends to be tricky in distributed systems since regular file systems aren't shared between machines. Ray generally doesn't interfere with the file system, but I think you have a few options here.
Expand the object store size: You can change the plasma store size and where it's stored to a larger file by setting the --object-store-memory and --plasma-directory flags.
Use a distributed file system: Distributed filesystems like NFS allow you to share part of your filesystem across machines. If you manually set up an NFS share, you can direct Ray to write to a file within NFS.
Don't use a filesystem: While this is technically a non-answer, this is arguably the most typical approach to distributed systems. Instead of writing to your filesystem, consider writing to S3 or similar KV store or Blob Store.
Downsides of these approaches:
The biggest downside of (1) is that if you aren't careful, you could badly affect your performance.
The biggest downside of (2) is that it can be slow, in particular if you need to read and write data from multiple nodes. A secondary downside is that you will have to set up NFS yourself.
The biggest downside to (3) is that you're now relying on an external service, and it arguably isn't a direct solution to your problem.
I intend to make a program structure like the one below:
PS1 is a python program persistently running. PC1, PC2, PC3 are client python programs. PS1 has a variable hashtable, whenever PC1, PC2... asks for the hashtable the PS1 will pass it to them.
The intention is to keep the table in memory, since it is a huge variable (taking 10 GB of memory) and it is expensive to calculate every time. It is not feasible to store it on disk (using pickle or JSON) and read it every time it is needed; the read just takes too long.
So I was wondering if there is a way to keep a python variable persistently in the memory, so it can be used very fast whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance constraints do not allow simply reading the full set from permanent storage each time
IMHO, we are exactly facing what databases were created for. For common use cases, having many processes all using their own copy of a 10G object is a memory waste, and the common way is that one single process have the data, and the others send requests for the data. You did not describe your problem enough, so I cannot say if the best solution will be:
a SQL database like PostgreSQL or MariaDB - as they can cache, if you have enough memory, all will be held automatically in memory
a NoSQL database (MongoDB, etc.) if your only (or main) need is single-key access - very nice when dealing with a lot of data requiring fast but simple access
a dedicated server using a dedicated query language, if your needs are very specific and none of the above solutions meet them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database and the number of client process and the external load of the whole system never increase to a level where you fall in the swapping problem above
TL/DR: My advice is to first experiment with the performance of a good-quality database and optionally a dedicated cache. Those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work, carefully analyze the memory requirements, be sure to document the limits on the number of client processes and database size for future maintenance, and use shared memory - read-only data being a hint that shared memory can be a nice solution.
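As a toy illustration of the "one process holds the data, clients send requests" idea, here is a sketch using the standard library's sqlite3 in place of the heavier servers named above (the table layout and function names are made up for the example):

```python
import sqlite3

def build_table(path, items):
    # One-time expensive build: persist the hashtable as a keyed SQLite table.
    with sqlite3.connect(path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", items)

def lookup(path, key):
    # Any client process opens its own connection and fetches only the key it
    # needs, instead of holding its own 10 GB copy of the whole table.
    with sqlite3.connect(path) as conn:
        row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else None
```

The same access pattern carries over to PostgreSQL, MariaDB, or MongoDB; those servers additionally keep hot pages cached in memory for you.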
In short, to accomplish what you are asking, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class of object that provides the hashtable interface, through which the individual variables in the table are accessed, and pass it separately to each of the PC# processes that read from the shared RawArray.
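A rough sketch of the RawArray idea, assuming the whole pickled table fits in the preallocated buffer (the store/load helpers are invented for illustration; each reader still unpickles its own copy, but the shared buffer avoids resending the data over a pipe or rereading it from disk):

```python
import pickle
from multiprocessing.sharedctypes import RawArray

def store(table, buf):
    # Pickle the hashtable into the shared buffer, prefixed by its length.
    blob = pickle.dumps(table)
    if len(blob) + 8 > len(buf):
        raise ValueError("RawArray too small for the pickled table")
    buf[:8] = len(blob).to_bytes(8, "little")
    buf[8:8 + len(blob)] = blob

def load(buf):
    # Child processes that inherited buf can rebuild the table from shared
    # memory without any inter-process copying of the raw bytes.
    n = int.from_bytes(buf[:8], "little")
    return pickle.loads(bytes(buf[8:8 + n]))

if __name__ == "__main__":
    buf = RawArray("c", 1 << 20)  # 1 MiB of shared, lock-free memory
    store({"answer": 42}, buf)
    print(load(buf))  # {'answer': 42}
```

RawArray provides no locking, which is fine here because the PC# processes only read.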
I'm new to using multiple CPUs to process jobs and was wondering if people could let me know the pros/cons of Parallel Python (or any similar Python module) versus Hadoop Streaming?
I have a very large, CPU-intensive process that I would like to spread across several servers.
Since moving data becomes harder and harder as it grows, data locality becomes very important in parallel computing. Hadoop, as a MapReduce framework, maximizes the locality of the data being processed. It also gives you a way to spread your data efficiently across your cluster (HDFS). So basically, even if you use other parallel modules, as long as your data isn't local to the computers doing the processing, or as long as you have to move your data across the cluster all the time, you won't get the maximum benefit from parallel computing. That's one of the key ideas behind Hadoop.
The main difference is that Hadoop is good at processing big data (dozens of gigabytes to terabytes). It provides a simple logical framework called MapReduce, which is well suited to data aggregation, and a distributed storage system called HDFS.
If your inputs are smaller than 1 gigabyte, you probably don't want to use Hadoop.
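For context, a Hadoop Streaming job just runs your scripts over stdin/stdout with tab-separated key-value lines, and Hadoop sorts the mapper output by key before the reducer sees it. A word-count pair of mapper/reducer functions might look like this sketch:

```python
from itertools import groupby

def mapper(stream):
    # Streaming mapper: emit one "word<TAB>1" line per word.
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def reducer(stream):
    # Streaming reducer: input arrives sorted by key, so equal words are
    # adjacent and groupby can sum their counts.
    pairs = (line.rsplit("\t", 1) for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Locally, "sorted" plays the role of Hadoop's shuffle/sort phase.
    mapped = sorted(mapper(["to be or not to be"]))
    print("\n".join(reducer(mapped)))
```

In a real job these would be two scripts reading sys.stdin and printing to stdout, passed to Hadoop via the streaming jar's -mapper and -reducer options.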
I have a network of 100 machines, all running Ubuntu Linux.
On a continuous (streaming) basis, machine X is 'fed' with some real-time data. I need to write a python script that would get the data as input, load it in-memory, process it, and then save it to disk.
It's a lot of data, hence I would ideally want to split the data in memory (using some logic) and send pieces of it to each individual computer in the fastest possible way. Each individual computer will accept its piece of data, handle it, and write it to its local disk.
Suppose I have a container of data in Python (be it a list, a dictionary etc), already processed and split to pieces. What is the fastest way to send each 'piece' of data to each individual machine?
You should take a look at pyzmq:
http://www.zeromq.org/bindings:python
and general guides to zeromq (0mq)
http://nichol.as/zeromq-an-introduction
http://www.zeromq.org/
You have two (classes of) choices:
You could build some distribution mechanism yourself.
You could use an existing tool to handle the distribution and storage.
In the simplest case, you write a program on each machine in your network that simply listens, processes and writes. You distribute from X to each machine in your pool round-robin. But, you might want to address higher-level concerns like handling node failures or dealing with requests that take longer to process than others, adding new nodes to the system, etc.
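The round-robin part of that simplest case can be sketched in a few lines (the node names and function are placeholders, not any particular library's API):

```python
import itertools

def round_robin_assign(items, nodes):
    # Deal items out to nodes in turn, like dealing cards around a table.
    assignment = {node: [] for node in nodes}
    for item, node in zip(items, itertools.cycle(nodes)):
        assignment[node].append(item)
    return assignment

if __name__ == "__main__":
    print(round_robin_assign(["a", "b", "c", "d", "e"], ["node1", "node2"]))
    # {'node1': ['a', 'c', 'e'], 'node2': ['b', 'd']}
```

The missing pieces are exactly the higher-level concerns mentioned above: detecting a dead node and reassigning its share, and rebalancing when nodes join.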
As you want more functionality, you'll probably want to find some existing tool to help you. It sounds like you might want to investigate some combinations of AMQP (for reliable messaging), Hadoop (for distributed data processing) or more complete NoSQL solutions like Cassandra or Riak. By leveraging these tools, your system will be significantly more robust than what you could probably build out yourself.
What you want is a message queue like RabbitMQ. It is easy to add consumers and producers to a queue. A consumer can either poll or get notified through a callback...