I have a network of 100 machines, all running Ubuntu Linux.
On a continuous (streaming) basis, machine X is 'fed' with some real-time data. I need to write a python script that would get the data as input, load it in-memory, process it, and then save it to disk.
It's a lot of data, so I would ideally want to split the data in memory (using some logic) and just send pieces of it to each individual computer, in the fastest possible way. Each individual computer will accept its piece of data, handle it, and write it to its local disk.
Suppose I have a container of data in Python (be it a list, a dictionary etc), already processed and split to pieces. What is the fastest way to send each 'piece' of data to each individual machine?
You should take a look at pyzmq:
http://www.zeromq.org/bindings:python
and general guides to zeromq (0mq)
http://nichol.as/zeromq-an-introduction
http://www.zeromq.org/
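For the concrete question of pushing pre-split pieces to many machines, the PUSH/PULL pipeline pattern is the simplest fit: machine X binds a PUSH socket, each worker connects a PULL socket, and zeromq load-balances the pieces round-robin across whoever is connected. A minimal sketch follows; the port, the "machine-x" hostname, and the use of pickle for serialization are my assumptions, not something from the question.

    # Sketch only - on machine X (the distributor)
    import pickle
    import zmq

    context = zmq.Context()
    sender = context.socket(zmq.PUSH)
    sender.bind("tcp://*:5557")            # workers connect to X on this port

    pieces = [{"id": i, "rows": []} for i in range(100)]   # placeholder pieces
    for piece in pieces:
        sender.send(pickle.dumps(piece))   # round-robined across connected PULL sockets

And on each worker machine, roughly:

    # Sketch only - on each worker machine
    import pickle
    import zmq

    context = zmq.Context()
    receiver = context.socket(zmq.PULL)
    receiver.connect("tcp://machine-x:5557")   # "machine-x" is a placeholder hostname

    while True:
        piece = pickle.loads(receiver.recv())
        # ... process the piece, then write it to local disk ...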
You have two (classes of) choices:
You could build some distribution mechanism yourself.
You could use an existing tool to handle the distribution and storage.
In the simplest case, you write a program on each machine in your network that simply listens, processes, and writes. You distribute from X to each machine in your pool round-robin. But you might also want to address higher-level concerns like handling node failures, dealing with requests that take longer to process than others, adding new nodes to the system, etc.
As you want more functionality, you'll probably want to find some existing tool to help you. It sounds like you might want to investigate some combinations of AMQP (for reliable messaging), Hadoop (for distributed data processing) or more complete NoSQL solutions like Cassandra or Riak. By leveraging these tools, your system will be significantly more robust than what you could probably build out yourself.
What you want is a message queue like RabbitMQ. It is easy to add consumers and producers to a queue. Consumers can either poll or get notified through a callback...
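As a rough illustration with pika (the host, queue name, and message body here are placeholders, not part of the question), the producer and a callback-driven consumer might look like this:

    # Sketch only - producer side
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="pieces", durable=True)
    channel.basic_publish(exchange="", routing_key="pieces", body=b"one piece of data")
    connection.close()

    # Sketch only - consumer side, notified through a callback
    import pika

    def handle(ch, method, properties, body):
        # ... process `body` and write it to local disk ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="pieces", durable=True)
    channel.basic_consume(queue="pieces", on_message_callback=handle)
    channel.start_consuming()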
I'm posting a more concise version of my question here; the original got flagged for being too broad.
I'm looking for a way, either native python or a framework, which will allow me to do the following:
Publish a webservice which an end customer can call like any other standard webservice (using curl, postman, requests, etc.)
This webservice will be accepting gigabytes (perhaps 10s of GB) of data per call.
While this data is being transmitted, I'd like to break it into chunks and spin off separate threads and/or processes to simultaneously work with it (my processing is complex but each chunk will be independent and self-contained)
Doing this will allow my logic to be running in parallel with the data upload across the internet, and avoid wasting all that time while the data is just being transmitted
It will also prevent the gigabytes (or tens of GB) from being loaded entirely into RAM before my logic even begins.
Original Question:
I'm trying to build a web service (in Python) which can accept potentially tens of gigabytes of data and process this data. I don't want this to be completely received and built into an in-memory object before passing to my logic as a) this will use a ton of memory, and b) the processing will be pretty slow and I'd love to have a processing thread working on chunks of the data while the rest of the data is being received asynchronously.
I believe I need some sort of streaming solution for this, but I'm having trouble finding any Python solution to handle this case. Most things I've found are about streaming the output (not an issue for me). It also seems like WSGI has design-level issues with streaming request bodies.
Is there a best practice for this sort of issue which I'm missing? And/or, is there a solution that I haven't found?
Edit: Since a couple of people asked, here's an example of the sort of data I'd be looking at. Basically I'm working with lists of sentences, which may be millions of sentences long. But each sentence (or group of sentences, for ease) is a separate processing task. Originally I had planned on receiving this as a json array like:
{"sentences: [
"here's a sentence",
"here's another sentence",
"I'm also a sentence"
]
}
For this modification I'm thinking it would just be newline-delimited sentences, since I don't really need the JSON structure. So in my head, my solution would be: I get a constant stream of characters, and whenever I hit a newline character, I split off the previous sentence and pass it to a worker thread or thread pool to do my processing. I could also do this in groups of many sentences to avoid having a ton of threads going at once. But the main thing is that while the main thread is receiving this character stream, it periodically splits off tasks so other threads can start the processing.
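A minimal sketch of that design as a plain WSGI app follows; the batch size, thread count, and the process_sentences name are placeholders I made up, and whether the body actually arrives incrementally depends on the server sitting in front of the application.

    # Sketch only - read the body in chunks, split off complete lines,
    # and hand batches of sentences to a thread pool while the upload continues.
    from concurrent.futures import ThreadPoolExecutor

    executor = ThreadPoolExecutor(max_workers=8)
    BATCH_SIZE = 1000            # sentences per task (made-up number)
    CHUNK_SIZE = 64 * 1024       # bytes read from the input stream at a time

    def process_sentences(batch):
        pass                     # placeholder for the proprietary processing

    def application(environ, start_response):
        stream = environ["wsgi.input"]
        buffer = b""
        batch = []
        while True:
            chunk = stream.read(CHUNK_SIZE)
            if not chunk:
                break
            buffer += chunk
            *lines, buffer = buffer.split(b"\n")    # keep the trailing partial line
            for line in lines:
                if line:
                    batch.append(line.decode("utf-8"))
                if len(batch) >= BATCH_SIZE:
                    executor.submit(process_sentences, batch)
                    batch = []
        if buffer:
            batch.append(buffer.decode("utf-8"))
        if batch:
            executor.submit(process_sentences, batch)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"accepted\n"]

Note that many WSGI servers buffer the full request before the application sees it, which is exactly the design issue mentioned above, so true streaming may require a server configured to pass the body through incrementally or an ASGI framework instead.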
Second Edit: I've had a few thoughts on how to process the data. I can't give tons of specific details as it's proprietary, but I could either store the sentences as they come in into Elasticsearch or some other database and have an async process work on that data, or (ideally) I'd just work with the sentences (in batches) in memory. Order is important, and not dropping any sentences is also important. The inputs will be coming from customers over the internet, though, which is why I'm trying to avoid a message-queue-like process, so there's not the overhead of a new call for each sentence.
Ideally, the customer of the webservice doesn't have to do anything particularly special other than do the normal POST request with a gigantic body, and all this special logic is server-side. My customers won't be expert software engineers so while a webservice call is perfectly within their wheelhouse, handling a more complex message queue process or something along those lines isn't something I want to impose on them.
Unless you share a little more about the type of data, the processing, or what other constraints your problem has, it's going to be very difficult to provide more tailored advice beyond pointing you to a couple of resources.
... Here is my attempt, hope it helps!
It seems like what you need is the following:
A message-passing or streaming system to deliver/receive the data
Optionally, an asynchronous task queue to fire up different processing tasks on the data
or even a custom data processing pipeline system
Messaging vs Streaming
Examples: RabbitMQ, Kombu (per #abolotnov's comment), Apache Kafka (and python ports), Faust
The main differences between messaging and streaming vary depending on the system/definition/who you ask, but in general:
- messaging: a "simple" system that takes care of sending/receiving single messages between two processes
- streaming: adds functionality like the ability to "replay", send mini-batches of groups of messages, process rolling windows, etc.
Messaging systems may also implement broadcasting (send a message to all receivers) and publish/subscribe scenarios, which come in handy if you don't want your publisher (the creator of the data) to keep track of whom to send the data to (the subscribers), or, alternatively, don't want your subscribers to keep track of whom to get the data from and when.
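As a rough sketch of the streaming/pub-sub side, here is what publishing and consuming could look like with kafka-python (one of the Python ports mentioned above); the topic name, broker address, and group id are my assumptions.

    # Sketch only - producer publishes without tracking subscribers
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sentences", b"here's a sentence")
    producer.flush()

    # Sketch only - consumers in a group share (or replay) the stream
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "sentences",
        bootstrap_servers="localhost:9092",
        group_id="workers",
        auto_offset_reset="earliest",   # lets a new consumer replay from the start
    )
    for message in consumer:
        print(message.value)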
Asynchronous task queue
Examples: Celery, RQ, Taskmaster
This will basically help you assign a set of tasks that may be the smaller chunks of the main processing you are intending to do, and then make sure these tasks get performed whenever new data pops up.
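For instance, with Celery (the broker URL and task body below are placeholders), the smaller chunks become tasks that workers pick up as soon as the data appears:

    # Sketch only - tasks.py
    from celery import Celery

    app = Celery("tasks", broker="amqp://guest@localhost//")

    @app.task
    def process_chunk(sentences):
        # ... the smaller chunk of the main processing goes here ...
        return len(sentences)

    # Whoever receives the data just enqueues work as it pops up:
    #   from tasks import process_chunk
    #   process_chunk.delay(["here's a sentence", "here's another sentence"])

Workers are started separately (e.g. with "celery -A tasks worker") and pull tasks off the broker as they are enqueued.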
Custom Data Processing Systems
I mainly have one in mind: Dask (official tutorial repo)
This is a system built very much for the kind of problem you seem to have on your hands: large amounts of information coming from some source (which may or may not be fully under your control) that needs to flow through a set of processing steps in order to be consumable by some other process (or stored).
Dask is something of a combination of the previous options, in that you define a computation graph (or task graph) with data sources and computation nodes that connect to each other, where some nodes may depend on others. Later, depending on the system you deploy it on, you can choose synchronous execution or different types of asynchronous execution for the tasks, while keeping this run-time implementation detail separate from the actual tasks to be performed. This means you could develop on your own computer and later decide to deploy the same pipeline on a cluster, changing only the "settings" of this run-time implementation.
Additionally, Dask basically imitates numpy / pandas / pyspark or whatever data processing framework you may already be using, so the syntax will be (in almost every case) virtually the same.
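A tiny sketch with dask.bag (the file pattern and the per-sentence function are made up) shows the graph-then-execute split: the map/filter calls only build the task graph, and compute() runs it on whatever scheduler you configured.

    # Sketch only - dask.bag mimics the usual map/filter style
    import dask.bag as db

    def process_sentence(sentence):
        return len(sentence.split())        # stand-in for the real work

    sentences = db.read_text("sentences-*.txt")              # lazily defines the source
    lengths = sentences.map(str.strip).filter(bool).map(process_sentence)
    result = lengths.sum().compute()                          # only now does work run
    print(result)

The same pipeline can later be pointed at a distributed scheduler without changing the task code, only the run-time configuration.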
I'm working on a robot that uses a CNN that needs much more memory than my embedded computer (Jetson TX1) can handle. I was wondering if it would be possible (with an extremely low-latency connection) to outsource the heavy computations to EC2 and send the results back to be used in a Python script. If this is possible, how would I go about it, and what would the latency look like (not the computations, just the sending to and from)?
I think it's certainly possible. You would need some scripts or a web server to transfer data to and from. Here is how I think you might achieve it:
Send all your training data to an EC2 instance
Train your CNN
Save the weights and/or any other generated parameters you may need
Construct the CNN on your embedded system and load in the weights from the EC2 instance. Since you won't be doing any training here and won't need to load the training set, memory usage will be minimal.
Use your embedded device to predict whatever you may need
It's hard to give you an exact answer on latency because you haven't given enough information. The exact latency is highly dependent on your hardware, internet connection, amount of data you'd be transferring, software, etc. If you're only training once on an initial training set, you only need to transfer your weights once and thus latency will be negligible. If you're constantly sending data and training, or doing predictions on the remote server, latency will be higher.
Possible: of course it is.
You can use any kind of RPC to implement this: HTTPS requests, XML-RPC, raw UDP packets, and many more. If you're more interested in low latency and small amounts of data, something UDP-based could be better than TCP, but you'd need to build extra logic for ordering the messages and retrying the lost ones. Alternatively, something like ZeroMQ could help.
As for the latency: only you can answer that, because it depends on where you're connecting from. Start up an instance in the region closest to you and run ping or mtr against it to find out the round-trip time. That's the absolute minimum you can achieve; your processing time goes on top of that.
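As one concrete example of the RPC option, here is a sketch using the standard library's XML-RPC modules; the hostname, port, and the body of predict() are placeholders. The EC2 instance exposes the heavy model, and the Jetson just calls it.

    # Sketch only - on the EC2 instance
    from xmlrpc.server import SimpleXMLRPCServer

    def predict(features):
        # placeholder for loading the CNN once and running a forward pass
        return [sum(row) for row in features]

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(predict, "predict")
    server.serve_forever()

    # Sketch only - on the embedded device
    import xmlrpc.client

    proxy = xmlrpc.client.ServerProxy("http://ec2-host:8000/")   # placeholder host
    result = proxy.predict([[0.1, 0.2, 0.3]])

The round-trip time measured with ping/mtr is the floor; serialization and the model's own compute time add to it.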
I am a former employee of CENAPAD-UFC (National Centre of HPC, Federal University of CearĂ¡), so I have something to say about outsourcing computer power.
CENAPAD has a big cluster that provides computational power for academic research. There, professors and students send their computation and their data, define the desired output, and go drink a coffee or two while the cluster gets on with the hard work. After lots of FLOPs, the job ends and they retrieve the results via ssh and go back to their laptops.
For big chunks of computation, you want to minimize any work that is not useful computation. One such thing is communication between detached computers. If you need to know when the computation has ended, let the HPC machine tell you.
To compute effectively, you may want to go deeper into the machine and perform some kind of distribution. I use OpenMP to distribute computation across threads within the same machine. To distribute between physically separate but nearby (latency-wise) computers, I use MPI. I also installed another cluster at UFC for another department; there, the researchers used only MPI.
Maybe some reading about distributed/grid/cluster computing will help you:
https://en.m.wikipedia.org/wiki/SETI@home ; the first example of massive distributed computing that I ever heard of
https://en.m.wikipedia.org/wiki/Distributed_computing
https://en.m.wikipedia.org/wiki/Grid_computing
https://en.m.wikipedia.org/wiki/Supercomputer ; this is CENAPAD-like stuff
In my opinion, you want a grid-like computation, with your personal PC working as a master node that calls the EC2 slaves. In this scenario, use master-to-slave communication only to send the program (if really needed) and the data, so that the master has other things to do that are unrelated to the sent data; also, let the slave tell your master when its computation has ended.
I intend to make a program structure like the one below.
PS1 is a persistently running Python program. PC1, PC2, PC3 are client Python programs. PS1 holds a hashtable variable; whenever PC1, PC2, ... ask for the hashtable, PS1 passes it to them.
The intention is to keep the table in memory, since it is a huge variable (it takes 10 GB of memory) and is expensive to recalculate every time. It is not feasible to store it on disk (using pickle or json) and read it every time it is needed; the read just takes too long.
So I was wondering if there is a way to keep a Python variable persistently in memory, so it can be accessed very quickly whenever it is needed.
You are trying to reinvent a square wheel, when nice round wheels already exist!
Let's go one level up to how you have described your needs:
one large data set, that is expensive to build
different processes need to use the dataset
performance constraints do not allow simply reading the full set from permanent storage
IMHO, this is exactly what databases were created for. For common use cases, having many processes each using their own copy of a 10 GB object is a waste of memory; the usual approach is for one single process to hold the data while the others send it requests. You did not describe your problem enough, so I cannot say which of these will be the best solution:
a SQL database like PostgreSQL or MariaDB - they can cache, so if you have enough memory, everything will automatically be held in memory
a NoSQL database (MongoDB, etc.) if your only (or main) need is single-key access - very nice when dealing with a lot of data requiring fast but simple access
a dedicated server using a dedicated query language if your needs are very specific and none of the above solutions meets them
a process setting up a huge piece of shared memory that will be used by client processes - that last solution will certainly be fastest provided:
all clients make read-only accesses - it can be extended to r/w accesses but could lead to a synchronization nightmare
you are sure to have enough memory on your system to never use swap - if you do you will lose all the cache optimizations that real databases implement
the size of the database, the number of client processes, and the external load of the whole system never increase to a level where you fall into the swapping problem above
TL/DR: My advice is to experiment with the performance of a good-quality database and optionally a dedicated cache. Those solutions allow almost out-of-the-box load balancing across different machines. Only if that does not work should you carefully analyze the memory requirements, use shared memory, and be sure to document the limits on the number of client processes and database size for future maintenance - read-only data being a hint that shared memory can be a nice solution.
In short, to accomplish what you are asking about, you need to create a byte array as a RawArray from the multiprocessing.sharedctypes module that is large enough for your entire hashtable in the PS1 server, and then store the hashtable in that RawArray. PS1 needs to be the process that launches PC1, PC2, etc., which can then inherit access to the RawArray. You can create your own class of object that provides a hashtable interface through which the individual variables in the table are accessed, and pass an instance of it to each of the PC# processes, which read from the shared RawArray.
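A minimal sketch of that pattern follows; the float array stands in for a serialized hashtable, the sizes are made up, and the fork start method (the Linux default) is assumed so the children inherit the buffer.

    # Sketch only - PS1 builds one shared buffer, PC# processes inherit it
    import numpy as np
    from multiprocessing import Process
    from multiprocessing.sharedctypes import RawArray

    N = 1_000_000                      # stand-in size; the real table is ~10 GB
    shared = RawArray("d", N)          # "d" = C double, allocated in shared memory

    def client(raw):
        table = np.frombuffer(raw, dtype=np.float64)   # zero-copy, read-only use
        print(table[:3])

    if __name__ == "__main__":
        table = np.frombuffer(shared, dtype=np.float64)
        table[:] = np.arange(N)        # the expensive build happens exactly once
        workers = [Process(target=client, args=(shared,)) for _ in range(3)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()

A real hashtable would need your own indexing scheme layered over the raw buffer, as described above; the shared memory only gives you the zero-copy storage.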
Briefly, Octopy and Mincemeatpy are lightweight Python implementations of map-reduce, and clients can join the cluster in an ad-hoc manner without requiring any installation (except, of course, Python). Here are the project details for Octopy and Mincemeatpy.
The problem with these is that they need to hold all the data in memory (including intermediate key-value pairs), so even for moderately sized data they throw out-of-memory exceptions.
The key-reasons I'm using them are:
Python.
No cluster installation required.
I just prototype, and I can directly port the algorithm once I'm ready.
So my question is: is there any package which handles the same stuff, but not just in memory (i.e. which can handle moderately sized data)?
Try PyMapReduce. It runs on your own machine, but across several processes - so you don't need to build a master-node architecture - and it has plenty of runners, for example DiskBasedRunner, which seems to store map output to temp files and reduces them afterwards.
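I can't speak to PyMapReduce's exact API, but the disk-based idea looks roughly like this stdlib-only sketch (word counting stands in for your algorithm): each map task writes its intermediate pairs to a temp file instead of keeping them in memory, and the reduce step streams the files back.

    # Sketch only - not PyMapReduce's API, just the disk-backed map/reduce pattern
    import json
    import tempfile
    from collections import Counter
    from multiprocessing import Pool
    from pathlib import Path

    def map_chunk(args):
        idx, lines = args
        counts = Counter(word for line in lines for word in line.split())
        path = Path(tempfile.gettempdir()) / f"map-{idx}.json"
        path.write_text(json.dumps(counts))        # spill intermediate pairs to disk
        return str(path)

    def reduce_files(paths):
        total = Counter()
        for p in paths:                            # read intermediates back one by one
            total.update(json.loads(Path(p).read_text()))
        return total

    if __name__ == "__main__":
        chunks = [(0, ["a b a"]), (1, ["b c b"])]  # placeholder input chunks
        with Pool() as pool:
            paths = pool.map(map_chunk, chunks)
        print(reduce_files(paths))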
Suppose one is interested in writing a Python app where there should be communication between different processes. The communication will be done by sending strings and/or numpy arrays.
What are the considerations to prefer OpenMPI vs. a tool like RabbitMQ?
There is no single correct answer to such question. It all depends on a big number of different factors. For example:
What kind of communications do you have? Are you sending large packets or small packets, do you need good bandwidth or low latency?
What kind of delivery guarantees do you need?
OpenMPI can instantly deliver messages only to a running process, while different MQ solutions can queue messages and allow fancy producer-consumer configurations.
What kind of network do you have? If you are running on localhost, something like ZeroMQ would probably be the fastest. If you are running on a set of hosts, it depends on the interconnects available. E.g. OpenMPI can utilize InfiniBand/Myrinet links.
What kind of processing are you doing? With MPI all processes are usually started at the same time, do the processing and terminate all at once.
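For a feel of the MPI style, here is a minimal mpi4py sketch (the payload and tags are made up): all ranks are launched together under mpirun and address each other by rank, with lowercase send/recv pickling arbitrary Python objects and uppercase Send/Recv moving numpy buffers without copies.

    # Sketch only - run with: mpirun -n 2 python script.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        comm.send("hello from rank 0", dest=1, tag=1)   # pickled Python object
        array = np.arange(5, dtype=np.float64)
        comm.Send(array, dest=1, tag=2)                 # buffer-based numpy send
    elif rank == 1:
        text = comm.recv(source=0, tag=1)
        array = np.empty(5, dtype=np.float64)
        comm.Recv(array, source=0, tag=2)
        print(text, array)

A message queue, by contrast, lets producers and consumers start and stop independently, which is the decoupling the rest of this answer is pointing at.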
This is exactly the scenario I was in a few months ago and I decided to use AMQP with RabbitMQ using topic exchanges, in addition to memcache for large objects.
The AMQP messages are all strings, in JSON object format so that it is easy to add attributes to a message (like number of retries) and republish it. JSON objects are a subset of JSON that correspond to Python dicts. For instance {"recordid": "272727"} is a JSON object with one attribute. I could have just pickled a Python dict but that would have locked us into only using Python with the message queues.
The large objects don't get routed by AMQP, instead they go into a memcache where they are available for another process to retrieve them. You could just as well use Redis or Tokyo Tyrant for this job. The idea is that we did not want short messages to get queued behind large objects.
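A sketch of that claim-check pattern, with pymemcache and pika as stand-ins (the hosts, queue name, and expiry are my assumptions): the large blob is parked in the cache, and only a small JSON message carrying its key rides through AMQP.

    # Sketch only - publisher side of the claim-check pattern
    import json
    import uuid

    import pika
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="work")

    def publish(large_blob: bytes):
        key = uuid.uuid4().hex
        cache.set(key, large_blob, expire=3600)               # big object parked in memcache
        message = json.dumps({"recordid": key, "retries": 0}) # small JSON rides the queue
        channel.basic_publish(exchange="", routing_key="work", body=message)

The consumer then looks the "recordid" key up in the cache to retrieve the large object.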
In the end, my Python processes ended up using both AMQP and ZeroMQ for two different aspects of the architecture. You may find that it makes sense to use both OpenMPI and AMQP but for different types of jobs.
In my case, a supervisor process runs forever and starts a whole flock of workers who also run forever unless they die or hang, in which case the supervisor restarts them. The work constantly flows in as messages via AMQP, and each process handles just one step of the work, so that when we identify a bottleneck we can run multiple instances of that process, possibly on separate machines, to remove the bottleneck. In my case, I have 15 instances of one process, 4 of two others, and about 8 other single instances.