Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm posting a more concise version of my question here, since the original got flagged for being too broad.
I'm looking for a way, either native python or a framework, which will allow me to do the following:
Publish a webservice which an end customer can call like any other standard webservice (using curl, postman, requests, etc.)
This webservice will be accepting gigabytes (perhaps 10s of GB) of data per call.
While this data is being transmitted, I'd like to break it into chunks and spin off separate threads and/or processes to simultaneously work with it (my processing is complex but each chunk will be independent and self-contained)
Doing this will allow my logic to be running in parallel with the data upload across the internet, and avoid wasting all that time while the data is just being transmitted
It will also prevent the gigabytes (or tens of gigabytes) of data from being loaded entirely into RAM before my logic even begins.
Original Question:
I'm trying to build a web service (in Python) which can accept potentially tens of gigabytes of data and process this data. I don't want this to be completely received and built into an in-memory object before passing to my logic as a) this will use a ton of memory, and b) the processing will be pretty slow and I'd love to have a processing thread working on chunks of the data while the rest of the data is being received asynchronously.
I believe I need some sort of streaming solution for this but I'm having trouble finding any Python solution to handle this case. Most things I've found are about streaming the output (not an issue for me). Also it seems like wsgi has issues by design with a data streaming solution.
Is there a best practice for this sort of issue which I'm missing? And/or, is there a solution that I haven't found?
Edit: Since a couple of people asked, here's an example of the sort of data I'd be looking at. Basically I'm working with lists of sentences, which may be millions of sentences long. But each sentence (or group of sentences, for ease) is a separate processing task. Originally I had planned on receiving this as a json array like:
{"sentences": [
"here's a sentence",
"here's another sentence",
"I'm also a sentence"
]
}
For this modification I'm thinking it would just be newline-delimited sentences, since I don't really need the JSON structure. So in my head, my solution would be: I get a constant stream of characters, and whenever I get a newline character, I split off the previous sentence and pass it to a worker thread or thread pool for processing. I could also work in groups of many sentences to avoid having a ton of threads going at once. But the main thing is that, while the main thread is receiving this character stream, it periodically splits off tasks so other threads can start the processing.
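To make that idea concrete, here is a rough sketch of the splitting and batching step using only the standard library. It assumes the server framework can hand me the request body as an iterable of byte chunks (`chunks` below); `process_batch` is a hypothetical stand-in for my real processing, and the batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def iter_batches(chunks, batch_size):
    """Yield lists of complete sentences from an iterable of byte chunks."""
    buffer = b""
    batch = []
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the tail for later.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line:
                batch.append(line.decode("utf-8"))
            if len(batch) >= batch_size:
                yield batch
                batch = []
    if buffer:
        # Final sentence that arrived without a trailing newline.
        batch.append(buffer.decode("utf-8"))
    if batch:
        yield batch


def process_batch(batch):
    # Hypothetical stand-in for the real (proprietary) processing.
    return [sentence.upper() for sentence in batch]


def process_stream(chunks, batch_size=2, workers=4):
    """Submit batches to a thread pool while the stream is still being read."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_batch, batch)
                   for batch in iter_batches(chunks, batch_size)]
        # Collect in submission order so the sentence order is preserved.
        return [result for future in futures for result in future.result()]
```

Only the unfinished tail of the current sentence is buffered, so memory use for the input stays small. The list of futures still grows with the number of batches, so a real version would need to drain results as it goes.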
Second Edit: I've had a few thoughts on how to process the data. I can't give tons of specific details as it's proprietary, but I could either store the sentences as they come in into ElasticSearch or some other database, and have an async process working on that data, or (ideally) I'd just work with the sentences (in batches) in memory. Order is important, and also not dropping any sentences is important. The inputs will be coming from customers over the internet though, so that's why I'm trying to avoid a message queue like process, so there's not the overhead of a new call for each sentence.
Ideally, the customer of the webservice doesn't have to do anything particularly special other than do the normal POST request with a gigantic body, and all this special logic is server-side. My customers won't be expert software engineers so while a webservice call is perfectly within their wheelhouse, handling a more complex message queue process or something along those lines isn't something I want to impose on them.
Unless you share a little more of the type of data, processing or what other constraints your problem has, it's going to be very difficult to provide more tailored advice than maybe pointing you to a couple resources.
... Here is my attempt, hope it helps!
It seems like what you need is the following:
A message passing system vs streaming system in order to deliver/receive the data
Optionally, an asynchronous task queue to fire up different processing tasks on the data
or even a custom data processing pipeline system
Messaging vs Streaming
Examples: RabbitMQ, Kombu (per @abolotnov's comment), Apache Kafka (and python ports), Faust
The main differences between messaging and streaming can vary on the system/definition/who you ask, but in general:
- messaging: a "simple" system that will take care of sending/receiving single messages between two processes
- streaming adds functionality like the ability to "replay", send mini-batches of groups of messages, process rolling windows, etc.
Messaging systems may also implement broadcasting (sending a message to all receivers) and publish/subscribe scenarios, which come in handy if you don't want your publisher (the creator of the data) to keep track of who to send the data to (the subscribers), or alternatively don't want your subscribers to keep track of whom to get the data from and when.
Asynchronous task queue
Examples: Celery, RQ, Taskmaster
This will basically help you assign a set of tasks that may be the smaller chunks of the main processing you are intending to do, and then make sure these tasks get performed whenever new data pops up.
Custom Data Processing Systems
I mainly have one in mind: Dask (official tutorial repo)
This is a system created for very much the situation you seem to have on your hands: large amounts of information emerging from some source (which may or may not be fully under your control) that need to flow through a set of processing steps in order to be consumable by some other process (or stored).
Dask is kind of a combination of the previous, in that you define a computation graph (or task graph) with data sources and computation nodes that connect, some of which may depend on other nodes. Later, depending on the system you deploy it on, you can choose synchronous execution or one of several asynchronous schedulers for running the tasks, while keeping this run-time implementation detail separate from the actual tasks to be performed. This means you could deploy on your computer, but later decide to deploy the same pipeline on a cluster, and you would only need to change the "settings" of this run-time implementation.
Additionally, Dask basically imitates numpy / pandas / pyspark or whatever data processing framework you may be already using, so the syntax will be (almost in every case) virtually the same.
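To illustrate the "task graph separate from how it is run" idea without pulling in any library, here is a toy sketch in plain Python. It is not Dask's API, only loosely modelled on the concept; `run_graph`, the node names and the helper functions are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def is_task(value):
    # A task is a tuple whose first element is callable: (func, dep_name, ...).
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])


def run_graph(graph, key, pool=None):
    """Compute graph[key]; with a pool, independent tasks run concurrently."""
    results = {}
    pending = dict(graph)

    def compute(name):
        value = pending[name]
        if not is_task(value):
            return value  # plain data node
        func, *deps = value
        return func(*(results[dep] for dep in deps))

    while key not in results:
        # A node is ready once every node it depends on has a result.
        ready = [name for name, value in pending.items()
                 if not is_task(value) or all(dep in results for dep in value[1:])]
        if not ready:
            raise ValueError("cycle or missing dependency in task graph")
        runner = map if pool is None else pool.map
        for name, result in zip(ready, list(runner(compute, ready))):
            results[name] = result
        for name in ready:
            del pending[name]
    return results[key]


def tokenize(lines):
    return [line.split() for line in lines]


def count_words(tokens):
    return sum(len(words) for words in tokens)


def longest(tokens):
    return max((word for words in tokens for word in words), key=len)


graph = {
    "raw": ["here's a sentence", "here's another sentence"],
    "tokens": (tokenize, "raw"),
    "word_count": (count_words, "tokens"),
    "longest_word": (longest, "tokens"),
}
```

The same `graph` can be run synchronously with `run_graph(graph, "word_count")` or with a thread pool by passing `pool=ThreadPoolExecutor()`. Only the "settings" change, not the task definitions.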
Related
I've run into a specific problem and thought of a solution. But since the solution is pretty involved, I was wondering if others have encountered something similar and could comment on best practices or propose alternatives.
The problem is as follows:
I have a webapp written in Django which has some screen in which data from multiple tables is collected, grouped and aggregated in time intervals.
It's basically a big excel like matrix where we have data aggregated in time intervals on one axis, against resources for the aggregated data per interval on the other axis.
It involves many inner and left joins to gather all data, and because of the "report" like character of the presented data, I use raw sql to query everything together.
The problem is that multiple users can concurrently view and edit data in these intervals. They can also edit data at finer or coarser granularities than other users working with the same data, but in sub-intervals or overlapping intervals. Currently, when a user edits some data, a Django request is fired, the data is altered, and the affected intervals are aggregated and grouped again and presented back. But because of the volatile nature of this data, other users might have changed something before them. Also, grouping/aggregating and re-rendering the table each time is a very heavy operation (depending on the amount of data and the range of the intervals). This gets worse with concurrent users editing.
My proposed solution:
It's clear that an HTTP request/response mechanism is not really ideal for this kind of thing: the grouping/aggregation is pretty heavyweight and not ideal to do per request, the concurrency would ideally be channeled amongst users, and feedback should be real-time like Google Docs instead of full page refreshes.
I was thinking about making a daemon process which reads in flat data of interest from the dbms on request and caches this in memory. All changes to the data would then occur in memory with a write-through to the dbms. This daemon channels access to the data through a lock, so the daemon can handle which users can overwrite others' changes.
The flat data is aggregated and grouped using python code and only the slices required by the user are returned; user/daemon communication would run over websockets. The daemon would provide a subscriber/publisher channel, where users interested in specific slices of data are notified when something changes. This daemon could be implemented using a framework like twisted. But I'm not sure an event driven approach would work here, as we want to "channel" all incoming requests... Maybe these should be put in a queue and be run in a separate thread? Would it be better to have twisted run in a thread next to my scheduler, or should the twisted main loop spin off a thread that works on this queue? My understanding is that threading works best for IO, and python heavy code basically blocks other threads. I have both (websockets/dbms and processing data), would that work?
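Roughly, the core of the daemon I have in mind would look something like the sketch below (plain threading, no websockets or twisted yet). The class name, the `write_through` callable and the slice keys are placeholders I made up for illustration.

```python
import threading
from collections import defaultdict


class SliceCache:
    """In-memory cache with a lock, write-through, and per-slice subscribers."""

    def __init__(self, write_through):
        self._lock = threading.Lock()
        self._data = {}
        self._subscribers = defaultdict(list)
        self._write_through = write_through  # e.g. a function that updates the dbms

    def subscribe(self, key, callback):
        with self._lock:
            self._subscribers[key].append(callback)

    def update(self, key, value):
        with self._lock:
            # Change the in-memory copy and persist it while holding the lock,
            # so concurrent edits to the same slice are serialized.
            self._data[key] = value
            self._write_through(key, value)
            callbacks = list(self._subscribers[key])
        # Notify outside the lock so a slow subscriber cannot block writers.
        for callback in callbacks:
            callback(key, value)
```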
Has anyone done something similar before?
Thanks in advance!
Karl
The scheme Google implemented for the now abandoned Wave product's concurrent editing features is documented, http://www.waveprotocol.org/whitepapers/operational-transform. This aspect of Wave seemed like a success, even though Wave itself was quickly abandoned.
As far as the questions you asked about implementing your proposed scheme:
An event driven system is perfectly capable of implementing this idea. Being event driven is a way to organize your code. It doesn't prevent you from implementing any particular functionality.
Threading doesn't work best for very much, particularly in Python.
It has significant disadvantages for CPU-bound work, since CPython only runs a single Python thread at a time (regardless of available hardware resources). This means a multi-threaded CPU-bound Python program is typically no faster, or even slower, than the single-threaded equivalent.
For IO, this shortcoming is less of a limitation, because IO does not involve running Python code on CPython (the IO APIs are all implemented in C). This means you can do IO in multiple threads concurrently, so threading is potentially a benefit. However, doing IO concurrently in a single thread is exactly what Twisted is for. Threading offers no benefits over doing the IO in a single thread, as long as you're doing the IO non-blockingly (or perhaps asynchronously).
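As a small illustration of concurrent IO in a single thread, here is a sketch that uses the standard library's asyncio as a stand-in for Twisted (my substitution, not something from the question). `asyncio.sleep` plays the role of a non-blocking network or database call, and the names and delays are arbitrary.

```python
import asyncio
import time


async def fake_io(name, delay):
    # asyncio.sleep stands in for a non-blocking network or database call.
    await asyncio.sleep(delay)
    return name


async def main():
    # All three waits overlap on one thread, so the total time is roughly
    # the longest single delay rather than the sum of the delays.
    return await asyncio.gather(
        fake_io("a", 0.2), fake_io("b", 0.2), fake_io("c", 0.2)
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
```

Run sequentially, the three calls would take about 0.6 seconds. Here they finish in roughly 0.2 seconds without any threads.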
Hello world.
I tried something similar and you might be interested in the solution. Here is my question:
python Socket.IO client for sending broadcast messages to TornadIO2 server
And this is the answer:
https://stackoverflow.com/a/10950702/675065
He also wrote a blog post about the solution:
http://blog.y3xz.com/blog/2012/06/08/a-modern-python-stack-for-a-real-time-web-application/
The software stack consists of:
SockJS Client
SockJS Tornado Server
Redis Pub/Sub
Django Redis Client: Brukva
I implemented this myself and it works like a charm.
Suppose that one is interested to write a python app where there should be communication between different processes. The communications will be done by sending strings and/or numpy arrays.
What are the considerations to prefer OpenMPI vs. a tool like RabbitMQ?
There is no single correct answer to such a question. It all depends on a large number of different factors. For example:
What kind of communications do you have? Are you sending large packets or small packets, do you need good bandwidth or low latency?
What kind of delivery guarantees do you need?
OpenMPI can instantly deliver messages only to a running process, while different MQ solutions can queue messages and allow fancy producer-consumer configurations.
What kind of network do you have? If you are running on localhost, something like ZeroMQ would probably be the fastest. If you are running on a set of hosts, it depends on the interconnects available. E.g. OpenMPI can utilize InfiniBand/Myrinet links.
What kind of processing are you doing? With MPI all processes are usually started at the same time, do the processing and terminate all at once.
This is exactly the scenario I was in a few months ago and I decided to use AMQP with RabbitMQ using topic exchanges, in addition to memcache for large objects.
The AMQP messages are all strings, in JSON object format so that it is easy to add attributes to a message (like number of retries) and republish it. JSON objects are a subset of JSON that correspond to Python dicts. For instance {"recordid": "272727"} is a JSON object with one attribute. I could have just pickled a Python dict but that would have locked us into only using Python with the message queues.
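A minimal sketch of the retry-counter idea, leaving out the actual AMQP publishing. The `retries` attribute name and the limit of 3 are just assumptions for the example.

```python
import json

MAX_RETRIES = 3  # hypothetical limit


def prepare_retry(body):
    """Return the message body to republish, or None once retries are exhausted."""
    message = json.loads(body)
    retries = message.get("retries", 0) + 1
    if retries > MAX_RETRIES:
        return None
    # Adding an attribute does not disturb consumers that ignore it.
    message["retries"] = retries
    return json.dumps(message)
```

Because the body is plain JSON, a consumer written in another language can read and update the same attribute.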
The large objects don't get routed by AMQP, instead they go into a memcache where they are available for another process to retrieve them. You could just as well use Redis or Tokyo Tyrant for this job. The idea is that we did not want short messages to get queued behind large objects.
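In sketch form, with a plain dict standing in for memcache and a `send` callable standing in for the AMQP publish (both are placeholders, as are the key format and field names):

```python
import json
import uuid

object_store = {}  # stand-in for memcache, Redis, etc.


def publish_large(payload, send):
    """Store the payload out of band and send only a short reference to it."""
    key = "obj:" + uuid.uuid4().hex
    object_store[key] = payload
    send(json.dumps({"object_key": key, "size": len(payload)}))
    return key


def consume(body):
    """Resolve the reference in a message back into the stored payload."""
    message = json.loads(body)
    return object_store.pop(message["object_key"])
```

The queued message stays a few dozen bytes no matter how big the object is, so short messages never wait behind a large transfer.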
In the end, my Python processes ended up using both AMQP and ZeroMQ for two different aspects of the architecture. You may find that it makes sense to use both OpenMPI and AMQP but for different types of jobs.
In my case, a supervisor process runs forever and starts a whole flock of workers which also run forever unless they die or hang, in which case the supervisor restarts them. The work constantly flows in as messages via AMQP, and each process handles just one step of the work, so that when we identify a bottleneck we can have multiple instances of the process, possibly on separate machines, to remove the bottleneck. In my case, I have 15 instances of one process, 4 of two others, and about 8 other single instances.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
I'm looking for a python library or a command line tool for downloading multiple files in parallel. My current solution is to download the files sequentially, which is slow. I know you can easily write a half-assed threaded solution in python, but I always run into annoying problems when using threading. It is for polling a large number of xml feeds from websites.
My requirements for the solution are:
Should be interruptible. Ctrl+C should immediately terminate all downloads.
There should be no leftover processes that you have to kill manually using kill, even if the main program crashes or an exception is thrown.
It should work on Linux and Windows too.
It should retry downloads, be resilient against network errors and should timeout properly.
It should be smart about not hammering the same server with 100+ simultaneous downloads, but queue them in a sane way.
It should handle important http status codes like 301, 302 and 304. That means that for each file, it should take the Last-Modified value as input and only download if it has changed since last time.
Preferably it should have a progress bar or it should be easy to write a progress bar for it to monitor the download progress of all files.
Preferably it should take advantage of http keep-alive to maximize the transfer speed.
Please don't suggest how I may go about implementing the above requirements. I'm looking for a ready-made, battle-tested solution.
I guess I should describe what I want it for too... I have about 300 different data feeds as xml formatted files served from 50 data providers. Each file is between 100kb and 5mb in size. I need to poll them frequently (as in once every few minutes) to determine if any of them has new data I need to process. So it is important that the downloader uses http caching to minimize the amount of data to fetch. It also uses gzip compression obviously.
Then the big problem is how to use the bandwidth as efficiently as possible without overstepping any boundaries. For example, one data provider may consider it abuse if you open 20 simultaneous connections to their data feeds. Instead it may be better to use one or two connections that are reused for multiple files. Or your own connection may be limited in strange ways. My ISP limits the number of DNS lookups you can do, so some kind of DNS caching would be nice.
You can try pycurl. The interface is not easy at first, but once you look at examples, it's not hard to understand. I have used it to fetch 1000s of web pages in parallel on a meagre Linux box.
You don't have to deal with threads, so it terminates gracefully, and there are no processes left behind
It provides options for timeout, and http status handling.
It works on both linux and windows.
The only problem is that it provides a basic infrastructure (basically just a python layer above the excellent curl library). You will have to write a few lines to achieve the features you want.
There are lots of options but it will be hard to find one which fits all your needs.
In your case, try this approach:
Create a queue.
Put URLs to download into this queue (or "config objects" which contain the URL and other data like the user name, the destination file, etc).
Create a pool of threads
Each thread should try to fetch a URL (or a config object) from the queue and process it.
Use another thread to collect the results (i.e. another queue). When the number of result objects == number of puts in the first queue, then you're finished.
Make sure that all communication goes via the queue or the "config object". Avoid accessing data structures which are shared between threads. This should save you 99% of the problems.
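The steps above can be sketched roughly like this with the standard library. `fetch` is a hypothetical stand-in for the real download, and for brevity the results are collected in the main thread rather than in a dedicated collector thread.

```python
import queue
import threading


def fetch(url):
    # Hypothetical stand-in for the real download (urllib, pycurl, ...).
    return "contents of " + url


def worker(tasks, results):
    while True:
        url = tasks.get()
        if url is None:  # sentinel: no more work for this thread
            break
        try:
            results.put((url, fetch(url), None))
        except Exception as exc:
            # Report failures through the queue instead of shared state.
            results.put((url, None, exc))


def download_all(urls, num_threads=4):
    urls = list(urls)
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results), daemon=True)
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    collected = {}
    # Finished when the number of results equals the number of puts.
    for _ in urls:
        url, body, error = results.get()
        collected[url] = body if error is None else error
    for thread in threads:
        thread.join()
    return collected
```

The threads only ever touch the two queues, which is what keeps the usual threading problems away.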
I don't think such a complete library exists, so you'll probably have to write your own. I suggest taking a look at gevent for this task. They even provide a concurrent_download.py example script. Then you can use urllib2 for most of the other requirements, such as handling HTTP status codes, and displaying download progress.
I would suggest Twisted, although it is not a ready made solution, but provides the main building blocks to get every feature you listed in an easy way and it does not use threads.
If you are interested, take a look at the following links:
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#getPage
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#downloadPage
As per your requirements:
Supported out of the box
Supported out of the box
Supported out of the box
Timeout supported out of the box, other error handling done through deferreds
Achieved easily using cooperators (example 7)
Supported out of the box
Not supported, solutions exist (and they are not that hard to implement)
Not supported, it can be implemented (but it will be relatively hard)
Nowadays there are excellent Python libs you might want to use - urllib3 and requests
Try using aria2 through Python's simple subprocess module.
It provides all requirements from your list, except 7, out of the box, and 7 is easy to write.
aria2c has a nice xml-rpc or json-rpc interface to interact with it from your scripts.
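A minimal sketch of driving aria2c from subprocess, assuming aria2c is installed and on the PATH. The file names and limits are made up, and you should check the option list against the aria2c manual for your version.

```python
import subprocess


def build_aria2_command(url_file, out_dir, parallel=5, per_server=2):
    """Build an aria2c command line for downloading every URL listed in url_file."""
    return [
        "aria2c",
        "--input-file=" + url_file,                       # one URL per line
        "--dir=" + out_dir,                               # where downloads are saved
        "--max-concurrent-downloads=" + str(parallel),    # overall parallelism
        "--max-connection-per-server=" + str(per_server), # be polite to each host
    ]


def run_downloads(url_file, out_dir):
    # Raises CalledProcessError if aria2c exits with a non-zero status.
    subprocess.run(build_aria2_command(url_file, out_dir), check=True)
```

For example, `run_downloads("feeds.txt", "downloads")` would fetch everything listed in the (hypothetical) feeds.txt with at most two connections per server.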
Does urlgrabber fit your requirements?
http://urlgrabber.baseurl.org/
If it doesn't, you could consider volunteering to help finish it. Contact the authors, Michael Stenner and Ryan Tomayko.
Update: Googling for "parallel wget" yields these, among others:
http://puf.sourceforge.net/
http://www.commandlinefu.com/commands/view/3269/parallel-file-downloading-with-wget
It seems like you have a number of options to choose from.
I used the standard libs for that, urllib.urlretrieve to be precise. I downloaded podcasts this way, via a simple thread pool, each thread using its own retrieve. I did about 10 simultaneous connections; more should not be a problem. Continuing an interrupted download, maybe not. Ctrl-C could be handled, I guess. It worked on Windows, and I installed a handler for progress bars. All in all 2 screens of code, plus 2 screens for generating the URLs to retrieve.
This seems pretty flexible:
http://keramida.wordpress.com/2010/01/19/parallel-downloads-with-python-and-gnu-wget/
Threading isn't "half-assed" unless you're a bad programmer. The best general approach to this problem is the producer / consumer model. You have one dedicated URL producer, and N dedicated download threads (or even processes if you use the multiprocessing model).
As for all of your requirements, ALL of them CAN be done with the normal python threaded model (yes, even catching Ctrl+C -- I've done it).
I have a network of 100 machines, all running Ubuntu Linux.
On a continuous (streaming) basis, machine X is 'fed' with some real-time data. I need to write a python script that would get the data as input, load it in-memory, process it, and then save it to disk.
It's a lot of data, hence, I would ideally want to split the data in memory (using some logic) and just send pieces of it to each individual computer, in the fastest possible way. each individual computer will accept its piece of data, handle it and write it to its local disk.
Suppose I have a container of data in Python (be it a list, a dictionary etc), already processed and split to pieces. What is the fastest way to send each 'piece' of data to each individual machine?
You should take a look at pyzmq:
http://www.zeromq.org/bindings:python
and general guides to zeromq (0mq)
http://nichol.as/zeromq-an-introduction
http://www.zeromq.org/
You have two (classes of) choices:
You could build some distribution mechanism yourself.
You could use an existing tool to handle the distribution and storage.
In the simplest case, you write a program on each machine in your network that simply listens, processes and writes. You distribute from X to each machine in your pool round-robin. But, you might want to address higher-level concerns like handling node failures or dealing with requests that take longer to process than others, adding new nodes to the system, etc.
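The round-robin part is only a few lines. Here is a sketch that decides which piece goes to which machine; the actual sending (sockets, ZeroMQ, a queue, ...) is left out, and the machine names are placeholders.

```python
from itertools import cycle


def assign_round_robin(pieces, machines):
    """Map each machine to the list of pieces it should receive."""
    assignment = {machine: [] for machine in machines}
    # cycle() repeats the machine list for as long as there are pieces left.
    for piece, machine in zip(pieces, cycle(machines)):
        assignment[machine].append(piece)
    return assignment
```

Round-robin assumes every piece costs about the same to process. If that doesn't hold, slower nodes fall behind, which is one of the higher-level concerns mentioned above.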
As you want more functionality, you'll probably want to find some existing tool to help you. It sounds like you might want to investigate some combinations of AMQP (for reliable messaging), Hadoop (for distributed data processing) or more complete NoSQL solutions like Cassandra or Riak. By leveraging these tools, your system will be significantly more robust than what you could probably build out yourself.
What you want is a message queue like RabbitMQ. It is easy to add consumers and producers to a queue. Consumer can either poll or get notified through a callback...
My question is: which python framework should I use to build my server?
Notes:
This server talks HTTP with its clients: GET and POST (via pyAMF)
Clients "submit" "tasks" for processing and, then, sometime later, retrieve the associated "task_result"
submit and retrieve might be separated by days - different HTTP connections
The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
When a server gets a "task", it queues it for processing
The server manages this queue and, when tasks get to the top, organises that they are processed.
the processing is performed by a long running (15 mins?) external program (via subprocess) which is fed the task XML and which produces a "task_result" lump of XML which the server picks up and stores (for later Client retrieval).
it serves a couple of basic HTML pages showing the Queue and processing status (admin purposes only)
I've experimented with twisted.web, using SQLite as the database and threads to handle the long running processes.
But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?
I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.
Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)
By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
Some popular python implementations of message queues:
http://code.google.com/p/stomper/
http://code.google.com/p/pyactivemq/
http://xph.us/software/beanstalkd/
I'd suggest the following. (Since it's what we're doing.)
A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.
I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
A simple SQLite database can track request status. Even this may be overkill because your XML inputs and results can just lie around in the file system.
That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone, child process.
If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
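A sketch of such a governor, under the assumption that each request maps to one command line for the external program. The class and attribute names are mine, and `start` defaults to subprocess.Popen but can be swapped out for testing.

```python
import subprocess


class Governor:
    """Hold back requests so that at most max_children children run at once."""

    def __init__(self, max_children, start=subprocess.Popen):
        self.max_children = max_children
        self.start = start   # callable that launches a child for a command
        self.waiting = []    # commands not yet started
        self.running = []    # children that have been started

    def submit(self, command):
        self.waiting.append(command)
        self.poll()

    def poll(self):
        # Keep only children still running (poll() returns None until exit).
        self.running = [child for child in self.running if child.poll() is None]
        # Start waiting commands, oldest first, while there is room.
        while self.waiting and len(self.running) < self.max_children:
            self.running.append(self.start(self.waiting.pop(0)))
```

Calling `poll()` from the periodic check in the serve loop is enough to keep the children topped up.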
My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing your personal pain-points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.
Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys set up already.
You can have a look at celery
It seems any python web framework will suit your needs. I work with a similar system on a daily basis and I can tell you, your solution with threads and SQLite for queue storage is about as simple as you're going to get.
Assuming order doesn't matter in your queue, then threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If this is the case, I'd suggest a single threaded application to do the items in the queue one by one.