I'm working through possible architectures for a problem. In as few words as possible: I need to design a system that lets clients connect over HTTP/REST to kick off long-running processes. Each process will open a persistent connection to a third-party server and write the received data to a queue. Each process terminates only if the third-party server closes the connection or another HTTP/REST request indicates it should be terminated.
Constraints and background:
Clients must be able to connect using HTTP/REST
System must be written in Python
I'm a lower-level C guy (with enough Python experience to feel competent), but I'm trying to wrap my head around the Python frameworks available for making this easier. My gut is to jump into the weeds, and I know that if I implement this as I'm thinking, I might as well have written it in C. I don't want that. I want to leverage as many Python frameworks and libraries as possible. Performance is not a top priority.
Approaches I've considered:
In doing research, I came across Twisted, which might be a fit and makes sense to me (thinking of this as a daemon). I imagine the final product would be a Twisted app that exposes a REST interface, dispatches a new thread connecting to the third-party service for each client request received, and manages its own thread pool. I'm familiar with threading, though admittedly I haven't done anything with threads in Python yet. In a nutshell, Twisted looks very cool, though in the end I'm left wondering if I'm overcomplicating this.
The second approach I considered is using Celery and Flask and simply letting Celery handle all the dispatching, thread management, etc. I found an article showing Celery and Flask playing nicely together. It seems like a much simpler approach.
After writing this, I'm leaning towards the second option of Celery and Flask, though I don't know much about Celery, so I'm looking for any advice you might have, as well as other possible architectures I'm not considering. I really appreciate it and thank you in advance.
Yes, Twisted is overkill here.
From what you described, the combination of Celery and Flask would suffice. It would allow you to implement a REST interface that kicks off your long-running processes as Celery tasks. You can easily implement a REST method that lets clients stop a running task by invoking Celery's revoke method on the task's ID. Note that Celery depends on a message broker for sending and receiving messages (frequently RabbitMQ) and a result backend for storing results (frequently Redis).
>>> from celery.task.control import revoke
>>> revoke(task_id, terminate=True)  # terminate=True also ends a task that is already running
http://docs.celeryproject.org/en/latest/userguide/workers.html#revoking-tasks
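To make this concrete, here is a minimal sketch of a Flask REST interface wired to Celery. The broker/backend URLs, the route layout, and stream_from_third_party are assumptions for illustration, not a drop-in implementation:

from celery import Celery
from flask import Flask, jsonify

flask_app = Flask(__name__)
celery_app = Celery(__name__,
                    broker='amqp://guest@localhost//',    # RabbitMQ as broker
                    backend='redis://localhost:6379/0')   # Redis as result backend

@celery_app.task
def stream_from_third_party(resource_id):
    # Hypothetical long-running work: hold the persistent connection to the
    # third-party server and push received data onto a queue until it closes.
    ...

@flask_app.route('/streams/<resource_id>', methods=['POST'])
def start_stream(resource_id):
    result = stream_from_third_party.delay(resource_id)
    return jsonify({'task_id': result.id}), 202

@flask_app.route('/streams/<task_id>', methods=['DELETE'])
def stop_stream(task_id):
    # terminate=True also kills a task that has already started running
    celery_app.control.revoke(task_id, terminate=True)
    return '', 204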
What I want to achieve is to run some Python script in the background that collects data and inserts it into a DB.
So basically, a person opens a Django view, clicks a button, and then closes the browser; Django launches the script on the server, and the script then collects data in the background while everything else carries on.
What is the best library, framework, module or package to achieve such functionality?
Celery is the most widely used tool for such tasks.
Celery is a good suggestion, but it's a fairly heavyweight solution, and simpler, more straightforward options exist unless you need the full power of Celery.
So I suggest using RQ and its Django integration, django-rq.
RQ is inspired by the good parts of Celery and Resque, and was created as a lightweight alternative to the heaviness of Celery and other AMQP-based queueing implementations.
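To give a sense of how little ceremony RQ needs, here is a minimal sketch (assumes a Redis server on localhost; collect_data is a hypothetical function):

# tasks.py -- RQ requires the job function to live in an importable module
from redis import Redis
from rq import Queue

def collect_data():
    ...  # the long-running collection work goes here

q = Queue(connection=Redis())   # the 'default' queue on localhost:6379
job = q.enqueue(collect_data)   # picked up by a worker started with `rq worker`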
I'd humbly recommend the standard library module multiprocessing for this. As long as the background process can run on the same server as the one processing the requests, you'll be fine.
Although I consider this the simplest solution, it wouldn't scale well at all, since you'd be running extra processes on your server. If you expect these jobs to happen only once in a while, and not to last that long, it's a good quick solution.
One thing to keep in mind, though: in the newly started process, ALWAYS close your database connection before doing anything else. The forked process shares the parent's connection to the SQL server and can otherwise race with your main Django process.
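A minimal sketch of that advice (the view and collect_data are hypothetical; assumes Django 1.8+ for connections.close_all()):

import multiprocessing

from django import db
from django.http import HttpResponse

def collect_data():
    # ALWAYS close inherited DB connections first: the forked child shares
    # the parent's socket to the SQL server. Django transparently reopens a
    # fresh connection on the next query.
    db.connections.close_all()
    ...  # long-running collection and insert work

def start_collection(request):
    proc = multiprocessing.Process(target=collect_data)
    proc.start()
    return HttpResponse(status=202)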
I want to create two applications in Python which should communicate with each other. One of these applications should behave like a server and the second should be the GUI of a client. They could run on the same machine or remotely on different devices.
I want to ask which technology I should use: AMQP messaging (like RabbitMQ), a Twisted-style server (or Tornado), or ZeroMQ, and connect the applications to it. In the future I would like to have some kind of authentication, etc.
I have read a lot of questions and articles (like this one: Why do we need to use rabbitmq), and a lot of people say "rabbitmq and twisted are different". I know they are. I would really love to know the differences and why one of these solutions would be superior to the other in this case.
EDIT:
I want to use it with following requirements:
There will be more than one user connected at a time; I think there will be 1-10 users connected to the same program, working collaboratively.
The data sent are "messages" describing what a user did, something like remote calls (but don't focus on that, because the GUIs can be written in different languages, so the messages will be something like JSON payloads).
The system should allow for collaborative work, so it should be as interactive as possible (data will be sent continuously as a user types or performs an action).
Additionally, I would love to hear why one solution would be better than the other, not only in this particular case.
Twisted is used to solve the C10k networking problem by giving you asynchronous networking through the reactor pattern. It's also convenient because it provides a nice concurrency abstraction, as threading/concurrency in Python is not as easy as in, say, Erlang. Consequently some people use Twisted to dispatch work tasks, but that's not what it is designed for.
RabbitMQ is based on the message queue pattern. It's all about reliable message passing and is not about networking. I stress the reliable part, as there are many asynchronous networking frameworks (Vert.x, for example) that also provide message passing (also known as pub/sub).
More often than not, people combine the two patterns to create a "message bus" that serves a variety of networking needs without unnecessary network blocking, and with great integration and scalability.
The reason a "message queue" goes so well with a networking "reactor loop" is that you must not block the reactor loop, so you have to dispatch blocking work to some other process (thread, lwp, separate machine process, queue, etc.). In practice, the cleanest way to do this is distributed message passing.
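To make the "message bus" idea concrete, here is a minimal sketch of publishing a user-action message to RabbitMQ with pika (the queue name and payload are assumptions):

import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
channel.queue_declare(queue='user_actions', durable=True)   # survive broker restarts

# the GUI publishes what the user did; the server consumes from the same queue
channel.basic_publish(exchange='',
                      routing_key='user_actions',
                      body=json.dumps({'user': 'alice', 'action': 'typed', 'text': 'hi'}))
conn.close()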
Based on your requirements, it sounds like you should use asynchronous networking if you want the results to show up instantly and scale, but you could probably get away with a simple system that just polls, given you only have a handful of clients. So the questions are: how many total users (Twisted)? And how reliably do you want the updates delivered (RabbitMQ)? Finally, do you want your architecture to be language- and platform-agnostic? Maybe you want to use Node.js later (then focus on the message queue instead of async networking, i.e. RabbitMQ). Personally, I would take a look at Vert.x, which allows you to write in Python.
When someone tells you that Twisted and RabbitMQ are different, it's because comparing the two is like comparing things with different targets.
Twisted is an asynchronous framework, like Tornado. RabbitMQ is a message queue system. You can't compare them directly.
You should split your question into two new questions. First: which protocol should I use for communication between my processes? The answer will involve things like AMQP, Protocol Buffers, and so on.
Second: which framework should I use to write my client and server programs? Here the answer may fall on Twisted, Tornado, and the like.
Currently I'm working on a Python project that requires implementing some background jobs (mostly for email sending and heavy database updates). I use Redis as the task broker, so at this point I have two candidates: Celery and RQ. I have some experience with these job queues, but I want to ask you to share your experience of using these tools. So:
What are the pros and cons of using Celery vs. RQ?
Any examples of projects/tasks where one is a better fit than the other?
Celery looks pretty complicated, but it's a full-featured solution. Actually, I don't think I need all of those features. On the other side, RQ is very simple (e.g. configuration, integration), but it seems to lack some useful features (e.g. task revoking, code auto-reloading).
Here is what I have found while trying to answer this exact same question. It's probably not comprehensive, and may even be inaccurate on some points.
In short, RQ is designed to be simpler all around. Celery is designed to be more robust. They are both excellent.
Documentation. RQ's documentation is comprehensive without being complex, and mirrors the project's overall simplicity; you never feel lost or confused. Celery's documentation is also comprehensive, but expect to revisit it quite a lot when you're first setting things up, as there are too many options to internalize.
Monitoring. Celery's Flower and the RQ dashboard are both very simple to set up and give you at least 90% of all the information you would ever want.
Broker support. Celery is the clear winner; RQ only supports Redis. This means less documentation on "what is a broker", but also means you cannot switch brokers in the future if Redis no longer works for you. For example, Instagram considered both Redis and RabbitMQ with Celery. This is important because different brokers have different guarantees, e.g. Redis cannot (as of writing) guarantee 100% that your messages are delivered.
Priority queues. RQ's priority queue model is simple and effective: workers read from queues in order (see the sketch after this list). Celery requires spinning up multiple workers to consume from different queues. Both approaches work.
OS support. Celery is the clear winner here, as RQ only runs on systems that support fork, e.g. Unix systems.
Language support. RQ only supports Python, whereas Celery lets you send tasks from one language to a different language.
API. Celery is extremely flexible (multiple result backends, nice config format, workflow canvas support), but naturally this power can be confusing. By contrast, the RQ API is simple.
Subtask support. Celery supports subtasks (e.g. creating new tasks from within existing tasks); I don't know if RQ does.
Community and stability. Celery is probably more established, but both are active projects. As of writing, Celery has ~3500 stars on GitHub while RQ has ~2000, and both projects show active development.
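As promised, a sketch of RQ's in-order queue reading (the queue names are assumptions):

from redis import Redis
from rq import Queue, Worker

redis = Redis()
# The worker drains 'high' completely before touching 'default', then 'low'.
queues = [Queue(name, connection=redis) for name in ('high', 'default', 'low')]
Worker(queues, connection=redis).work()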
In my opinion, Celery is not as complex as its reputation might lead you to believe, but you will have to RTFM.
So, why would anyone be willing to trade the (arguably more full-featured) Celery for RQ? In my mind, it all comes down to simplicity. By restricting itself to Redis and Unix, RQ provides simpler documentation, a simpler codebase, and a simpler API. This means you (and potential contributors to your project) can focus on the code you care about, instead of having to keep details about the task queue system in your working memory. We all have a limit on how many details can be in our heads at once, and by removing the need to keep task queue details in there, RQ lets you get back to the code you care about. That simplicity comes at the expense of features like inter-language task queues, wide OS support, 100% reliable message delivery guarantees, and the ability to switch message brokers easily.
Celery is not that complicated. At its core, you do the step-by-step configuration from the tutorials, create a Celery instance, decorate your function with @celery.task, then run the task with my_task.delay(*args, **kwargs).
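A minimal sketch of those steps (the broker/backend URLs and task body are assumptions):

from celery import Celery

celery = Celery('tasks',
                broker='redis://localhost:6379/0',
                backend='redis://localhost:6379/0')

@celery.task
def my_task(x, y):
    return x + y

# enqueue; a worker started with `celery -A tasks worker` executes it
result = my_task.delay(2, 3)
print(result.get(timeout=10))   # 5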
Judging from your own assessment, it seems you have to choose between lacking (key) features or having some excess hanging around. That is not too hard of a choice in my book.
I plan to program a simple data-flow framework, which basically consists of lazy method calls on objects. If I ever consider distributed programming, what is the easiest way to enable that in Python? Any transparent solution, without me doing network programming?
Or for a start, how can I make use of multi-core processors in Python?
lazy method calls of objects
Can be anything at all really, so let's break it down:
Simple Let-Me-Call-That-Function (RPC)
Well, lucky you! Python has one of the greatest implementations of Remote Procedure Calls:
RPyC.
Just run the server (double-click a file; see the tutorial),
open an interpreter, and:
import rpyc
conn = rpyc.classic.connect("localhost")              # connect to an RPyC classic server
data_obj = conn.modules.lazyme.AwesomeObject("ABCDE") # instantiate an object on the server
print(data_obj.calculate(10))                         # the call executes remotely
And a lazy version (async):
# wrap the remote function with async(), which makes the invocation asynchronous
# (note: modern RPyC spells this rpyc.async_, since async became a reserved
# word in Python 3.7)
acalc = rpyc.async(data_obj.calculate)
res = acalc(10)
print(res.ready, res.value)
Simple Data Distribution
You have a defined unit of work, say a complex image manipulation.
What you do, roughly, is create Nodes, which do the actual work (i.e., take an image, do the manipulation, and return the result), someone who collects the results (a Sink), and someone who creates the work (the Distributor).
Take a look at Celery.
If it's very small scale, or if you just want to play with it, see the Pool object in the multiprocessing package:
from multiprocessing import Pool

def f(x):          # define before creating the pool so the workers can find it
    return x * x

if __name__ == '__main__':   # guard required on platforms that spawn rather than fork
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))   # [1, 4, 9]
And the truly lazy version:
res = p.map_async(f, [1, 2, 3])   # returns immediately
print(res.get())                  # blocks until the results are ready: [1, 4, 9]
map_async returns an AsyncResult object, which can be polled with ready() or collected with get().
Complex Data Distribution
Some multi-level, more-than-just-fire-and-forget complex data manipulation, or a multi-step processing use case.
In such case, you should use a Message Broker such as ZeroMQ or RabbitMQ.
They allow you to send 'messages' across multiple servers with great ease.
They save you from the horrors of the TCP land, but they are a bit more complex (some, like RabbitMQ, require a separate process/server for the Broker). However, they give you much more fine-grained control over the flow of data, and help you build a truly scalable application.
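For a taste of this style, here is a minimal ZeroMQ PUSH/PULL pipeline sketch (ZeroMQ is brokerless, so no separate server process is needed; the port and payload are assumptions, and the pyzmq package is required):

import zmq

ctx = zmq.Context()

# Distributor end of the pipeline
sender = ctx.socket(zmq.PUSH)
sender.bind('tcp://*:5557')

# Node end (normally a separate process, possibly on another machine)
receiver = ctx.socket(zmq.PULL)
receiver.connect('tcp://localhost:5557')

sender.send_json({'image_id': 1})   # fan work out to connected Nodes
print(receiver.recv_json())         # a Node receives: {'image_id': 1}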
Lazy-Anything
While not data distribution per se, it is the hottest trend in web server back-ends: use 'green' threads (or events, or coroutines) to delegate IO-heavy tasks to a dedicated thread, while the application code is busy maxing out the CPU.
I like Eventlet a lot, and gevent is another option.
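A minimal gevent sketch of the idea (the URLs are placeholders):

import gevent
from gevent import monkey
monkey.patch_all()   # make the stdlib's blocking IO cooperative

import urllib.request

def fetch(url):
    return urllib.request.urlopen(url).read()

# the three downloads overlap on one thread instead of running back to back
jobs = [gevent.spawn(fetch, u) for u in ['http://example.com'] * 3]
gevent.joinall(jobs)
print([len(job.value) for job in jobs])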
Try Gearman http://gearman.org/
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
Please read the official python.org resources as a starting point:
http://wiki.python.org/moin/ParallelProcessing
Another framework you might consider is Versile Python (full disclosure: I am a VPy developer). The documentation recipes have relevant code examples. With the framework it is easy to set up and connect to services, and you can either define explicit public method interfaces on classes or use the native Python type framework to remotely access local methods.
Note you would have to set up your program to run in multiple processes in order to take advantage of multiple cores (due to the Python global interpreter lock).
I'm increasingly hearing that Python's Twisted framework rocks and other frameworks pale in comparison.
Can anybody shed some light on this and possibly compare Twisted with other network programming frameworks?
There are a lot of different aspects of Twisted that you might find cool.
Twisted includes lots and lots of protocol implementations, meaning that more likely than not there will be an API you can use to talk to some remote system (either client or server in most cases) - be it HTTP, FTP, SMTP, POP3, IMAP4, DNS, IRC, MSN, OSCAR, XMPP/Jabber, telnet, SSH, SSL, NNTP, or one of the really obscure protocols like Finger, or ident, or one of the lower level protocol-building-protocols like DJB's netstrings, simple line-oriented protocols, or even one of Twisted's custom protocols like Perspective Broker (PB) or Asynchronous Messaging Protocol (AMP).
Another cool thing about Twisted is that on top of these low-level protocol implementations, you'll often find an abstraction that's somewhat easier to use. For example, when writing an HTTP server, Twisted Web provides a "Resource" abstraction which lets you construct URL hierarchies out of Python objects to define how requests will be responded to.
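As a tiny sketch of that Resource abstraction (the path and port are assumptions):

from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import Site

class Hello(Resource):
    isLeaf = True   # terminal resource: it handles the request itself
    def render_GET(self, request):
        return b'Hello from a Resource'

# http://localhost:8080/ is answered by the Python object hierarchy above
reactor.listenTCP(8080, Site(Hello()))
reactor.run()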
All of this is tied together with cooperating APIs, mainly due to the fact that none of this functionality is implemented by blocking on the network, so you don't need to start a thread for every operation you want to do. This contributes to the scalability that people often attribute to Twisted (although it is the kind of scalability that only involves a single computer, not the kind of scalability that lets your application grow to use a whole cluster of hosts) because Twisted can handle thousands of connections in a single thread, which tends to work better than having thousands of threads, each for a single connection.
Avoiding threading is also beneficial for testing and debugging (and hence reliability in general). Since there is no pre-emptive context switching in a typical Twisted-based program, you generally don't need to worry about locking. Race conditions that depend on the order of different network events happening can easily be unit tested by simulating those network events (whereas simulating a context switch isn't a feature provided by most (any?) threading libraries).
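For instance, a protocol can be unit tested deterministically with an in-memory transport; a minimal sketch (the echo protocol is a stand-in):

from twisted.protocols.basic import LineReceiver
from twisted.test.proto_helpers import StringTransport

class Echo(LineReceiver):
    def lineReceived(self, line):
        self.sendLine(line)

proto = Echo()
transport = StringTransport()
proto.makeConnection(transport)
proto.dataReceived(b'hello\r\n')            # simulate the network event by hand
assert transport.value() == b'hello\r\n'    # response captured without a socket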
Twisted is also really, really concerned with quality. So you'll rarely find regressions in a Twisted release, and most of the APIs just work, even if you aren't using them in the common way (because we try to test all the ways you might use them, not just the common way). This is particularly true for all of the code added to Twisted (or modified) in the last 3 or 4 years, since 100% line coverage has been a minimum testing requirement since then.
Another often overlooked strength of Twisted is its ten years of figuring out different platform quirks. There are lots of undocumented socket errors on different platforms and it's really hard to learn that they even exist, let alone handle them. Twisted has gradually covered more and more of these, and it's pretty good about it at this point. Younger projects don't have this experience, so they miss obscure failure modes that will probably only happen to users of any project you release, not to you.
All that said, what I find coolest about Twisted is that it's a pretty boring library that lets me ignore a lot of really boring problems and just focus on the interesting and fun things. :)
Well, it's probably a matter of taste.
Twisted allows you to easily create event-driven network servers/clients without really worrying about everything that goes into accomplishing this. And thanks to the MIT license, Twisted can be used almost anywhere. I haven't done any benchmarking, so I have no idea how it scales, but I'm guessing quite well.
Another plus would be the Twisted Projects, with which you can quickly see how to implement most of the server/services that you would want to.
Twisted also has some great documentation, when I started with it a couple of weeks ago I was able to quickly get a working prototype.
I'm quite new to the Python scene, so please correct me if I'm wrong.