Distributed programming in Python [closed]

I plan to program a simple data flow framework, which basically consists of lazy method calls on objects. If I ever consider distributed programming, what is the easiest way to enable that in Python? Is there any transparent solution that doesn't require me to do network programming myself?
Or, for a start, how can I make use of multi-core processors in Python?

lazy method calls of objects
Can be anything at all really, so let's break it down:
Simple Let-Me-Call-That-Function (RPC)
Well, lucky you! Python has one of the greatest implementations of Remote Procedure Calls:
RPyC.
Just run the server (double-click a file; see the tutorial),
open an interpreter and:
import rpyc
conn = rpyc.classic.connect("localhost")                # connect to the classic server
data_obj = conn.modules.lazyme.AwesomeObject("ABCDE")   # 'lazyme' is a module living on the server machine
print(data_obj.calculate(10))                           # runs remotely; the result comes back over the wire
And a lazy version (async):
# wrap the remote function with async_(), which makes the invocation asynchronous
# (older RPyC versions spelled this rpyc.async, which stopped being valid once
# async became a Python keyword)
acalc = rpyc.async_(data_obj.calculate)
res = acalc(10)
print(res.ready, res.value)
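For context, here is a rough sketch of the server side, under the assumption that the hypothetical lazyme module simply lives on the machine running the server; starting the classic server below is equivalent to running the bundled rpyc_classic.py script.
# lazyme.py -- a hypothetical module available on the server machine
class AwesomeObject:
    def __init__(self, data):
        self.data = data
    def calculate(self, n):
        return len(self.data) * n

# server.py -- a classic RPyC server (or just run rpyc_classic.py)
from rpyc.utils.server import ThreadedServer
from rpyc import SlaveService
ThreadedServer(SlaveService, port=18812).start()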
Simple Data Distribution
You have a defined unit of work, say a complex image manipulation.
Roughly, you create Nodes, which do the actual work (i.e. take an image, do the manipulation, and return the result), someone who collects the results (a Sink), and someone who creates the work (the Distributor).
Take a look at Celery.
If it's very small scale, or if you just want to play with it, see the Pool object in the multiprocessing package:
from multiprocessing import Pool

def f(x):
    return x * x

p = Pool(5)   # on Windows/macOS (spawn), wrap this in an `if __name__ == "__main__":` guard
print(p.map(f, [1, 2, 3]))
And the truly-lazy version:
res = p.map_async(f, [1, 2, 3])
Which returns an AsyncResult object that can be inspected for the results later (res.get() blocks until they arrive).
Complex Data Distribution
Some multi-level, more-than-just-fire-and-forget data manipulation, or a multi-step processing use case.
In such cases, you should use a message broker such as RabbitMQ, or a messaging library such as ZeroMQ.
They allow you to send 'messages' across multiple servers with great ease.
They save you from the horrors of TCP land, but they are a bit more complex (some, like RabbitMQ, require a separate process/server for the broker). However, they give you much more fine-grained control over the flow of data, and help you build a truly scalable application.
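As a rough illustration of the distributor/worker split with ZeroMQ (a sketch assuming pyzmq is installed; the port and the work item are made up):
import zmq

context = zmq.Context()

# Distributor side: hand out work items.
sender = context.socket(zmq.PUSH)
sender.bind("tcp://*:5557")

# Worker side: normally a separate process, possibly on another machine.
receiver = context.socket(zmq.PULL)
receiver.connect("tcp://localhost:5557")

sender.send_json({"image_id": 42, "operation": "blur"})   # made-up work item
print("worker got:", receiver.recv_json())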
Lazy-Anything
While not data distribution per se, it is the hottest trend in web-server back ends: use 'green' threads (or events, or coroutines) to delegate IO-heavy tasks to a dedicated thread, while the application code is busy maxing out the CPU.
I like Eventlet a lot, and gevent is another option.
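A minimal gevent sketch of that idea, assuming gevent is installed (the URLs are just placeholders): monkey-patching makes the blocking stdlib calls cooperative, so the greenlets overlap their network waits.
from gevent import monkey
monkey.patch_all()   # make blocking stdlib I/O yield to other greenlets

import gevent
import urllib.request

def fetch(url):
    return urllib.request.urlopen(url).read()

jobs = [gevent.spawn(fetch, u) for u in ("http://example.com/", "http://example.org/")]
gevent.joinall(jobs, timeout=10)
print([len(job.value or b"") for job in jobs])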

Try Gearman http://gearman.org/
Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.
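A rough sketch with the python-gearman client library, assuming a gearmand server is listening on localhost:4730 (the 'reverse' task is the usual toy example, not something Gearman ships with):
import gearman

# worker.py
worker = gearman.GearmanWorker(["localhost:4730"])

def task_reverse(gearman_worker, gearman_job):
    return gearman_job.data[::-1]   # the "work": reverse the payload

worker.register_task("reverse", task_reverse)
worker.work()   # blocks, serving jobs

# client.py (run as a separate process)
# client = gearman.GearmanClient(["localhost:4730"])
# result = client.submit_job("reverse", "Hello Gearman")
# print(result.result)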

Please read the official python.org resources as a starting point:
http://wiki.python.org/moin/ParallelProcessing

Another framework you might consider is Versile Python (full disclosure: I am a VPy developer). The documentation recipes have relevant code examples. With the framework it is easy to set up and connect to services, and you can either define explicit public method interfaces for classes or use the native Python type framework to remotely access local methods.
Note that you would have to set up your program to run in multiple processes in order to take advantage of multiple cores (due to the Python global interpreter lock).


Architecting a multithreaded app with a REST interface [closed]

I'm working through possible architectures for a problem. In as few words as possible, the problem is: I need to design a system that allows clients to connect using HTTP/REST to kick off long-running processes. Each process will create a persistent connection to a third-party server and write the received data to a queue. Each process will terminate only if the third-party server closes the connection or another HTTP/REST request is received indicating it should be terminated.
Constraints and background:
Clients must be able to connect using HTTP/REST
System must be written in Python
I'm a lower level C guy (with enough Python experience to feel competent) but trying to wrap my head around the Python frameworks available for making this easier. My gut is to jump into the weeds and I know if I implement this as I'm thinking, I might as well have written it in C. Don't want that. I want to leverage as many frameworks and libraries for Python as possible. Performance is not a top priority.
Approaches I've considered:
In doing research, I came across Twisted which might be a fit and seems to make sense to me (thinking about this as a daemon). I'm imagining the final product would be a Twisted app that exposes a REST interface, dispatches new threads connecting to the third party service for each client request received, and would manage its own thread pool. I'm familiar with threading, though admittedly haven't done anything in Python with them yet. In a nutshell, Twisted looks very cool, though in the end, I'm left wondering if I'm overcomplicating this.
The second approach I considered is using Celery and Flask and simply letting Celery handle all the dispatching, thread management, etc. I found this article showing Celery and Flask playing nicely together. It seems like a much simpler approach.
After writing this, I'm leaning towards the second option of using Celery and Flask, though I don't know much about Celery, so I'm looking for any advice you might have, as well as other possible architectures that I'm not considering. I really appreciate it and thank you in advance.
Yes, Twisted is overkill here.
From what you described, the combination of Celery and Flask would suffice. It would allow you to implement a REST interface that kicks off your long-running processes as Celery tasks. You can easily implement a REST method allowing clients to stop running tasks by invoking Celery's revoke method on a task's ID. Take note that Celery depends on a message broker for sending and receiving messages (frequently RabbitMQ is used) and a result backend for storing results (frequently Redis is used).
>>> from celery.task.control import revoke   # Celery 3.x; in Celery 4+/5 use app.control.revoke instead
>>> revoke(task_id, terminate=True)
http://docs.celeryproject.org/en/latest/userguide/workers.html#revoking-tasks
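As a rough sketch of how the pieces could fit together (the endpoint paths and the stream_from_third_party task are hypothetical, and this assumes a broker such as RabbitMQ or Redis is running):
from celery import Celery
from flask import Flask, jsonify

app = Flask(__name__)
celery = Celery(__name__, broker="amqp://localhost")   # assumes RabbitMQ; Redis works too

@celery.task
def stream_from_third_party(endpoint):
    # hypothetical long-running task: hold the connection open and queue incoming data
    ...

@app.route("/streams", methods=["POST"])
def start_stream():
    task = stream_from_third_party.delay("wss://thirdparty.example/feed")   # made-up endpoint
    return jsonify(task_id=task.id), 202

@app.route("/streams/<task_id>", methods=["DELETE"])
def stop_stream(task_id):
    celery.control.revoke(task_id, terminate=True)
    return "", 204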

Efficient Python IPC [closed]

I'm making an application in Python 3 that will be divided into batch and GUI parts.
The batch part is responsible for the processing logic and the GUI is responsible for displaying it.
Which inter-process communication (IPC) framework should I use, given the following requirements:
The GUI can run on a different device than the batch part (on the same device, on a smartphone, a tablet, etc., locally or over the network).
The batch part (the Python 3 IPC library) should work without problems on Linux, Mac, Windows, ...
The IPC should support GUIs written in different languages (Python, Javascript, ...).
The performance of the IPC is important - it should be as "interactive" as possible, but without losing information.
Several GUIs could be connected to the same batch process.
Additional: would the choice be different if the GUI were guaranteed to also be written in Python?
Edit:
I have found a lot of IPC libraries, like here: Efficient Python to Python IPC, and also ActiveMQ, RabbitMQ, ZeroMQ, etc.
The best-looking options I have found so far are:
RabbitMQ
ZeroMQ
Pyro
Are they appropriate solutions to this problem? If not, why not? And if something else is better, please tell me why as well.
The three you mentioned seem a good fit and will uphold your requirements. I think you should go with whatever you feel most comfortable/familiar with.
From my personal experience, I do believe ZeroMQ is the best combination of efficiency, ease of use and interoperability. I had an easy time integrating zmq 2.2 with Python 2.7, so that would be my personal favorite. However, as I said, I'm quite sure you can't go wrong with any of the three frameworks.
Half related: requirements tend to change with time, and you may decide to switch frameworks later on, so encapsulating the dependency on the framework would be a good design pattern to use (e.g. having a single conduit module that interacts with the framework and has its API use your internal data structures and domain language), as in the sketch below.
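A toy sketch of that encapsulation idea, assuming ZeroMQ underneath (every name here is hypothetical; the point is only that the rest of the application imports conduit, never zmq):
# conduit.py -- the only module that knows which IPC framework is in use
import json
import zmq

_context = zmq.Context()
_socket = _context.socket(zmq.PUB)
_socket.bind("tcp://*:5556")

def publish_update(update):
    """Send a domain-level update (a plain dict) to all connected GUIs."""
    _socket.send_string(json.dumps(update))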
I've used the Redis engine for this. Extremely simple, and lightweight.
Server side does:
import redis

r = redis.Redis()             # connect to a local Redis server
p = r.pubsub()
p.subscribe('mychannel')      # subscribe to a "channel"
for x in p.listen():
    print("I got message", x)
Client side does:
import redis

r = redis.Redis()
r.publish('mychannel', mymessage)   # mymessage is any string
"messages" are strings (of any size). If you need to pass complex data structures, I like to use json.loads and json.dumps to convert between python dicts/arrays and strings -
"pickle" is perhaps the better way to do this for python-to-python communication, though JSON means "the other side" can be written in anything.
Now there are a billion other things Redis is good for - and they all inherently are just as simple.
You are asking for a lot of things from the framework: network-enabled, multi-platform, multi-language, high performance (which ideally should be further specified - what does it mean? Bandwidth? Latency? What is "good enough"; are we talking kB/s, MB/s, GB/s? 1 ms or 1000 ms round-trip?). Plus there are a lot of things not mentioned which can easily come into play, e.g. do you need authentication or encryption? Some frameworks give you such functionality; others rely on you implementing that part of the puzzle yourself.
There probably exists no silver-bullet product which will give you an ideal solution optimizing all of those requirements at the same time. As for the 'additional' component of your question: yes, if you restrict the language requirements to Python only, or further distinguish between key vs. nice-to-have requirements, more solutions would be available.
One technology you might want to have a look at is Versile Python (full disclosure: I am one of the developers). It is multi-platform and supports Python v2.6+/v3 and Java SE 6+. Regarding performance, it depends on what your requirements are. If you have any questions about the technology, just ask on the forum.
The solution is dbus.
It is a mature solution, available for a lot of languages (C, Python, ...; just google for dbus + your favorite language). Though not as fast as shared memory, it is still fast enough for pretty much everything not requiring (hard) real-time properties.
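For a taste of the API, a minimal sketch using the dbus-python bindings to call a standard method on the session bus (assumes a desktop session with D-Bus running):
import dbus

bus = dbus.SessionBus()
proxy = bus.get_object("org.freedesktop.DBus", "/org/freedesktop/DBus")
iface = dbus.Interface(proxy, dbus_interface="org.freedesktop.DBus")
print(iface.ListNames())   # every bus name currently connected to the session bus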
I'll take a different tack here and ask: why not use the de facto RPC language of the Internet, i.e. HTTP REST APIs?
With Python Requests on the client side and Flask on the server side, you get these kinds of benefits (a rough sketch follows the list):
Existing HTTP REST tools like Postman can access and test your server.
Those same tools can document your API.
If you also use JSON, then you get a lot of tooling that works with that too.
You get proven security practices and solutions (session-based security and SSL/TLS).
It's a familiar pattern for a lot of different developers.
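A minimal sketch of that pattern, with a made-up /compute endpoint (Flask on the server, Requests on the client):
# server.py
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/compute", methods=["POST"])
def compute():
    payload = request.get_json()
    return jsonify(result=payload["x"] * payload["x"])

if __name__ == "__main__":
    app.run(port=5000)

# client.py (run separately)
# import requests
# print(requests.post("http://localhost:5000/compute", json={"x": 7}).json())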

Question comparing multiprocessing vs. Twisted

Got a situation where I'm going to be parsing websites. Each site has to have its own "parser" and possibly its own way of dealing with cookies, etc.
I'm trying to get in my head which would be a better choice.
Choice I:
I can create a multiprocessing function, where the (masterspawn) app gets an input URL, and in turn it spawns a process/function within the masterspawn app that then handles all the setup/fetching/parsing of the page/URL.
This approach would have one master app running, and it in turn creates multiple instances of the internal function. Should be fast, yes/no?
Choice II:
I could create a "Twisted" kind of server that would essentially do the same thing as Choice I. The difference being that using "Twisted" would also impose some overhead. I'm trying to evaluate Twisted with regard to it being a "server", but I don't need it to perform the fetching of the URL.
Choice III:
I could use Scrapy. I'm inclined not to go this route as I don't want/need the overhead that Scrapy appears to have. As I stated, each of the targeted URLs needs its own parse function, as well as its own way of dealing with the cookies...
My goal is to basically have the "architected" solution spread across multiple boxes, where each client box interfaces with a master server that allocates the URLs to be parsed.
thanks for any comments on this..
-tom
There are two dimensions to this question: concurrency and distribution.
Concurrency: either Twisted or multiprocessing will do the job of concurrently handling fetching/parsing jobs. I'm not sure, though, where your premise of the "Twisted overhead" comes from. On the contrary, the multiprocessing path would incur much more overhead, since a (relatively heavyweight) OS process would have to be spawned. Twisted's way of handling concurrency is much more lightweight.
Distribution: multiprocessing won't distribute your fetch/parse jobs to different boxes. Twisted can do this, e.g. using the AMP protocol-building facilities.
I cannot comment on scrapy, never having used it.
For this particular question I'd go with multiprocessing - it's simple to use and simple to understand. You don't particularly need Twisted, so why take on the extra complication.
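A bare-bones sketch of the multiprocessing route, where the per-site parser registry is hypothetical and the real parsing is left as a placeholder:
from multiprocessing import Pool
from urllib.parse import urlparse
from urllib.request import urlopen

def parse_generic(html):
    return len(html)   # placeholder for real parsing

PARSERS = {"example.com": parse_generic}   # one entry per targeted site

def fetch_and_parse(url):
    html = urlopen(url).read()
    parser = PARSERS.get(urlparse(url).hostname, parse_generic)
    return url, parser(html)

if __name__ == "__main__":
    urls = ["http://example.com/", "http://example.org/"]
    with Pool(4) as pool:
        for url, result in pool.map(fetch_and_parse, urls):
            print(url, result)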
One other option you might want to consider: use a message queue. Have the master drop URLs onto a queue (e.g. beanstalkd, resque, 0mq) and have worker processes pick up the URLs and process them. You'll get both concurrency and distribution: you can run workers on as many machines as you want.

How should I share and store data in a small multithreaded python application?

I'm writing a small multithreaded client-side Python application that contains a small webserver (which only serves pages to localhost) and a daemon. The webserver loads data and puts it into a persistent "datastore", and the daemon processes this data, modifies it, and adds some more. It should also take care of synchronization with the disk.
I'd like to avoid complicated external things like SQL or other databases as much as possible.
What are good and simple ways to design the datastore? Bonus points if your solution uses only standard python.
What you're seeking isn't too Python-specific, because AFAIU you want to communicate between two different processes which are only incidentally written in Python. If this indeed is your problem, you should look for a general solution, not a Python-specific one.
I think that a simple NoSQL key-value datastore such as Redis, for example, could be a very nice solution for your situation. Contrary to being "complicated", using a tool designed specifically for such a purpose will actually make your code simpler.
If you insist on a Python-only solution, then consider using the Python bindings for SQLite, which come pre-installed with Python. An SQLite DB can be concurrently used by two processes in a safe manner, as long as your semantics of data access are well defined (i.e. problems you have to solve anyway, the tool notwithstanding).
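A small standard-library-only sketch of the SQLite option; the table layout is made up, and both processes would open the same file:
import sqlite3

conn = sqlite3.connect("datastore.db", timeout=10)
conn.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, payload TEXT, done INTEGER DEFAULT 0)")
conn.commit()

# webserver process: add new data
conn.execute("INSERT INTO items (payload) VALUES (?)", ("some data",))
conn.commit()

# daemon process: pick up unprocessed rows, work on them, mark them done
rows = conn.execute("SELECT id, payload FROM items WHERE done = 0").fetchall()
for row_id, payload in rows:
    conn.execute("UPDATE items SET done = 1 WHERE id = ?", (row_id,))
conn.commit()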

Python - question regarding the concurrent use of `multiprocess`

I want to use Python's multiprocessing to do concurrent processing without using locks (locks, to me, are the opposite of multiprocessing), because I want to build up multiple reports from different resources at the exact same time during a web request (normally that takes about 3 seconds, but with multiprocessing I can do it in 0.5 seconds).
My problem is that if I expose such a feature to the web and get 10 users pulling the same report at the same time, I suddenly have 60 interpreters open at the same time (which would crash the system). Is this just the common-sense result of using multiprocessing, or is there a trick to get around this potential nightmare?
Thanks
If you're really worried about having too many instances, you could think about protecting the call with a Semaphore object. If I understand what you're doing, then you can use the threading Semaphore object:
from threading import Semaphore

sem = Semaphore(10)

with sem:
    make_multiprocessing_call()   # placeholder from the question: the call that fans out to worker processes
I'm assuming that make_multiprocessing_call() will clean up after itself.
This way only 10 "extra" instances of Python will ever be opened; if another request comes along, it will just have to wait until the previous ones have completed. Unfortunately this won't be in "queue" order... or any order in particular.
Hope that helps
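To make the idea concrete, a sketch of how the pieces might fit together: the Semaphore caps the number of report requests that fan out at once, and each request uses a pool of 6 worker processes (which is roughly where the 60 interpreters in the question come from). Every name below is hypothetical.
from multiprocessing import Pool
from threading import Semaphore

sem = Semaphore(10)   # at most 10 reports being built at any moment

def fetch_source(source):
    # placeholder for the real per-resource report work
    return {"source": source, "rows": 0}

def build_report(sources):
    with sem:                   # waits here if 10 reports are already running
        with Pool(6) as pool:   # 6 worker interpreters per report
            return pool.map(fetch_source, sources)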
You are barking up the wrong tree if you are trying to use multiprocessing to add concurrency to a network app. You are barking up a completely wrong tree if you're creating processes for each request. multiprocessing is not what you want (at least as a concurrency model).
There's a good chance you want an asynchronous networking framework like Twisted.
Locks are only ever necessary if you have multiple agents writing to a source. If they are just reading, locks are not needed (and, as you said, would defeat the purpose of multiprocessing).
Are you sure that would crash the system? On a web server using CGI, each request spawns a new process, so it's not unusual to see thousands of simultaneous processes (granted, in Python one should use WSGI and avoid this), which do not crash the system.
I suggest you test your theory -- it shouldn't be difficult to manufacture 10 simultaneous accesses -- and see if your server really does crash.
