I'm writing a program in Python for which I'm considering a local client-server model, but I am struggling to figure out the best way for the server to communicate with the client(s). A simple, canned solution would be best--I'm not looking to reinvent the wheel. Here are my needs for this program:
Runs on Linux
Server and clients are on the same system, so I don't need to go over a network.
Latency that's not likely to be annoying to an interactive user.
Multiple clients can connect to the same server.
Clients are started independently of the server and can connect/disconnect at any time.
The number of clients is measurable in dozens; I don't need to scale very high.
Clients can come in a few different flavors:
Stream readers - read a continuous stream of data (in practice, this is all text).
State readers - read some state information that updates every once in a while.
Writers - send some data to the server and receive some response each time.
Client type 1 seems simple enough; it's a unidirectional dumb pipe. Client type 2 is a bit more interesting. I want to avoid simply polling the server to check for new data periodically since that would add noticeable latency for the user. The server needs some way to signal to all and only the relevant clients when the state information is updated so that the client can receive the updated state from the server. Client type 3 must be bidirectional; it will send user-supplied data to the server and receive some kind of response after each send.
I've looked at Python's IPC page (http://docs.python.org/2/library/ipc.html), but I don't think any of those solutions are right for my needs. The subprocess module is completely inappropriate, and everything else is a bit more low-level than I'd like.
The similar question Efficient Python to Python IPC isn't quite the same; I don't need to transfer Python objects, I'm not especially worried about CPU efficiency for the number of clients I'll have, I only care about Linux, and none of the answers to that question are especially helpful to me anyway.
Update:
I cannot accept an answer that just points me at a framework/library/module/tool without actually giving an explanation of how it can be used for my three different server-client relationships. If you say, "All of this can be done with named pipes!" I will have to ask "How?" Code snippets would be ideal, but a high-level description of a solution can work too.
Have you already looked into ZeroMQ? It has excellent Python support, and the documented examples already cover your use cases.
It's easy to use in a single-platform, single-machine setup, and it can very easily be extended to a network later.
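For a rough idea of how your three client types could map onto ZeroMQ socket patterns, here is a minimal sketch using pyzmq and local ipc:// endpoints. The endpoint names and messages are placeholders, and each client would normally live in its own process:

    import zmq

    # Placeholder endpoints; ipc:// keeps everything on the local machine.
    STREAM_EP = "ipc:///tmp/demo-stream"   # type 1: server PUB -> stream readers SUB
    STATE_EP = "ipc:///tmp/demo-state"     # type 2: server PUB -> state readers SUB
    WRITE_EP = "ipc:///tmp/demo-write"     # type 3: writers REQ <-> server REP

    ctx = zmq.Context()

    # Server side: one socket per relationship.
    stream_pub = ctx.socket(zmq.PUB)       # fan out the continuous text stream
    stream_pub.bind(STREAM_EP)
    state_pub = ctx.socket(zmq.PUB)        # publish state only when it changes;
    state_pub.bind(STATE_EP)               # subscribers are woken up, no polling
    writer_rep = ctx.socket(zmq.REP)       # one reply per writer request
    writer_rep.bind(WRITE_EP)

    stream_pub.send("a line of streamed text")
    state_pub.send("state changed: 42")

    # Client side (each of these would run in its own process):
    sub = ctx.socket(zmq.SUB)              # a state reader
    sub.connect(STATE_EP)
    sub.setsockopt(zmq.SUBSCRIBE, "")      # subscribe to every update
    # update = sub.recv()                  # blocks until the server publishes

    req = ctx.socket(zmq.REQ)              # a writer
    req.connect(WRITE_EP)
    req.send("user-supplied data")
    # reply = req.recv()                   # the server's response, via writer_rep

Clients can connect and disconnect whenever they like; ZeroMQ takes care of the reconnection plumbing for you.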
I am writing a unicast chat server; the flow is as follows:
The sender sends a message to the chat server; in the message the sender also specifies the recipient id
The chat server will route the message to the right client, based on the recipient id
I implemented the chat server using the standard library's asyncore module. I found that CPU usage jumps once a client connects to the server (from 1% to 24%). I believe the performance is limited by the looping of the handle_write function.
Is there a better (e.g. more efficient) framework to accomplish my chat server requirement?
Thanks in advance
Of course we'd need actual code to debug the problem. But what you're mainly asking is:
Is there a better (e.g. more efficient) framework to accomplish my chat implementation?
Yes. It's generally accepted that asyncore sucks. It's hard to use as well as being inefficient. It's especially bad on Windows, because select especially sucks on Windows.
So, yes, using a different framework will probably make things better.
Unfortunately, an SO question is not a good place to get recommendations for frameworks, but I can throw out a list of usual suspects: twisted, monocle, gevent, eventlet, tulip.
Alternatively, if you're not worried about scalability to more than a few dozen clients, just about performance at the small scale, using a thread per client (or even two threads, one for reads and one for writes) and blocking I/O is incredibly simple.
Finally, if you're staying up to date with Python 3.x, there's a good chance that 3.4 will have a new and improved async I/O module that's much more efficient and much easier to use than asyncore (and it will almost certainly be based on tulip). So… the best solution may be to get a time machine and go forward a few months. (Or, if you're a reader searching for this answer in the future, look in the standard library under IPC and guess which module is the new-and-improved async I/O module.)
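In case it helps, here is a rough sketch of the thread-per-client approach mentioned above: a blocking TCP server that spawns one thread per connection and routes messages by recipient id. The "first line identifies the client" and "recipient_id:text" wire format, and all of the names, are made up for the example, not taken from your code:

    import socket
    import threading

    clients = {}                             # client id -> connected socket
    clients_lock = threading.Lock()

    def handle(conn):
        f = conn.makefile()
        client_id = f.readline().strip()     # first line identifies the client (made-up protocol)
        with clients_lock:
            clients[client_id] = conn
        try:
            while True:
                line = f.readline()          # blocking read, one "recipient_id:text" per line
                if not line:
                    break                    # client disconnected
                recipient_id, _, text = line.partition(":")
                with clients_lock:
                    target = clients.get(recipient_id)
                if target is not None:
                    target.sendall(text)     # route to the chosen recipient
        finally:
            with clients_lock:
                clients.pop(client_id, None)
            conn.close()

    def serve(host="127.0.0.1", port=9999):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(5)
        while True:
            conn, _ = srv.accept()
            t = threading.Thread(target=handle, args=(conn,))
            t.daemon = True                  # worker threads die with the main program
            t.start()

    if __name__ == "__main__":
        serve()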
I just read a web page comparing the efficiency of different Python web servers (Link).
I think I will use gevent, as it seems very efficient.
I have a working Python script that checks the 6,300 or so sites we have to ensure they are up, by sending an HTTP request to each and measuring the response. Currently the script takes about 40 minutes to run to completion. I am interested in ways to speed it up; two thoughts were threading or running multiple instances of the script.
This is the order of execution now:
MySQL query to get all of the active domains to scan (6,300 give or take)
Iterate through each domain and using urllib send an HTTP request to each
If the site doesn't return '200' then log the results
repeat until complete
This seems like it could possibly be sped up significantly with threading but I am not quite sure how that process flow would look since I am not familiar with threading.
If someone could offer a sample high-level process flow and any other pointers for working with threading or offer any other insights on how to improve the script in general it would be appreciated.
The flow would look something like this:
Create a domain Queue
Create a result Queue
MySQL query to get all of the active domains to scan
Put the domains in the domain Queue
Spawn a pool of worker threads
Run the threads
Each worker will get a domain from the domain Queue, send a request and put the result in the result Queue
Wait for the threads to finish
Get everything from the result Queue and log it
You'll probably want to tune the number of threads in the pool, rather than spawning 6,300 threads, one per domain.
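A rough sketch of that flow (Python 2-style, to match your script); get_active_domains and log_result are hypothetical stand-ins for your MySQL query and your existing logging step:

    import Queue
    import threading
    import urllib2

    NUM_WORKERS = 50                         # tune this; far fewer than one thread per domain

    def worker(domain_q, result_q):
        while True:
            try:
                domain = domain_q.get_nowait()
            except Queue.Empty:
                return                       # queue drained, this worker is done
            try:
                status = urllib2.urlopen("http://" + domain, timeout=10).getcode()
            except Exception, e:             # HTTPError, timeouts, DNS failures, ...
                status = getattr(e, "code", None) or "error: %s" % e
            result_q.put((domain, status))

    def main():
        domain_q, result_q = Queue.Queue(), Queue.Queue()
        for domain in get_active_domains():  # hypothetical: your MySQL query
            domain_q.put(domain)

        threads = [threading.Thread(target=worker, args=(domain_q, result_q))
                   for _ in range(NUM_WORKERS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                         # wait for the pool to finish

        while not result_q.empty():
            domain, status = result_q.get()
            if status != 200:
                log_result(domain, status)   # hypothetical: your existing logging step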
You can take a look at the Scrapy framework. It's made for web scraping, it's asynchronous (built on Twisted), and it's pretty fast.
In your case you can just take the list of domains and check whether each one returns 200, without actually scraping anything. It should be much faster.
Here's the link:
http://scrapy.org/
Threading is definitely what you need. It will remove the serialized nature of your algorithm, and since it is mostly I/O bound, you will gain a lot by sending HTTP requests in parallel.
Your flow would become:
MySQL query to get all of the active domains to scan (6,300 give or take)
Iterate through the domains and create a thread for each one that uses urllib to send an HTTP request
Log the results from within the threads
You can make this algorithm better by creating n worker threads with a queue, and adding domains to the queue instead of creating one thread per domain. I just wanted to make things a little easier for you since you're not familiar with threads.
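If you want to skip the queue bookkeeping entirely, a thread pool hides most of it. A minimal sketch using multiprocessing.dummy (a thread-backed Pool); check_domain and the domains list are placeholders:

    import urllib2
    from multiprocessing.dummy import Pool   # same API as multiprocessing, but backed by threads

    def check_domain(domain):
        try:
            return domain, urllib2.urlopen("http://" + domain, timeout=10).getcode()
        except Exception, e:
            return domain, getattr(e, "code", None) or str(e)

    domains = ["example.com", "example.org"]  # in practice: the list from your MySQL query
    pool = Pool(50)                           # 50 worker threads, not one per domain
    results = pool.map(check_domain, domains)
    pool.close()
    pool.join()

    for domain, status in results:
        if status != 200:
            print domain, status              # or log it to the database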
I guess you should go for threading, investigating the optimal number of workers to start so that you don't overwhelm your client. The Python manual offers good examples; by the way, take a look at Download multiple pages concurrently? and at the urllib, threading, and multiprocessing docs.
Could someone suggest a concurrency framework (threads or event-based) in Python that can handle two tasks: one that has to process a LOT of events, and another that processes one or more commands at a slower rate? I am going to prototype with Twisted and see if it meets my needs. I went through http://www.slideshare.net/dabeaz/an-introduction-to-python-concurrency, which is informative, so the other choice I could try seems to be the multiprocessing module.
Background
I am trying to write a program that can interface with a C program on one side and the network on the other. The C program generates events at a high rate (hundreds of thousands, possibly a million messages per second) which need to be consumed without letting it block, and the C program needs to be sent commands arriving from the network.
I think Python with zeromq (http://www.zeromq.org/) will suffice for consuming the events from the C program. But I need to also concurrently process commands from the network in my program. I have used Python with Twisted before to do asynchronous programming, but am not sure if it can handle the zeromq messages concurrently with other tasks fast enough.
I am going to try it out, but I was wondering if anybody has any thoughts on other ways of doing things. I would rather use Python as it would make handling of the commands and keeping state easier than if I had to do it in C.
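For concreteness, here is a rough sketch of the shape I have in mind, using zmq.Poller to multiplex the event socket from the C program and a command socket from the network. The endpoints and socket types are just placeholders, not a settled design:

    import zmq

    ctx = zmq.Context()

    events = ctx.socket(zmq.PULL)            # assumed: the C program PUSHes events to this endpoint
    events.connect("ipc:///tmp/c-events")

    commands = ctx.socket(zmq.REP)           # assumed: network clients send commands via REQ
    commands.bind("tcp://*:5555")

    poller = zmq.Poller()
    poller.register(events, zmq.POLLIN)
    poller.register(commands, zmq.POLLIN)

    while True:
        ready = dict(poller.poll())
        if events in ready:
            msg = events.recv()              # drain events; keep this branch cheap and non-blocking
            # ... update state ...
        if commands in ready:
            cmd = commands.recv()
            # ... pass the command on to the C program ...
            commands.send("ok")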
I did see this post but it does not answer my question: C/Python Socket Performance?
I have been tasked with creating an application that can create thousands of socket-based connections. I can do this in Python, but I want to leave room for performance improvements. I know it's possible in Python because of my past projects, but I'm curious how much of a performance improvement there would be if I were to do this project in C (not C++).
It really depends on what you're doing with the sockets.
The best generic answer is: Usually Python is good enough that it doesn't matter, but sometimes it's not.
The overhead of creating and connecting sockets is minimal, and reading and writing isn't much worse. But that doesn't matter, because there's pretty much never any significant time spent doing that anyway.
There are reactors and proactors for Python every bit as good as the general-purpose ones available for C (and half of the C libraries have Python bindings). If you're not doing much significant work beyond the sockets, this is often your main bottleneck. If you've got a very specific use pattern and/or very tightly specified hardware, you might be able to write a custom reactor or proactor that beats out anything general-purpose. In that case, you pretty much have to go with C, not Python.
But usually, you've got significant work to do beyond just manipulating sockets.
If that work is mostly independent and highly parallelizable, C obviously beats Python (because of the GIL), unless the jobs are heavy enough that you can multi-process them (and keep in mind that "heavy enough" can be pretty heavy on Windows platforms). Except, of course, that it's incredibly easy to screw up performance (not to mention stability) writing multi-threaded C code; really, something like Erlang or Haskell is probably a better bet here than either C or Python. (If you're about to say, "But we've got people who are experienced at C but they can't learn Haskell", then those people are probably not good enough programmers to write multi-threaded code.)
If that work is mostly memory copying between socket buffers, and you can deal with a tightly-specified system, you may be able to write C code that takes advantage of zero-copy, and there's no way to do that in Python.
But if it's mostly typical things like waiting on disk or serialized computation, then it scarcely matters how you write the socket-stuff, because it's going to end up waiting on the real code anyway.
So, without any more information, I'd go with Python, because the time you save getting things up and running and debugged vs. C can be spent optimizing or otherwise improving whatever turns out to matter.
If you're using the Windows platform, learn the one-thread-per-core model of IOCP and stay away from thread pools that entail more or less one thread per socket.
I'm looking for a Python library or a command line tool for downloading multiple files in parallel. My current solution is to download the files sequentially, which is slow. I know you can easily write a half-assed threaded solution in Python, but I always run into annoying problems when using threading. It is for polling a large number of xml feeds from websites.
My requirements for the solution are:
Should be interruptible. Ctrl+C should immediately terminate all downloads.
There should be no leftover processes that you have to kill manually using kill, even if the main program crashes or an exception is thrown.
It should work on Linux and Windows too.
It should retry downloads, be resilient against network errors and should timeout properly.
It should be smart about not hammering the same server with 100+ simultaneous downloads, but queue them in a sane way.
It should handle important http status codes like 301, 302 and 304. That means that for each file, it should take the Last-Modified value as input and only download if it has changed since last time.
Preferably it should have a progress bar or it should be easy to write a progress bar for it to monitor the download progress of all files.
Preferably it should take advantage of http keep-alive to maximize the transfer speed.
Please don't suggest how I may go about implementing the above requirements. I'm looking for a ready-made, battle-tested solution.
I guess I should describe what I want it for too... I have about 300 different data feeds as xml formatted files served from 50 data providers. Each file is between 100kb and 5mb in size. I need to poll them frequently (as in once every few minutes) to determine if any of them has new data I need to process. So it is important that the downloader uses http caching to minimize the amount of data to fetch. It also uses gzip compression obviously.
Then the big problem is how to use the bandwidth as efficiently as possible without overstepping any boundaries. For example, one data provider may consider it abuse if you open 20 simultaneous connections to their data feeds. Instead it may be better to use one or two connections that are reused for multiple files. Or your own connection may be limited in strange ways. My ISP limits the number of DNS lookups you can do, so some kind of DNS caching would be nice.
You can try pycurl; the interface is not easy at first, but once you look at the examples it's not hard to understand. I have used it to fetch thousands of web pages in parallel on a meagre Linux box.
You don't have to deal with threads, so it terminates gracefully, and there are no processes left behind.
It provides options for timeouts and HTTP status handling.
It works on both Linux and Windows.
The only problem is that it provides only basic infrastructure (basically just a Python layer above the excellent curl library). You will have to write a few lines yourself to achieve the features you want.
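A rough sketch of the CurlMulti pattern (Python 2, closely modelled on pycurl's retriever-multi example); the URLs and options are placeholders, and you would add your own retry and If-Modified-Since logic on top:

    import pycurl
    from StringIO import StringIO            # Python 2; use io.BytesIO on Python 3

    urls = ["http://example.com/feed1.xml",  # placeholders for your ~300 feed URLs
            "http://example.com/feed2.xml"]

    multi = pycurl.CurlMulti()
    handles = []
    for url in urls:
        c = pycurl.Curl()
        c.body = StringIO()                  # attach a response buffer to each handle
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEFUNCTION, c.body.write)
        c.setopt(pycurl.FOLLOWLOCATION, 1)   # follow 301/302
        c.setopt(pycurl.CONNECTTIMEOUT, 10)
        c.setopt(pycurl.TIMEOUT, 60)
        multi.add_handle(c)
        handles.append(c)

    # Drive all transfers; select() sleeps until libcurl has work to do.
    num_active = len(handles)
    while num_active:
        while True:
            ret, num_active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        if num_active:
            multi.select(1.0)

    for c in handles:
        print c.getinfo(pycurl.HTTP_CODE), len(c.body.getvalue())
        multi.remove_handle(c)
        c.close()
    multi.close()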
There are lots of options but it will be hard to find one which fits all your needs.
In your case, try this approach:
Create a queue.
Put URLs to download into this queue (or "config objects" which contain the URL and other data like the user name, the destination file, etc).
Create a pool of threads
Each thread should try to fetch a URL (or a config object) from the queue and process it.
Use another thread to collect the results (i.e. another queue). When the number of result objects == number of puts in the first queue, then you're finished.
Make sure that all communication goes via the queue or the "config object". Avoid accessing data structures which are shared between threads. This should save you 99% of the problems.
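A rough sketch of those steps (Python 2); the config dicts and NUM_WORKERS are placeholders, and for brevity the results are collected in the main thread after the workers finish rather than in a separate collector thread:

    import Queue
    import threading
    import urllib

    NUM_WORKERS = 10

    def worker(task_q, result_q):
        while True:
            config = task_q.get()                    # e.g. {"url": ..., "dest": ...}
            if config is None:                       # sentinel: no more work for this thread
                break
            try:
                urllib.urlretrieve(config["url"], config["dest"])
                result_q.put((config["url"], "ok"))
            except Exception, e:
                result_q.put((config["url"], "failed: %s" % e))

    def download_all(configs):
        task_q, result_q = Queue.Queue(), Queue.Queue()
        for cfg in configs:
            task_q.put(cfg)
        threads = []
        for _ in range(NUM_WORKERS):
            task_q.put(None)                         # one sentinel per worker
            t = threading.Thread(target=worker, args=(task_q, result_q))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()
        # We queued len(configs) real tasks, so we expect exactly that many results.
        return [result_q.get() for _ in configs]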
I don't think such a complete library exists, so you'll probably have to write your own. I suggest taking a look at gevent for this task. They even provide a concurrent_download.py example script. Then you can use urllib2 for most of the other requirements, such as handling HTTP status codes, and displaying download progress.
I would suggest Twisted; although it is not a ready-made solution, it provides the main building blocks to implement every feature you listed in an easy way, and it does not use threads.
If you are interested, take a look at the following links:
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#getPage
http://twistedmatrix.com/documents/current/api/twisted.web.client.html#downloadPage
As per your requirements:
1. Supported out of the box
2. Supported out of the box
3. Supported out of the box
4. Timeout supported out of the box; other error handling done through deferreds
5. Achieved easily using cooperators (example 7)
6. Supported out of the box
7. Not supported; solutions exist (and they are not that hard to implement)
8. Not supported; it can be implemented (but it will be relatively hard)
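As a rough illustration of item 5 (limiting concurrency with a cooperator), something along these lines; getPage is the API linked above, everything else is a made-up placeholder:

    from twisted.internet import defer, reactor, task
    from twisted.web.client import getPage

    def parallel_fetch(urls, concurrency=2, callback=lambda body, url: None):
        coop = task.Cooperator()
        # Lazy generator of Deferreds: each cooperator task only starts the next
        # download once its previous one finishes, so at most `concurrency`
        # requests are in flight at any time.
        work = (getPage(url).addCallback(callback, url) for url in urls)
        return defer.DeferredList([coop.coiterate(work) for _ in range(concurrency)])

    def report(body, url):
        print url, len(body)                 # placeholder for real processing

    urls = ["http://example.com/feed%d.xml" % i for i in range(10)]   # placeholders
    d = parallel_fetch(urls, concurrency=2, callback=report)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()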
Nowadays there are excellent Python libs you might want to use - urllib3 and requests
Try using aria2 through the simple Python subprocess module.
It provides everything from your requirements list out of the box, except 7, and 7 is easy to write.
aria2c has a nice XML-RPC/JSON-RPC interface for interacting with it from your scripts.
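For example, a minimal call through subprocess (flags abbreviated; check aria2c --help for the full set):

    import subprocess

    # urls.txt: one URL per line (aria2 also accepts per-URL options in that file)
    subprocess.check_call([
        "aria2c",
        "--input-file=urls.txt",
        "--max-concurrent-downloads=5",      # be polite to each provider
        "--continue",                        # resume partial downloads
    ])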
Does urlgrabber fit your requirements?
http://urlgrabber.baseurl.org/
If it doesn't, you could consider volunteering to help finish it. Contact the authors, Michael Stenner and Ryan Tomayko.
Update: Googling for "parallel wget" yields these, among others:
http://puf.sourceforge.net/
http://www.commandlinefu.com/commands/view/3269/parallel-file-downloading-with-wget
It seems like you have a number of options to choose from.
I used the standard libs for that, urllib.urlretrieve to be precise. I downloaded podcasts this way, via a simple thread pool, each thread using its own retrieve. I did about 10 simultaneous connections; more should not be a problem. Resuming an interrupted download, maybe not. Ctrl-C could be handled, I guess. It worked on Windows, and I installed a handler for progress bars. All in all, 2 screens of code, plus 2 screens for generating the URLs to retrieve.
This seems pretty flexible:
http://keramida.wordpress.com/2010/01/19/parallel-downloads-with-python-and-gnu-wget/
Threading isn't "half-assed" unless you're a bad programmer. The best general approach to this problem is the producer / consumer model. You have one dedicated URL producer, and N dedicated download threads (or even processes if you use the multiprocessing model).
As for all of your requirements, ALL of them CAN be done with the normal python threaded model (yes, even catching Ctrl+C -- I've done it).
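For the Ctrl+C point specifically, the usual trick is daemon worker threads plus a main thread that sleeps instead of blocking in join(). A rough sketch (Python 2; the URL list and the download step are placeholders):

    import Queue
    import threading
    import time

    urls = ["http://example.com/feed%d.xml" % i for i in range(100)]   # placeholders
    task_q = Queue.Queue()

    def worker():
        while True:
            url = task_q.get()
            # ... download url here ...
            task_q.task_done()

    for _ in range(10):
        t = threading.Thread(target=worker)
        t.daemon = True                  # daemon threads are killed when the main thread exits
        t.start()

    for url in urls:
        task_q.put(url)

    try:
        # empty() only means every task has been handed out, which is close
        # enough for a sketch; a blocking join() would swallow KeyboardInterrupt.
        while not task_q.empty():
            time.sleep(0.5)
    except KeyboardInterrupt:
        print "Interrupted, exiting"     # the daemon workers die with us; nothing to kill by hand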