How to simultaneously query two APIs in Python?

Using web.py, I'm building a website in which I display search results from two third-party websites through their public APIs. Unfortunately, it takes about 4 seconds for each API to send back its results. If I query the second API only after I have received the answer from the first, this takes about 8 seconds in total, which is way too long. To bring this down I want to send the requests to both APIs simultaneously and simply continue as soon as I have received an answer from both.
My problem is now: how to do this?
I've never worked with parallel computing, but I've heard of multiprocessing and threading, and I don't really know what the differences or advantages of each are. I also know that, for example, C++ is able to do parallel computations. It could therefore also be an option to write the part that queries the APIs in C++ (I'm a beginner in C++, but I think I'd manage). Finally, there could of course be options that I am totally overlooking. Maybe web.py has some way to do this, or maybe there are Python modules which are specifically made for it?
Since only researching and understanding all of these options would take me quite a lot of time, I thought I'd ask you guys here for some tips.
So which one do you think I should go for? And most importantly: why? All tips are welcome!

You want an asynchronous HTTP request library. Examples of this would be gevent, or grequests.
Alternatively, you could use Python's built-in threading module to run synchronous requests in multiple threads.
Either way, no need to go to another language.
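For illustration, here is a minimal sketch of the threading approach using only the standard library. The API URLs and the query string are placeholders, and this assumes Python 2-era urllib2; on Python 3 you would use urllib.request instead.

import threading
import urllib2  # Python 2; use urllib.request on Python 3

results = {}

def fetch(name, url):
    # each thread blocks on its own request, so the two slow calls overlap
    results[name] = urllib2.urlopen(url).read()

threads = [
    threading.Thread(target=fetch, args=('api1', 'http://api-one.example.com/search?q=foo')),
    threading.Thread(target=fetch, args=('api2', 'http://api-two.example.com/search?q=foo')),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # continue only once both answers have arrived

# results['api1'] and results['api2'] now hold both responses

With two threads, the total wait is roughly the slower of the two calls (about 4 seconds) rather than their sum.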

Related

Python - Sending HTTP request at same time

Hey guys, I have looked at the best ways to do this, and I want to make sure I understand correctly.
If I want:
http1
http2
http3
http...
To be sent at exactly the same time. Should I set each of these up in a thread and then start the threads? I need to make sure it is the exact same time.
I think this can be done in Java, but I'm not familiar with it. Thanks guys for any help you can give!
After reading up more on the process, I don't think this was super clear. Will the async processing send these packets at the same time, so that they arrive at the destination at the same time? From reading different articles it seems like async is just that.
I believe that for what I'm looking for, I would need to use a synchronous method like multiprocessing.
Thoughts?
Your question is not totally clear to me, but have you looked at Twisted? It's an event-driven networking engine written in Python. If you're not familiar with event-driven programming, this Linux Journal article is a good introduction. Basically, instead of threads, asynchronous I/O is used with the reactor pattern (which encapsulates an event loop).
Twisted has multiple web clients. You should probably start with the newer one, called Agent (twisted.web.client.Agent), rather than the older getPage.
If you want to understand Twisted, I can recommend Dave Peticolas's Twisted Introduction. It's long but accessible and detailed.
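For a rough idea of what an Agent-based fetch looks like, here is a minimal sketch. The URL is a placeholder, and readBody requires a more recent Twisted release than the 9.x versions mentioned elsewhere in this thread.

from twisted.internet import reactor
from twisted.web.client import Agent, readBody

def show(body):
    print(body[:200])  # do something useful with the response bytes here

def main():
    agent = Agent(reactor)
    d = agent.request(b'GET', b'http://example.com/')
    d.addCallback(readBody)  # readBody returns a Deferred that fires with the full body
    d.addCallback(show)
    d.addErrback(lambda failure: failure.printTraceback())
    d.addBoth(lambda _: reactor.stop())

reactor.callWhenRunning(main)
reactor.run()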

Alternatives to ApacheBench for profiling my code speed

I've done some experiments using Apache Bench to profile my code response times, and it doesn't quite generate the right kind of data for me. I hope the good people here have ideas.
Specifically, I need a tool that
Does HTTP requests over the network (it doesn't need to do anything very fancy)
Records response times as accurately as possible (at least to a few milliseconds)
Writes the response time data to a file without further processing (or provides it to my code, if a library)
I know about ab -e, which prints data to a file. The problem is that this prints only the quantile data, which is useful, but not what I need. The ab -g option would work, except that it doesn't print sub-second data, meaning I don't have the resolution I need.
I wrote a few lines of Python to do it, but httplib is horribly inefficient and so the results were useless. In general, I need better precision than pure Python is likely to provide. If anyone has suggestions for a library usable from Python, I'm all ears.
I need something that is high performance, repeatable, and reliable.
I know that half my responses are going to be along the lines of "internet latency makes that kind of detailed measurement meaningless." In my particular use case, this is not true. I need high-resolution timing details. Something that actually used my HPET hardware would be awesome.
Throwing a bounty on here because of the low number of answers and views.
I have done this in two ways.
With "loadrunner" which is a wonderful but pretty expensive product (from I think HP these days).
With combination perl/php and the Curl package. I found the CURL api slightly easier to use from php. Its pretty easy to roll your own GET and PUT requests. I would also recommend manually running through some sample requests with Firefox and the LiveHttpHeaders add on to captute the exact format of the http requests you need.
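If you want to stay in Python but still have libcurl do the measuring, a rough sketch with the pycurl bindings might look like this (pycurl exposes libcurl's own timers, which are taken in C rather than in the interpreter; the URL and output file are placeholders):

import pycurl
from io import BytesIO

def timed_get(url):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)  # collect (or discard) the body
    c.perform()
    timings = (
        c.getinfo(pycurl.CONNECT_TIME),        # seconds to TCP connect
        c.getinfo(pycurl.STARTTRANSFER_TIME),  # seconds to first byte
        c.getinfo(pycurl.TOTAL_TIME),          # seconds for the whole transfer
    )
    c.close()
    return timings

# append raw per-request timings to a file for later processing
with open('timings.csv', 'a') as out:
    connect, first_byte, total = timed_get('http://example.com/')
    out.write('%f,%f,%f\n' % (connect, first_byte, total))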
JMeter is pretty handy. It has a GUI from which you can set up your requests and threadpools and it also can be run from the command line.
If you can code in Java, you can look at the combination of JUnitPerf + HttpUnit.
The downside is that you will have to do more things yourself. But at that price you get unlimited flexibility and arguably more precision than with GUI tools, not to mention HTML parsing, JavaScript execution, etc.
There's also another project called Grinder which seems to be intended for a similar task, but I don't have any experience with it.
A good reference for open source performance testing tools: http://www.opensourcetesting.org/performance.php. You will find descriptions and a "most popular" list.
httperf is very powerful.
I've used a script to drive 10 boxes on the same switch to generate load by "replaying" requests to 1 server. I had my web app logging response time (server only) to the granularity I needed, but I didn't care about the response time to the client. I'm not sure you care to include the trip to and from the client in your calculations, but if you do, it shouldn't be too difficult to code up. I then processed my log with a script which extracted the times per URL and drew scatter plots and trend graphs based on load.
This satisfied my requirements which were:
Real world distribution of calls to different urls.
Trending performance based on load.
Not influencing the web app by running other intensive ops on the same box.
I wrote the controller as a shell script that, for each server, started a background process to loop over all the URLs in a file, calling curl on each one. I wrote the log processor in Perl since I was doing more Perl at that time.
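As a hypothetical Python version of that driver loop, you could shell out to curl and record its own time_total measurement per URL (urls.txt and times.log are placeholder file names):

import subprocess

with open('urls.txt') as urls, open('times.log', 'a') as log:
    for url in urls:
        url = url.strip()
        if not url:
            continue
        # -s silences progress, -o /dev/null discards the body,
        # -w prints libcurl's measured total time in seconds
        elapsed = subprocess.check_output(
            ['curl', '-s', '-o', '/dev/null', '-w', '%{time_total}', url]).decode().strip()
        log.write('%s %s\n' % (url, elapsed))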

python web crawler with thread support

These days I'm writing a web crawler script, but one problem is that my internet connection is very slow.
So I was wondering whether it is possible to make the web crawler multithreaded, using mechanize or urllib or the like.
If anyone has experience, sharing info would be much appreciated.
I searched Google but didn't find much useful information.
Thanks in advance
There's a good, simple example on this Stack Overflow thread.
Practical threaded programming with Python is worth reading.
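A minimal worker-pool fetcher along the lines of those examples, using only the standard library (the URL list and thread count are placeholders; on Python 3 the modules are queue and urllib.request):

import threading
import Queue    # 'queue' on Python 3
import urllib2  # 'urllib.request' on Python 3

urls = Queue.Queue()
for u in ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']:
    urls.put(u)

def worker():
    while True:
        try:
            url = urls.get_nowait()
        except Queue.Empty:
            return  # nothing left to fetch
        try:
            page = urllib2.urlopen(url).read()
            print('%s: %d bytes' % (url, len(page)))
        except Exception as exc:
            print('%s failed: %s' % (url, exc))
        finally:
            urls.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
urls.join()  # block until every queued URL has been fetched

With several workers, a slow response from one site no longer holds up fetches from the others.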
Making multiple requests to many websites at the same time will certainly improve your results, since you don't have to wait for a result to arrive before sending new requests.
However, threading is just one of the ways to do that (and a poor one, I might add). Don't use threading for it; just don't wait for the response before sending another request. There's no need for threads to do that.
A good idea is to use Scrapy. It is a fast, high-level screen-scraping and web-crawling framework, used to crawl websites and extract structured data from their pages. It is written in Python and can make many concurrent connections to fetch data at the same time (without using threads to do so). It is really fast. You can also study its source to see how it is implemented.
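For a sense of what that looks like, here is a minimal sketch of a Scrapy spider using a recent Scrapy release; the spider name, start URL, and extracted fields are placeholders.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Scrapy schedules many requests concurrently, so following links
        # does not block on each individual response.
        yield {'url': response.url, 'title': response.css('title::text').get()}
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Assuming the file is saved as example_spider.py, you can run it with: scrapy runspider example_spider.py -o items.json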

What are my options for doing multithreaded/concurrent programming in Python?

I'm writing a simple site spider and I've decided to take this opportunity to learn something new in concurrent programming in Python. Instead of using threads and a queue, I decided to try something else, but I don't know what would suit me.
I have heard about Stackless, Celery, Twisted, Tornado, and other things. I don't want to have to set up a database and the whole other dependencies of Celery, but I would if it's a good fit for my purpose.
My question is: what is a good balance between suitability for my app and usefulness in general? I have taken a look at the tasklets in Stackless, but I'm not sure that the urlopen() call won't block or that they will execute in parallel; I haven't seen that mentioned anywhere.
Can someone give me a few details on my options and what would be best to use?
Thanks.
Tornado is a web server, so it wouldn't help you much in writing a spider. Twisted is much more general (and, inevitably, complex), good for all kinds of networking tasks (and with good integration with the event loop of several GUI frameworks). Indeed, there used to be a twisted.web.spider (but it was removed years ago, since it was unmaintained -- so you'll have to roll your own on top of the facilities Twisted does provide).
I must say that Twisted gets my vote.
Performing event-drive tasks is fairly straightforward in Twisted. Integration with other important system components such as GTK+ and DBus is very easy.
The HTTP client support is basic for now but improving (>9.0.0): see related question.
The added bonus is that Twisted is available in the Ubuntu default repository ;-)
For a quick look at package sizes, see ohloh.net/p/compare. Of course, source size is only a rough metric (what I'd really like is the number of pages of documentation, number of pages of examples, and the number of dependencies), but it can help.

Python Twisted and database connections

Our projects at work include synchronous applications (short lived) and asynchronous Twisted applications (long lived). We're re-factoring our database and are going to build an API module to decouple all of the SQL in that module. I'd like to create that API so both synchronous and asynchronous applications can use it. For the synchronous applications I'd like calls to the database API to just return data (blocking) just like using MySQLdb, but for the asynchronous applications I'd like calls to the same API functions/methods to be non-blocking, probably returning a deferred. Anyone have any hints, suggestions or help they might offer me to do this?
Thanks in advance,
Doug
twisted.enterprise.adbapi seems the way to go -- do you think it fails to match your requirements, and if so, can you please explain why?
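For reference, a minimal sketch of what using the adbapi connection pool might look like; the driver name, database name, credentials, and table are placeholders.

from twisted.enterprise import adbapi
from twisted.internet import reactor

# ConnectionPool runs blocking DB-API calls in a thread pool and
# hands the results back to the reactor as Deferreds.
dbpool = adbapi.ConnectionPool('MySQLdb', db='mydb', user='me', passwd='secret')

def on_results(rows):
    for row in rows:
        print(row)

d = dbpool.runQuery('SELECT id, name FROM users WHERE active = %s', (1,))
d.addCallback(on_results)
d.addBoth(lambda _: reactor.stop())

reactor.run()

On the synchronous side, the same SQL strings could be run directly through MySQLdb and returned as plain rows, so both code paths share the queries while only the asynchronous applications go through the pool.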
Within Twisted, you basically want a wrapper around a function which returns a Deferred (such as the Twisted DB layer), waits for its results, and returns them. However, you can't busy-wait, since that would use up your reactor cycles, and checking for a task to complete with a Twisted non-blocking wait is probably inefficient.
Will inlineCallbacks or deferredGenerator solve your problem? They require a modern Twisted. See the twistedmatrix docs.
from twisted.internet.defer import inlineCallbacks

def thingummy():
    thing = yield makeSomeRequestResultingInDeferred()
    print thing  # the result! hoorj!
thingummy = inlineCallbacks(thingummy)
Another option would be to have two methods which execute the same SQL template, one which uses runInteraction, which blocks, and one which uses runQuery, which returns a Deferred, but that would involve more code paths which do the same thing.
Have you considered borrowing a page from continuation-passing style? Stackless Python supports continuations directly, if you're using it, and the approach appears to have gained some interest already.
All the database libraries I've seen seem to be stubbornly synchronous.
It appears that twisted.enterprise.adbapi solves this problem by using threads to manage a connection pool and wrapping the underlying database libraries. This is obviously not ideal, but I suppose it would work; I haven't actually tried it myself.
Ideally there would be some way to have sqlalchemy and twisted integrated. I found this project, nadbapi, which claims to do it, but it looks like it hasn't been updated since 2007.
