Python oauth2client async

I am fighting with Tornado and the official Python oauth2client and gcloud... modules.
These modules accept an alternate HTTP client passed with http=, as long as it has a request method that any of these libraries can call whenever an HTTP request must be sent to Google and/or the access tokens must be renewed using the refresh tokens.
I have created a simple class which has a self.client = AsyncHTTPClient().
Then, in its request method, it returns self.client.fetch(...).
My goal is to be able to yield any of these libraries' calls, so that Tornado will execute them asynchronously.
The thing is that they depend heavily on what the default client, httplib2.Http(), returns: a (response, content) tuple.
I am really stuck and cannot find a clean way of making this async.
If anyone has already found a way, please help.
Thank you in advance

These libraries do not support asynchronous operation, and the porting process is not always easy.
oauth2client
Depending on what you want to do, Tornado's GoogleOAuth2Mixin or tornado-alf may be enough.
gcloud
I am not aware of any Tornado/asyncio implementation of gcloud-python, so you could:
write it yourself. Again, it's not a simple transport change of Connection.http or request; all the surrounding code must be able to use/yield futures/coroutines.
wrap it in a ThreadPoolExecutor (as #Apero mentioned). This is a high-level API, so any nested API calls within that yield will be executed in the same thread (not using the pool). It could work well; see the sketch after this list.
run it as an external app (with ProcessPoolExecutor or Popen).
When I had a similar problem with AWS a couple of years ago, I ended up executing a CLI asynchronously (Tornado + subprocess.Popen + some CLI (awscli, or boto based)) and handling simple cases (like S3 or basic EC2 operations) with plain AsyncHTTPClient.
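A minimal sketch of the ThreadPoolExecutor option, assuming gcloud-python's storage client and a Tornado coroutine caller; the bucket name and function names here are illustrative, not part of any of these libraries:

from concurrent.futures import ThreadPoolExecutor

from tornado import gen
from gcloud import storage

executor = ThreadPoolExecutor(max_workers=4)

def list_blobs_blocking(bucket_name):
    # Plain, blocking gcloud-python code; it runs entirely in a pool thread.
    client = storage.Client()
    return list(client.get_bucket(bucket_name).list_blobs())

@gen.coroutine
def list_blobs_async(bucket_name):
    # Tornado coroutines can yield concurrent.futures.Future objects, so the
    # IOLoop stays free while the blocking call runs in the executor.
    blobs = yield executor.submit(list_blobs_blocking, bucket_name)
    raise gen.Return(blobs)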

Related

Caching Google API calls for unit tests

I've got a Google App Engine project that uses the Google Cloud Language API, and I'm using the Google API Client Library (Python) to make the API calls.
When running my unit tests, I make quite a few calls to the API. This slows down my testing and also incurs costs.
I'd like to cache the calls to the Google API to speed up my tests and avoid the API charges, and I'd rather not roll my own if another solution is available.
I found this Google API page, which suggests doing this:
import httplib2
http = httplib2.Http(cache=".cache")
I've added these lines to my code (there is another option to use GAE memcache, but that won't be persisted between test code invocations), and right after these lines I create my API call connection:
NLP = discovery.build("language", "v1", API_KEY)
The caching isn't working, and the above solution seems too simple, so I suspect I am missing something.
UPDATE:
I updated my tests so that App Engine is not used (just a regular unit test) and I also figured out that I can pass the http I created to the Google API client like this:
NLP = discovery.build("language", "v1", http, API_KEY)
Now the initial discovery call is cached, but the actual API calls are not; e.g., this call is not cached:
result = NLP.documents().annotateText(body=data).execute()
The suggested code:
http = httplib2.Http(cache=".cache")
is trying to cache to the local filesystem in a directory called ".cache". On App Engine, you cannot write to the local filesystem, so this does nothing.
Instead, you could try caching to Memcache. The other suggestion on the Python Client docs referenced is to do exactly this:
from google.appengine.api import memcache
http = httplib2.Http(cache=memcache)
Since all App Engine apps get free access to shared memcache, this should be better than nothing.
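For reference, a small sketch (hedged; API_KEY is a placeholder, and the keyword arguments simply avoid relying on positional order) of wiring the memcache-backed Http object into the client build:

import httplib2
from google.appengine.api import memcache
from googleapiclient import discovery

API_KEY = "your-api-key"  # placeholder
http = httplib2.Http(cache=memcache)
NLP = discovery.build("language", "v1", http=http, developerKey=API_KEY)
# Subsequent NLP.documents().annotateText(...).execute() calls now go through
# the cache-backed http object.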
If this fails, you could also try memoization. I've had success memoizing calls to slow or flaky APIs, but it comes at the cost of increased memory usage (so I need bigger instances).
EDIT: I see from your comment you're having this problem locally. I was originally thinking that memoization would be an alternative, but the need to hack on httplib2 makes that overly complicated. I'm back to thinking about how to convince httplib2 to do the right thing.
If you're trying to make a test run faster by caching an API call result, stop and consider whether you may have taken a wrong turn.
If you can restructure your code so that you can replace the API call with a unittest.mock, your tests will run much, much faster.
I just came across vcrpy which seems to do exactly this. I'll update this answer after I've had a chance to try it out.
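A rough sketch of the unittest.mock route; "myapp.nlp" and analyze() are hypothetical names standing in for wherever NLP is created and used:

import unittest
import mock  # `pip install mock` on Python 2, or use unittest.mock on Python 3

CANNED = {"language": "en", "sentences": [], "tokens": []}

class AnnotateTextTest(unittest.TestCase):
    @mock.patch("myapp.nlp.NLP")
    def test_annotate_uses_canned_response(self, fake_nlp):
        # The chained NLP.documents().annotateText(...).execute() call now
        # returns canned data without hitting the network.
        fake_nlp.documents.return_value.annotateText.return_value.execute.return_value = CANNED
        from myapp import nlp
        self.assertEqual(nlp.analyze("some text")["language"], "en")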

POST method for webhooks dropbox

I simply want to receive notifications from dropbox that a change has been made. I am currently following this tutorial:
https://www.dropbox.com/developers/reference/webhooks#tutorial
The GET method is done, verification is good.
However, when trying to mimic their implementation of POST, I am struggling because of a few things:
I have no idea what redis_url means in the def_process function of the tutorial.
I can't actually verify if anything is really being sent from Dropbox.
Also, any advice on how I can debug? I can't print anything from my program since it has to run on a site rather than in an IDE.
Redis is a key-value store; it's just a way to cache your data throughout your application.
For example, the access token that is received after the OAuth callback is stored:
redis_client.hset('tokens', uid, access_token)
only to be used later in process_user:
token = redis_client.hget('tokens', uid)
(code from https://github.com/dropbox/mdwebhook/blob/master/app.py as suggested by their documentation: https://www.dropbox.com/developers/reference/webhooks#webhooks)
The same goes for per-user delta cursors that are also stored.
There are plenty of resources on how to install Redis, for example:
https://www.digitalocean.com/community/tutorials/how-to-install-and-use-redis
In this case your redis_url would be something like:
"redis://localhost:6379/"
There are also hosted solutions, e.g. http://redistogo.com/
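A short sketch of how the URL is typically consumed (assuming the redis package is installed; the environment variable name is just an example):

import os
import redis

redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379/")
redis_client = redis.from_url(redis_url)
# The client is then used exactly as in the tutorial snippets above, e.g.:
# redis_client.hset('tokens', uid, access_token)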
A possible workaround would be to use a database for this purpose.
As for debugging, you could use the logging facility for Python; it's thread-safe and capable of writing output to a file stream, and it should provide you with plenty of information if properly used.
More info here:
https://docs.python.org/2/howto/logging.html
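A minimal setup could look like this (the filename and format are just examples); you can then tail the file on the server instead of relying on print:

import logging

logging.basicConfig(
    filename="webhook.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)

logger.debug("Webhook POST received, signature=%s", "example-signature")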

Testing Python app which does HTTP requests

For testing I use pytest so it would be great if you suggest something pytest specific.
I have some code which uses the requests library. What it does is basically simple POST/GET requests for logging in, parsing data, etc.
Surely I want to test that code locally without doing any actual HTTP requests.
A monkeypatch funcarg could be the solution, but I think that mocking requests.get(...) calls, or Python's urllib directly, isn't good because, for example, there are functions which do more than one HTTP request inside, so I can't just mock requests.get("anyURL") with a simple lambda *args, **kwargs: """<html>response</html>""".
There are different URLs which should return different content, sometimes based on POST/GET data. Also, I have no idea how requests.session will behave in case of direct mocking. Besides that, how do I emulate session termination? How do I emulate a connection failure?
So in the end, in my opinion, it's quite hard to use monkey patching here. At least I am not able to write a good mocking function which takes everything into account. Also, if I choose to mock urllib directly and someday the requests library starts using something different, all my tests will fail.
So I think the best way is to use an actual HTTP server which starts on a test run and, if possible, takes pytest's scopes etc. into account (so it's a funcarg). While googling I found only two solutions:
https://pypi.python.org/pypi/pytest-localserver
https://github.com/kevin1024/pytest-httpbin
The first one sets up an HTTP server and serves predefined content over a specific URL. That definitely does not work for me because, as I mentioned, some of the functions I intend to test do several requests, so all the inner requests.get() calls would get the same answer. Bad.
The second one, as far as I can see, has the same problem. Or at least I don't understand how to use it.
A third option could be writing a small Flask-based service, but I guess I'll run into the problem that the things I use in tests should themselves be tested, which is bad practice.
You could instead unmock get after the first call:
class Requester():
    def get(self, *args):
        ...

def mock_get(requester, response):
    orig_get = requester.get

    def return_text_and_unmock(self, *args, **kwargs):
        # Restore the real get after this first, mocked call.
        self.get = orig_get
        return response

    # Bind the replacement as a method on this one instance only.
    requester.get = return_text_and_unmock.__get__(requester, Requester)
    return requester
I believe using a local server for unit testing is not a good idea, as this is not really a unit test. If you're using requests, one good way to mock the requests is to use the responses module, developed and maintained by Dropbox. With responses you can mock each request you make by specifying that certain content should be returned when a request is issued to a given URL. The README gives a quick overview of the module's abilities.
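A small sketch of the responses approach; the URLs and JSON payloads are made up for illustration:

import requests
import responses

@responses.activate
def test_login_then_fetch():
    responses.add(responses.POST, "https://example.com/login",
                  json={"token": "abc"}, status=200)
    responses.add(responses.GET, "https://example.com/data",
                  json={"items": [1, 2, 3]}, status=200)

    # Code under test can make several different requests; each URL gets its
    # own canned answer, unlike a single monkeypatched requests.get.
    token = requests.post("https://example.com/login").json()["token"]
    data = requests.get("https://example.com/data").json()

    assert token == "abc"
    assert data["items"] == [1, 2, 3]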

Can you hook pyramid into twisted, skipping the wsgi part?

Given that :
WSGI doesn't play very well with async.
Twisted ergonomics suck.
Pyramid is very clean and component oriented.
How could I use Pyramid and Twisted ?
I can imagine making a Twisted protocol to get the raw HTML request. But then I can't see how to parse it into a Pyramid request object. All documented Pyramid tools seem to expect some WSGI interface at some point.
I could use Waitress code to parse the request and turn it into a WSGI env, then pass the env to Pyramid, but that's a lot of work with many issues down the road that I'm sure I can't even imagine.
I know Twisted includes a WSGI server, but it implies synchronicity in the app code, which does not serve my purpose. I want to be able to use the request and response objects, renderers, routers, and other Pyramid tools in a Twisted asynchronous protocol, with asynchronous, non-blocking app code as well. Hence I don't want to use WSGI.
The Twisted API is verbose, heavy, and unintuitive compared to any other asynchronous toolkit you'll find in Python or even other languages. Hence the criticism of its ergonomics. I can use it, but training newcomers in my team to do it has a high cost. I wish to lower it.
Indeed, it packs a lot of power that I want to use.
To elaborate on my needs: I'm building a tool using crossbar.io and cyclone to have a WAMP/HTTP framework a bit friendlier to my team than the current tools. But cyclone is not as complete as Pyramid, and I was hoping Pyramid's components were decoupled enough that the WSGI paradigm was not enforced, so I could leverage the tremendous work they did on it. All I need is an entry point: somewhere to get the HTML and parse it into a request object, and somewhere to take a response object and return HTML to the client. I'd rather not have to write a protocol manually for this; HTTP is tricky and I'm sure I'll get it wrong in many ways.
One precision: I don't wish to use the full Pyramid framework, just some components here and there, such as routing, cookie parsing, CSRF protection, etc. I won't use their view system, for it assumes a synchronous API.
Looking at Pyramid, I can see that it expects the entire request to be parsed and turned into a request object. It also returns the response as an object. So part of the problem, to hook Twisted and Pyramid together, is to:
get the HTTP request text as one big chunk from Twisted;
parse it into the request object somehow (I couldn't find a simple function to do this, but if I can turn it into a WSGI environ + request object, Pyramid can convert it to its format; see the sketch below);
get the Pyramid response object and turn it into a generator of strings (an adaptor can be found, since that's what WSGI does);
send the response back with Twisted from this generator of strings.
An alternative could be to use something simpler than Pyramid, like Werkzeug, for the glue.
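As a rough illustration of step 2 (hedged: only the minimal environ keys, no body handling), Pyramid's request class is WebOb-based and can be built straight from a WSGI-style environ dict, however that environ was produced:

from io import BytesIO
from pyramid.request import Request

environ = {
    "REQUEST_METHOD": "GET",
    "PATH_INFO": "/hello",
    "QUERY_STRING": "name=world",
    "SERVER_NAME": "localhost",
    "SERVER_PORT": "8080",
    "wsgi.url_scheme": "http",
    "wsgi.input": BytesIO(b""),
}
request = Request(environ)
print(request.path, request.GET.get("name"))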
Twisted Web lets you interpret HTTP request bodies (regardless of content-type, HTML or otherwise) incrementally as they're received - but it doesn't make doing so very easy. There's a very old ticket that we never seem to make much progress on for improving this situation. Until it's resolved, there probably isn't a better answer than the one I'm about to give. This incremental HTTP request body delivery, I think, is what you're looking for here (because you said you expect requests to "be a big HTML chunk").
The hook for incremental request body handling is Request.handleContentChunk. You can see a complete demonstration of its use in my answer to Python server for streaming request body content.
This gives you the data as it arrives at the server. If you want to use Pyramid, you'll have to construct a Pyramid request that uses this data. Most of the initialization of the Pyramid request object should be straightforward (e.g. filling the environ dictionary with the request headers, which you can take from Request.requestHeaders). The slightly trickier part will be initializing the Pyramid request object's body, which is supposed to be a file-like object that provides synchronous access to the request body.
On the one hand, if you dispatch the request before the request body has been completely received then you avoid the cost of buffering the entire request body in memory. On the other hand, if you let application code begin to read the request body then you have to deal with the circumstance that it tries to read beyond the point in the data which has actually arrived at the server. This can be dealt with. The body file-like object is expected to present a blocking interface. All you have to do is block until the data is available.
Here's a brief (incomplete, not meant to actually work) sketch of what I mean:
# Note: Body.read() blocks the calling thread until data arrives, so the
# application code reading it must not run in the reactor thread.
from Queue import Queue

class Body(object):
    def __init__(self):
        self._buffer = Queue()
        self._pending = b""
        self._eof = False

    def read(self, how_many):
        if self._eof:
            return b""
        if self._pending == b"":
            data = self._buffer.get()
            if data is None:
                # A None sentinel marks the end of the request body.
                self._eof = True
                return b""
            else:
                self._pending = data
        result = self._pending[:how_many]
        self._pending = self._pending[how_many:]
        return result

    def _add_data(self, data):
        self._buffer.put(data)
You can create an instance of this type, initialize the Pyramid request object's body attribute with it, and then call _add_data on it in the Twisted Request class's handleContentChunk callback.
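Here is a hedged sketch of that wiring (simplified; it skips form-argument handling, and the StreamingRequest name is made up for this example):

from twisted.web.server import Site, Request

class StreamingRequest(Request):
    def __init__(self, *args, **kwargs):
        Request.__init__(self, *args, **kwargs)
        self.body_stream = Body()  # the Body sketch above

    def handleContentChunk(self, data):
        # Push each chunk into the file-like body as it arrives from the
        # network, instead of letting Request buffer the whole thing.
        self.body_stream._add_data(data)

    def requestReceived(self, command, path, version):
        # A None sentinel tells Body.read() that the body is complete.
        self.body_stream._add_data(None)
        Request.requestReceived(self, command, path, version)

# Hooking it up (root_resource is whatever resource you serve):
# site = Site(root_resource)
# site.requestFactory = StreamingRequest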
You could also implement this as an enhancement to Twisted's own WSGI server. For the sake of simplicity, Twisted's WSGI server does read the entire request body before dispatching the request to the WSGI application - but it doesn't have to. If this is the only problem with WSGI then it'd be better to improve the quality of the WSGI implementation and keep the interface rather than both implementing the improvement and stepping outside of the interface (tying you more closely to both Twisted and Pyramid - unnecessarily).
The second half of the problem, generating response bodies incrementally, shouldn't really be a problem. Twisted's WSGI container will write out response data as the WSGI application object yields it. Or if you use twisted.web.resource instead of the WSGI interface, you can call request.write as many times as you like, at any time you like (up until you call request.finish). The only trick is that if you want to do this you must return NOT_DONE_YET from the render method.
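For instance, a small twisted.web.resource sketch of that pattern (the timings are arbitrary):

from twisted.internet import reactor
from twisted.web.resource import Resource
from twisted.web.server import NOT_DONE_YET, Site

class SlowPage(Resource):
    isLeaf = True

    def render_GET(self, request):
        request.write(b"first chunk\n")
        # Produce more data later without blocking the reactor.
        reactor.callLater(1.0, request.write, b"second chunk\n")
        reactor.callLater(2.0, request.finish)
        return NOT_DONE_YET

if __name__ == "__main__":
    reactor.listenTCP(8080, Site(SlowPage()))
    reactor.run()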

python facebook sdk call to facebook is slow when compared to command line curl

I am using the Facebook Python Graph API. When I call put_object to write to the news feed, it takes about 12-14 seconds to complete the call. When I run the same request from the command line using curl with the same parameters, I get the response back in 1.2 seconds.
I ran the profiler on the Python code and I see that it is spending 99.5% of the time in socket.recv. I am not sure if the problem is with the Facebook Python SDK or something else.
I am on Python 2.6. I see from facebook.py that it is using urllib:
file = urllib.urlopen("https://graph.facebook.com/" + path + "?" +
urllib.urlencode(args), post_data)
Has someone experienced a similar slowdown? Any suggestions will be highly appreciated.
Direct command-line curl is bound to be faster than urllib or urllib2. If you want speed, you could replace the call with pycurl (which is a C extension), whereas urllib is a Python module written on top of httplib.
What's more, if you're flexible enough to use a Tornado server, you could use Tornado's asynchronous HTTP client, which talks to sockets directly (see the sketch below).
Also, if none of these can be done, try replacing urllib with urllib2 and creating a non-blocking caller with callback returns. This is all that I've done to improve the native third-party wrappers for Facebook/Twitter/Amazon, etc.
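A hedged sketch of the Tornado route, bypassing the SDK's urllib call; the coroutine name and argument handling are assumptions, not part of the SDK:

from urllib import urlencode  # Python 2, as in the question

from tornado import gen
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
def graph_post(path, args, post_data):
    # Non-blocking POST to the Graph API; other handlers keep running while
    # this request is in flight.
    client = AsyncHTTPClient()
    url = "https://graph.facebook.com/" + path + "?" + urlencode(args)
    response = yield client.fetch(url, method="POST", body=post_data)
    raise gen.Return(response.body)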
Are you behind an HTTP proxy server? curl honors proxy server environment variables, while urllib doesn't do so by default and also doesn't support calling an HTTPS URL (such as https://graph.facebook.com) over a proxy server.
In any event I expect it's more likely a network issue than a Python vs C issue. Yes C is faster, but this isn't a CPU-bound task, and there's no way that you're burning 12-14 seconds inside the Python interpreter to make this call.
If curl is happy but urllib is not, perhaps trying pycurl will solve your problem. http://pycurl.sourceforge.net/
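A hedged pycurl sketch of the same POST (Python 2-era, matching the question); path, args, and post_data stand in for the SDK's values:

import pycurl
import urllib
from StringIO import StringIO

def graph_post(path, args, post_data):
    buf = StringIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "https://graph.facebook.com/" + path + "?" + urllib.urlencode(args))
    c.setopt(pycurl.POSTFIELDS, post_data)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()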
