Is there an easy way to cache things when using urllib2 that I am over-looking, or do I have to roll my own?
If you don't mind working at a slightly lower level, httplib2 (https://github.com/httplib2/httplib2) is an excellent HTTP library that includes caching functionality.
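For example, here is a minimal sketch of httplib2's built-in caching (the ".cache" directory name here is arbitrary):

import httplib2

# Passing a directory name enables on-disk caching; responses are stored there
# and revalidated according to the HTTP caching headers the server sends.
h = httplib2.Http(".cache")
response, content = h.request("http://example.com/", "GET")
response, content = h.request("http://example.com/", "GET")
print(response.fromcache)  # True when the repeat request was served from the cache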
You could use a decorator function such as:
class cache(object):
    def __init__(self, fun):
        self.fun = fun
        self.cache = {}

    def __call__(self, *args, **kwargs):
        key = str(args) + str(kwargs)
        try:
            return self.cache[key]
        except KeyError:
            self.cache[key] = rval = self.fun(*args, **kwargs)
            return rval
        except TypeError:  # in case key isn't a valid key - don't cache
            return self.fun(*args, **kwargs)
and define a function along the lines of:
@cache
def get_url_src(url):
    return urllib.urlopen(url).read()
This assumes you're not paying attention to HTTP cache-control headers, but just want to cache the page for the duration of the application.
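As a quick usage sketch (the URL is only a placeholder), repeated calls with the same argument are then served from the in-memory cache:

src = get_url_src("http://example.com/")   # first call fetches over the network
src = get_url_src("http://example.com/")   # second call returns the cached string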
This ActiveState Python recipe might be helpful:
http://code.activestate.com/recipes/491261/
I've always been torn between using httplib2, which does a solid job of handling HTTP caching and authentication, and urllib2, which is in the stdlib, has an extensible interface, and supports HTTP Proxy servers.
The ActiveState recipe starts to add caching support to urllib2, but only in a very primitive fashion. It fails to allow for extensibility in storage mechanisms, hard-coding the file-system-backed storage. It also does not honor HTTP cache headers.
In an attempt to bring together the best features of httplib2 caching and urllib2 extensibility, I've adapted the ActiveState recipe to implement most of the same caching functionality as is found in httplib2. The module is in jaraco.net as jaraco.net.http.caching. The link points to the module as it exists at the time of this writing. While that module is currently part of the larger jaraco.net package, it has no intra-package dependencies, so feel free to pull the module out and use it in your own projects.
Alternatively, if you have Python 2.6 or later, you can easy_install jaraco.net>=1.3 and then utilize the CachingHandler with something like the code in caching.quick_test().
"""Quick test/example of CacheHandler"""
import logging
import urllib2
from httplib2 import FileCache
from jaraco.net.http.caching import CacheHandler
logging.basicConfig(level=logging.DEBUG)
store = FileCache(".cache")
opener = urllib2.build_opener(CacheHandler(store))
urllib2.install_opener(opener)
response = opener.open("http://www.google.com/")
print response.headers
print "Response:", response.read()[:100], '...\n'
response.reload(store)
print response.headers
print "After reload:", response.read()[:100], '...\n'
Note that jaraco.net.http.caching does not mandate a particular backing store for the cache, but instead follows the interface used by httplib2. For this reason, the httplib2.FileCache can be used directly with urllib2 and the CacheHandler. Also, other backing caches designed for httplib2 should be usable by the CacheHandler.
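For illustration, a store only needs get/set/delete methods keyed by a string; a hypothetical in-memory cache along these lines could be passed to CacheHandler in place of FileCache:

class MemoryCache(object):
    """Minimal in-memory store following the httplib2 cache interface."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)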
I was looking for something similar, and came across "Recipe 491261: Caching and throttling for urllib2", which danivo posted. The problem is that I really dislike the caching code: lots of duplication, lots of manual joining of file paths instead of using os.path.join, use of staticmethods, not very PEP 8-ish, and other things that I try to avoid.
My version is a bit nicer (in my opinion, anyway) and functionally much the same, with a few additions - mainly the "recache" method (example usage can be seen here, or in the if __name__ == "__main__": section at the end of the code).
The latest version can be found at http://github.com/dbr/tvdb_api/blob/master/cache.py, and I'll paste it here for posterity (with my application specific headers removed):
#!/usr/bin/env python
"""
urllib2 caching handler
Modified from http://code.activestate.com/recipes/491261/ by dbr
"""
import os
import time
import httplib
import urllib2
import StringIO
from hashlib import md5

def calculate_cache_path(cache_location, url):
    """Returns the paths of [cache_location]/[hash_of_url].headers and .body
    """
    thumb = md5(url).hexdigest()
    header = os.path.join(cache_location, thumb + ".headers")
    body = os.path.join(cache_location, thumb + ".body")
    return header, body

def check_cache_time(path, max_age):
    """Checks if a file has been created/modified in the last [max_age] seconds.
    False means the file is too old (or doesn't exist), True means it is
    up-to-date and valid"""
    if not os.path.isfile(path):
        return False
    cache_modified_time = os.stat(path).st_mtime
    time_now = time.time()
    if cache_modified_time < time_now - max_age:
        # Cache is old
        return False
    else:
        return True

def exists_in_cache(cache_location, url, max_age):
    """Returns if header AND body cache file exist (and are up-to-date)"""
    hpath, bpath = calculate_cache_path(cache_location, url)
    if os.path.exists(hpath) and os.path.exists(bpath):
        return (
            check_cache_time(hpath, max_age)
            and check_cache_time(bpath, max_age)
        )
    else:
        # File does not exist
        return False

def store_in_cache(cache_location, url, response):
    """Tries to store response in cache."""
    hpath, bpath = calculate_cache_path(cache_location, url)
    try:
        outf = open(hpath, "w")
        headers = str(response.info())
        outf.write(headers)
        outf.close()

        outf = open(bpath, "w")
        outf.write(response.read())
        outf.close()
    except IOError:
        return True
    else:
        return False
class CacheHandler(urllib2.BaseHandler):
    """Stores responses in a persistent on-disk cache.

    If a subsequent GET request is made for the same URL, the stored
    response is returned, saving time, resources and bandwidth
    """
    def __init__(self, cache_location, max_age=21600):
        """The location of the cache directory"""
        self.max_age = max_age
        self.cache_location = cache_location
        if not os.path.exists(self.cache_location):
            os.mkdir(self.cache_location)

    def default_open(self, request):
        """Handles GET requests, if the response is cached it returns it
        """
        if request.get_method() != "GET":
            return None  # let the next handler try to handle the request

        if exists_in_cache(
            self.cache_location, request.get_full_url(), self.max_age
        ):
            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header=True
            )
        else:
            return None

    def http_response(self, request, response):
        """Gets a HTTP response, if it was a GET request and the status code
        starts with 2 (200 OK etc) it caches it and returns a CachedResponse
        """
        if (request.get_method() == "GET"
                and str(response.code).startswith("2")):
            if 'x-local-cache' not in response.info():
                # Response is not cached
                set_cache_header = store_in_cache(
                    self.cache_location,
                    request.get_full_url(),
                    response
                )
            else:
                set_cache_header = True
            # end if x-local-cache in response

            return CachedResponse(
                self.cache_location,
                request.get_full_url(),
                set_cache_header=set_cache_header
            )
        else:
            return response
class CachedResponse(StringIO.StringIO):
    """An urllib2.response-like object for cached responses.

    To determine if a response is cached or coming directly from
    the network, check the x-local-cache header rather than the object type.
    """
    def __init__(self, cache_location, url, set_cache_header=True):
        self.cache_location = cache_location
        hpath, bpath = calculate_cache_path(cache_location, url)

        StringIO.StringIO.__init__(self, file(bpath).read())

        self.url = url
        self.code = 200
        self.msg = "OK"
        headerbuf = file(hpath).read()
        if set_cache_header:
            headerbuf += "x-local-cache: %s\r\n" % (bpath)
        self.headers = httplib.HTTPMessage(StringIO.StringIO(headerbuf))

    def info(self):
        """Returns headers
        """
        return self.headers

    def geturl(self):
        """Returns original URL
        """
        return self.url

    def recache(self):
        new_request = urllib2.urlopen(self.url)
        set_cache_header = store_in_cache(
            self.cache_location,
            new_request.url,
            new_request
        )
        CachedResponse.__init__(self, self.cache_location, self.url, True)
if __name__ == "__main__":
    def main():
        """Quick test/example of CacheHandler"""
        opener = urllib2.build_opener(CacheHandler("/tmp/"))
        response = opener.open("http://google.com")

        print response.headers
        print "Response:", response.read()

        response.recache()

        print response.headers
        print "After recache:", response.read()

    main()
This article on Yahoo Developer Network - http://developer.yahoo.com/python/python-caching.html - describes how to cache http calls made through urllib to either memory or disk.
@dbr: you may also need to add caching of HTTPS responses, with:
    def https_response(self, request, response):
        return self.http_response(request, response)
I would like to run the code from the module requests_with_caching.py, which uses the built-in requests module from Python. I have two .py files in the same folder (requests_with_caching.py + test.py). The requests library is installed.
I get an AttributeError: module 'requests' has no attribute 'requestURL'.
I don't get what I'm missing.
import requests
import json

PERMANENT_CACHE_FNAME = "permanent_cache.txt"
TEMP_CACHE_FNAME = "this_page_cache.txt"

def _write_to_file(cache, fname):
    with open(fname, 'w') as outfile:
        outfile.write(json.dumps(cache, indent=2))

def _read_from_file(fname):
    try:
        with open(fname, 'r') as infile:
            res = infile.read()
            return json.loads(res)
    except:
        return {}

def add_to_cache(cache_file, cache_key, cache_value):
    temp_cache = _read_from_file(cache_file)
    temp_cache[cache_key] = cache_value
    _write_to_file(temp_cache, cache_file)

def clear_cache(cache_file=TEMP_CACHE_FNAME):
    _write_to_file({}, cache_file)

def make_cache_key(baseurl, params_d, private_keys=["api_key"]):
    """Makes a long string representing the query.
    Alphabetize the keys from the params dictionary so we get the same order each time.
    Omit keys with private info."""
    alphabetized_keys = sorted(params_d.keys())
    res = []
    for k in alphabetized_keys:
        if k not in private_keys:
            res.append("{}-{}".format(k, params_d[k]))
    return baseurl + "_".join(res)

def get(baseurl, params={}, private_keys_to_ignore=["api_key"], permanent_cache_file=PERMANENT_CACHE_FNAME, temp_cache_file=TEMP_CACHE_FNAME):
    full_url = requests.requestURL(baseurl, params)
    cache_key = make_cache_key(baseurl, params, private_keys_to_ignore)
    # Load the permanent and page-specific caches from files
    permanent_cache = _read_from_file(permanent_cache_file)
    temp_cache = _read_from_file(temp_cache_file)
    if cache_key in temp_cache:
        print("found in temp_cache")
        # make a Response object containing text from the cache, and the full_url that would have been fetched
        return requests.Response(temp_cache[cache_key], full_url)
    elif cache_key in permanent_cache:
        print("found in permanent_cache")
        # make a Response object containing text from the cache, and the full_url that would have been fetched
        return requests.Response(permanent_cache[cache_key], full_url)
    else:
        print("new; adding to cache")
        # actually request it
        resp = requests.get(baseurl, params)
        # save it
        add_to_cache(temp_cache_file, cache_key, resp.text)
        return resp
import requests_with_caching
# it's not found in the permanent cache
res = requests_with_caching.get("https://api.datamuse.com/words?rel_rhy=happy", permanent_cache_file="datamuse_cache.txt")
print(res.text[:100])
# this time it will be found in the temporary cache
res = requests_with_caching.get("https://api.datamuse.com/words?rel_rhy=happy", permanent_cache_file="datamuse_cache.txt")
# This one is in the permanent cache.
res = requests_with_caching.get("https://api.datamuse.com/words?rel_rhy=funny", permanent_cache_file="datamuse_cache.txt")
The module requests_with_caching.py was written for Runestone by the University of Michigan in their Coursera course Data Collection and Processing with Python.
This module imports the requests module, and seems to use a special requests method called requestURL().
The thing is, the requests.requestURL() method used in the requests_with_caching module is particular to Runestone.
In fact, the entire requests module was rewritten for Runestone, because Runestone can't make real API requests.
Take a look at Runestone's version of requests, found in its src/lib folder: you'll notice it's different from the requests module that ended up in your Python environment's site-packages folder when you ran pip install requests.
You can view Runestone's rewritten requests module by running this in Runestone:
with open('src/lib/requests.py', 'r') as f:
    module = f.read()
print(module)
I'd suggest looking at how the Runestone requests.requestURL() function was written and modifying your copy of the requests_with_caching.py module to add this custom function.
There will be other changes to make as well to get requests_with_caching.py working in your local Python environment.
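For example, a minimal stand-in for the URL-building part (my own sketch, not Runestone's actual implementation) could use the real requests library to prepare the request without sending it:

import requests

def build_request_url(baseurl, params_d):
    # Hypothetical replacement for Runestone's requests.requestURL():
    # prepare a GET request just to obtain the full URL it would fetch.
    req = requests.Request(method="GET", url=baseurl, params=params_d)
    return req.prepare().url

You would then call build_request_url(baseurl, params) where requests_with_caching.py currently calls requests.requestURL(baseurl, params). The requests.Response(text, url) constructor calls in the cached branches are also Runestone-specific and need a similar replacement.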
I am working with an API, parsing different pieces of information from its response, and using the parsed information in different functions / files.
The issue is that quickInfo() ends up being called multiple times, creating multiple API requests, which is unwanted because there is a rate limit and because the API response is very large, so it causes performance issues.
I am trying to find a way to call the API once and then be able to use the content of the response in different situations.
I could make "response" a global variable, but I read that that is bad practice and could cause memory leaks.
Simplified code is as follows:
FILE 1
def quickInfo(name):
    response = requests.get('[website]/product/{}?token=No'.format(name), headers=headers, verify=False).json()
    return response

def parsing(name):
    r = quickInfo(name)
    name = "{}".format(r["product"]["name"])
    buyprice_raw = [i["buyprice"] for i in r["avgHistory"]]
    buy_orders = "{:,}".format(r["product"]["buyorders"])
    sell_orders = "{:,}".format(r["product"]["sellorders"])
    return name, buyprice_raw, buy_orders, sell_orders

def charting(name):
    buyprice, sellprice = parsing(name)
    #code continues
FILE 2
name, price = parsing(name)
print(name +"'s price is of " + price) #sample use
#code continues
Thanks for your help!
The absolute easiest way would be to decorate quickInfo with the @functools.lru_cache() decorator, but you'll just have to be aware that it will only ever do a real request once per name (unless you clear the cache):
import functools

@functools.lru_cache()
def quickInfo(name):
    response = requests.get(
        "[website]/product/{}?token=No".format(name),
        headers=headers,
        verify=False,
    )
    response.raise_for_status()
    return response.json()
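If you do need to force a refresh, functions wrapped with functools.lru_cache expose a cache_clear() method:

quickInfo.cache_clear()  # forget all cached results; the next call makes a real request again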
I'm using the Python suds library to make a SOAP client based on a local WSDL file. My goal is to use Twisted as the backend so I can query the SOAP servers asynchronously.
I know this topic has been covered several times (here1, here2), but I still have some questions.
I've seen three different approaches to use twisted with suds:
a) Applying this patch to the suds library.
b) Use twisted-suds, which is a fork of suds.
c) Influenced by this post, I implemented a Client_Async suds client using Twisted's deferToThread operation (a fully working gist can be found here; I also implemented a Client_Sync suds client to do some benchmarks).
# Init approach c) code
import logging

from suds.client import Client as SudsClient
from twisted.internet.threads import deferToThread

class MyClient(SudsClient):
    def handleFailure(self, f, key, stats):
        stats.stop_stamp(error=True)
        logging.error("%s. Failure: %s" % (key, str(f)))

    def handleResult(self, result, key, stats):
        stats.stop_stamp(error=False)
        success, text, res = False, None, None
        try:
            success = result.MessageResult.MessageResultCode == 200
            text = result.MessageResult.MessageResultText
            res = result.FooBar
        except Exception, err:
            pass
        logging.debug('%40s : %5s %10s \"%40s\"' % (key, success, text, res))
        logging.debug('%40s : %s' % (key, self.last_sent()))
        logging.debug('%40s : %s' % (key, self.last_received()))

def call(stats, method, service, key, *a, **kw):
    stats.start_stamp()
    logging.debug('%40s : calling!' % (key))
    result = service.__getattr__(method)(*a, **kw)
    return result

class Client_Async(MyClient):
    """Twisted based async client"""
    def callRemote(self, stats, method, key, *args, **kwargs):
        logging.debug('%s. deferring to thread...' % key)
        d = deferToThread(call, stats, method, self.service, key, *args, **kwargs)
        d.addCallback(self.handleResult, key, stats)
        d.addErrback(self.handleFailure, key, stats)
        return d

class Client_Sync(MyClient):
    def callRemote(self, stats, method, key, *args, **kwargs):
        result = None
        try:
            result = call(stats, method, self.service, key, *args, **kwargs)
        except Exception, err:
            self.handleFailure(err, key, stats)
        else:
            self.handleResult(result, key, stats)

# End approach c) code
Doing a small benchmark using the c) approach highlights the benefits of the async model:
-- Sync model using Client_Sync of approach c).
# python soap_suds_client.py -t 200 --sync
Total requests:800/800. Success:794 Errors:6
Seconds elapsed:482.0
Threads used:1
-- Async model using Client_Async of approach c).
# python soap_suds_client.py -t 200
Total requests:800/800. Success:790 Errors:10
Seconds elapsed:53.0
Threads used:11
I haven't tested approaches a) or b). My question is:
What am I really gaining from them, apart from the use of just one thread?
I'm using suds in my projects. I didn't have to apply any patches or use twisted-suds.
I'm using the 0.4.1-2 version of the python-suds package (on Ubuntu), and it comes with a very useful nosend option.
# This parses the wsdl file. The autoblend option you'd probably skip;
# it's needed when name spaces are not strictly preserved (case for Echo Sign).
from suds import client
self._suds = client.Client('file://' + config.wsdl_path, nosend=True,
                           autoblend=True)
....
# Create a context for the call, example sendDocument() call. This doesn't yet
# send anything, only creates an object with the request and capable of parsing
# the response
context = self._suds.service.sendDocument(apiKey=....)

# Actually send the request. Use any web client you want. I actually use
# something more sophisticated, but below I put the example using the
# standard twisted web client.
from twisted.web import client
d = client.getPage(url=context.client.location(),
                   postdata=str(context.envelope),
                   method='POST',
                   headers=context.client.headers())

# The callback() of the above Deferred is fired with the body of the
# http response. I parse it using the context object.
d.addCallback(context.succeeded)

# Now in the callback you have the actual python object defined in
# your WSDL file. You can print it...
from pprint import pprint
d.addCallback(pprint)

# If the response is a failure, your Deferred would be errbacked with
# the suds.WebFault exception.
I've written a custom Django file upload handler for my current project. It's a proof of concept which computes a hash of an uploaded file without storing that file on disk. If I can get it to work, I can get on to the real purpose of my work.
Essentially, here's what I have so far, which is working fine with one major exception:
from django.core.files.uploadhandler import *
from hashlib import sha256

from myproject.upload.files import MyProjectUploadedFile

class MyProjectUploadHandler(FileUploadHandler):
    def __init__(self, *args, **kwargs):
        super(MyProjectUploadHandler, self).__init__(*args, **kwargs)

    def handle_raw_input(self, input_data, META, content_length, boundary,
                         encoding=None):
        self.activated = True

    def new_file(self, *args, **kwargs):
        super(MyProjectUploadHandler, self).new_file(*args, **kwargs)
        self.digester = sha256()
        raise StopFutureHandlers()

    def receive_data_chunk(self, raw_data, start):
        self.digester.update(raw_data)

    def file_complete(self, file_size):
        return MyProjectUploadedFile(self.digester.hexdigest())
The custom upload handler works great: the hash is accurate, nothing is written to disk, and only 64 KB of memory is used at any one time.
The only problem I'm having is that I need to access another field from the POST request before processing the file, a text salt input by the user. My form looks like this:
<form id="myForm" method="POST" enctype="multipart/form-data" action="/upload/">
    <fieldset>
        <input name="salt" type="text" placeholder="Salt">
        <input name="uploadfile" type="file">
        <input type="submit">
    </fieldset>
</form>
The "salt" POST variable is only made available to me after the request has been processed and the file has been uploaded, which doesn't work for my use case. I can't seem to find a way to access this variable in any way, shape, or form in my upload handler.
Is there a way for me to access each multipart variable as it comes across instead of just accessing the filess which are uploaded?
My solution didn't come easy, but here it is:
import base64

# The helpers used below live in Django's upload/multipart machinery; the exact
# import paths may vary slightly between Django versions.
from django.core.files.uploadhandler import (
    FileUploadHandler, SkipFile, StopUpload)
from django.http.multipartparser import (
    ChunkIter, LazyStream, Parser, FIELD, FILE, exhaust, MultiPartParserError)
from django.utils.datastructures import MultiValueDict
from django.utils.encoding import force_text
from django.utils.text import unescape_entities

class IntelligentUploadHandler(FileUploadHandler):
    """
    An upload handler which overrides the default multipart parser to allow
    simultaneous parsing of fields and files... intelligently. Subclass this
    for real and true awesomeness.
    """
    def __init__(self, *args, **kwargs):
        super(IntelligentUploadHandler, self).__init__(*args, **kwargs)

    def field_parsed(self, field_name, field_value):
        """
        A callback method triggered when a non-file field has been parsed
        successfully by the parser. Use this to listen for new fields being
        parsed.
        """
        pass

    def handle_raw_input(self, input_data, META, content_length, boundary,
                         encoding=None):
        """
        Parse the raw input from the HTTP request and split items into fields
        and files, executing callback methods as necessary.

        Shamelessly adapted and borrowed from django.http.multipartparser.MultiPartParser.
        """
        # following suit from the source class, this is imported here to avoid
        # a potential circular import
        from django.http import QueryDict

        # create return values
        self.POST = QueryDict('', mutable=True)
        self.FILES = MultiValueDict()

        # initialize the parser and stream
        stream = LazyStream(ChunkIter(input_data, self.chunk_size))

        # whether or not to signal a file-completion at the beginning of the loop.
        old_field_name = None
        counter = 0

        try:
            for item_type, meta_data, field_stream in Parser(stream, boundary):
                if old_field_name:
                    # we run this test at the beginning of the next loop since
                    # we cannot be sure a file is complete until we hit the next
                    # boundary/part of the multipart content.
                    file_obj = self.file_complete(counter)

                    if file_obj:
                        # if we return a file object, add it to the files dict
                        self.FILES.appendlist(force_text(old_field_name, encoding,
                                                         errors='replace'), file_obj)

                    # wipe it out to prevent havoc
                    old_field_name = None

                try:
                    disposition = meta_data['content-disposition'][1]
                    field_name = disposition['name'].strip()
                except (KeyError, IndexError, AttributeError):
                    continue

                transfer_encoding = meta_data.get('content-transfer-encoding')

                if transfer_encoding is not None:
                    transfer_encoding = transfer_encoding[0].strip()

                field_name = force_text(field_name, encoding, errors='replace')

                if item_type == FIELD:
                    # this is a POST field
                    if transfer_encoding == "base64":
                        raw_data = field_stream.read()
                        try:
                            data = str(raw_data).decode('base64')
                        except:
                            data = raw_data
                    else:
                        data = field_stream.read()

                    self.POST.appendlist(field_name, force_text(data, encoding,
                                                                errors='replace'))

                    # trigger listener
                    self.field_parsed(field_name, self.POST.get(field_name))
                elif item_type == FILE:
                    # this is a file
                    file_name = disposition.get('filename')

                    if not file_name:
                        continue

                    # transform the file name
                    file_name = force_text(file_name, encoding, errors='replace')
                    file_name = self.IE_sanitize(unescape_entities(file_name))

                    content_type = meta_data.get('content-type', ('',))[0].strip()

                    try:
                        charset = meta_data.get('content-type', (0, {}))[1].get('charset', None)
                    except:
                        charset = None

                    try:
                        file_content_length = int(meta_data.get('content-length')[0])
                    except (IndexError, TypeError, ValueError):
                        file_content_length = None

                    counter = 0

                    # now, do the important file stuff
                    try:
                        # alert on the new file
                        self.new_file(field_name, file_name, content_type,
                                      file_content_length, charset)

                        # chubber-chunk it
                        for chunk in field_stream:
                            if transfer_encoding == "base64":
                                # base 64 decode it if need be
                                over_bytes = len(chunk) % 4

                                if over_bytes:
                                    over_chunk = field_stream.read(4 - over_bytes)
                                    chunk += over_chunk

                                try:
                                    chunk = base64.b64decode(chunk)
                                except Exception as e:
                                    # since this is only a chunk, any error is an unfixable error
                                    raise MultiPartParserError("Could not decode base64 data: %r" % e)

                            chunk_length = len(chunk)
                            self.receive_data_chunk(chunk, counter)
                            counter += chunk_length

                        # ... and we're done
                    except SkipFile:
                        # just eat the rest
                        exhaust(field_stream)
                    else:
                        # handle file upload completions on next iteration
                        old_field_name = field_name
        except StopUpload as e:
            # if we get a request to stop the upload, exhaust it if no con reset
            if not e.connection_reset:
                exhaust(input_data)
        else:
            # make sure that the request data is all fed
            exhaust(input_data)

        # signal the upload has been completed
        self.upload_complete()

        return self.POST, self.FILES

    def IE_sanitize(self, filename):
        """Cleanup filename from Internet Explorer full paths."""
        return filename and filename[filename.rfind("\\")+1:].strip()
Essentially, by subclassing this class, you can have a more... intelligent upload handler. Fields are announced to subclasses via the field_parsed method, as I needed for my purposes.
I've reported this as a feature request to the Django team, hopefully this functionality becomes a part of the regular toolbox in Django, rather than monkey-patching the source code as done above.
Based on the code for FileUploadHandler, found here at line 62:
https://github.com/django/django/blob/master/django/core/files/uploadhandler.py
It looks like the request object is passed into the handler and stored as self.request.
In that case, you should be able to access the salt at any point in your upload handler by doing:
salt = self.request.POST.get('salt')
Unless I'm misunderstanding your question.
import urllib, urllib2, json

def make_request(method, base, path, params):
    if method == 'GET':
        return json.loads(urllib2.urlopen(base+path+"?"+urllib.urlencode(params)).read())
    elif method == 'POST':
        return json.loads(urllib2.urlopen(base+path, urllib.urlencode(params)).read())

api_key = "5f1d5cb35cac44d3b"
print make_request("GET", "https://indit.ca/api/", "v1/version", {"api_key": api_key})
This code should return the version and status, like {status: 'ok', version: '1.1.0'}.
What code do I need to add to print that response?
It's hard to tell what the problem is without a complete, otherwise-working example (I can't even resolve host indit.ca), but I can explain how you can debug this yourself. Break it down step by step:
import urllib, urllib2, json

def make_request(method, base, path, params):
    if method == 'GET':
        url = base+path+"?"+urllib.urlencode(params)
        print 'url={}'.format(url)
        req = urllib2.urlopen(url)
        print 'req={}'.format(req)
        body = req.read()
        print 'body={}'.format(body)
        obj = json.loads(body)
        print 'obj={}'.format(obj)
        return obj
    elif method == 'POST':
        # You could do the same here, but your test only uses "GET"
        return json.loads(urllib2.urlopen(base+path, urllib.urlencode(params)).read())

api_key = "5f1d5cb35cac44d3b"
print make_request("GET", "https://indit.ca/api/", "v1/version", {"api_key": api_key})
Now you can see where it goes wrong. Is it generating the right URL? (What happens if you paste that URL into a browser address bar, or a wget or curl command line?) Does urlopen return the kind of object you expected? Does the body look right? And so on.
Ideally, this will solve the problem for you. If not, at least you'll have a much more specific question to ask, and are much more likely to get a useful answer.