Storing response as json in Python requests-cache

I'm using requests-cache to cache HTTP responses in a human-readable format.
I've patched requests using the filesystem backend and set the serializer to JSON, like so:
import requests_cache
requests_cache.install_cache('example_cache', backend='filesystem', serializer='json')
The responses do get cached as JSON, but the response's body is encoded (I guess using the cattrs library, as described here).
Is there a way to make requests-cache save responses as-is?

What you want to do makes sense, but it's a bit more complicated than it appears. The response files you see are representations of requests.Response objects. Response._content contains the original bytes received from the server. The wrapper methods and properties like Response.json() and Response.text will then attempt to decode that content. For a Response object to work correctly, it needs to have the original binary response body.
When requests-cache serializes that response as JSON, the binary content is encoded in Base85. That's why you're seeing encoded bytes instead of JSON there. To have everything, including the response body, saved as JSON, there are a couple of options:
Option 1
Make a custom serializer. If you wanted to be able to modify response content and have those changes reflected in responses returned by requests-cache, this would probably be the best way to do it.
This may become a bit convoluted, because you would have to:
Handle response content that isn't valid JSON, and save as encoded bytes instead
During deserialization, if the content was saved as JSON, convert it back into bytes to recreate the original Response object
It's doable, though; a rough sketch follows.
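This is only a sketch, assuming requests-cache's SerializerPipeline, Stage, and CattrStage interfaces and the _content key seen in the serialized files (check the serializers documentation for the exact signatures; the helper functions here are hypothetical):
import json
from requests_cache import CachedSession
from requests_cache.serializers import CattrStage, SerializerPipeline, Stage

def body_to_json(response_dict):
    # Hypothetical helper: if the body is valid JSON, store it decoded;
    # otherwise leave the encoded bytes as-is
    try:
        response_dict['_content'] = json.loads(response_dict['_content'])
    except (ValueError, TypeError):
        pass
    return response_dict

def json_to_body(response_dict):
    # Reverse the transformation so the original Response can be rebuilt
    if isinstance(response_dict.get('_content'), (dict, list)):
        response_dict['_content'] = json.dumps(response_dict['_content'])
    return response_dict

serializer = SerializerPipeline([
    CattrStage(),                                   # Response object <-> dict
    Stage(dumps=body_to_json, loads=json_to_body),  # custom body handling
    Stage(json, dumps='dumps', loads='loads'),      # dict <-> JSON string
])

session = CachedSession('example_cache', backend='filesystem', serializer=serializer)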
Option 2
Make a custom backend. It could extend FileCache and FileDict, and copy valid JSON content to a separate file. Here is a working example:
import json
from os.path import splitext
from requests import Response
from requests_cache import CachedSession, FileCache, FileDict
class JSONFileCache(FileCache):
    """Filesystem backend that copies JSON-formatted response content into a separate file
    alongside the main response file
    """
    def __init__(self, cache_name, **kwargs):
        super().__init__(cache_name, **kwargs)
        self.responses = JSONFileDict(cache_name, **kwargs)


class JSONFileDict(FileDict):
    def __setitem__(self, key: str, value: Response):
        super().__setitem__(key, value)
        response_path = splitext(self._path(key))[0]
        json_path = f'{response_path}_content.json'

        # Will handle errors and skip writing if content can't be decoded as JSON
        with self._try_io(ignore_errors=True):
            content = json.dumps(value.json(), indent=2)
            with open(json_path, mode='w') as f:
                f.write(content)
Usage example:
custom_backend = JSONFileCache('example_cache', serializer='json')
session = CachedSession(backend=custom_backend)
session.get('https://httpbin.org/get')
After making a request, you will see a pair of files like:
example_cache/680f2a52944ee079.json
example_cache/680f2a52944ee079_content.json
That may not be exactly what you want, but it's the easiest option if you only need to read the response content and don't need to modify it.

Related

How to check the file type of an image stored at a URL?

I have a list of urls which look more or less like this:
'https://myurl.com/images/avatars/cb55-f14b-455d1-9ac4w20190416075520341'
I'm trying to validate the image behind each URL, check what image type (png, jpeg, or other) it has, and write the image type back into a new dataframe column, imgType.
My code so far:
import pandas as pd
import requests
df = pd.read_csv('/path/to/allLogo.csv')
urls = df.T.values.tolist()[4]
for url in urls:
    # I'm stuck here... as the content doesn't seem to give me the image type.
    s = requests.get(url, verify=False).content

df["imgType"] =
df.to_csv('mypath/output.csv')
Could someone help me with this? Thanks in advance.
One possibility is to check the response headers for 'Content-Type' - but what headers are sent back depends on the server (without knowing the real URL it's hard to tell):
import requests
url = 'https://s3.amazonaws.com/github/ribbons/forkme_right_darkblue_121621.png'
response = requests.get(url)
# uncomment this to print all response headers:
# print(response.headers)
print(response.headers['Content-Type'])
Prints:
image/png
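Applied to the dataframe from the question, a sketch might look like this (column position and CSV paths assumed from the question; a HEAD request avoids downloading each full image, though not every server supports it):
import pandas as pd
import requests

df = pd.read_csv('/path/to/allLogo.csv')

def get_img_type(url):
    # Ask only for the headers; fall back to an empty string on errors
    try:
        return requests.head(url, verify=False).headers.get('Content-Type', '')
    except requests.RequestException:
        return ''

# The urls were taken from the 5th column (index 4) in the question
df["imgType"] = df.iloc[:, 4].apply(get_img_type)
df.to_csv('mypath/output.csv')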
check what image type (png, jpeg or other)
If you manage to download it, either to disk (a file) or memory (as bytes, the .content of requests' response), then you can use the built-in imghdr module, like so:
import imghdr
imgtype = imghdr.what("path/to/image.png") # testing file on disk
or
import requests
r = requests.get("url_of_image")
imgtype = imghdr.what(None, h=r.content)  # testing bytes in memory; the first (file) argument is required
Keep in mind that imghdr recognizes a limited set of image file formats (see the linked docs), but it should suffice if you are only interested in detecting png vs jpeg vs other.
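If the Content-Type header turns out to be missing or generic, sniffing the downloaded bytes this way is a reasonable fallback, e.g.:
import imghdr
import requests

r = requests.get(url, verify=False)  # url as in the question's loop
# Returns 'png', 'jpeg', etc., or None if the format isn't recognized
imgtype = imghdr.what(None, h=r.content) or 'other'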

Django Views: When is request.data a dict vs a QueryDict?

I have run into some trouble with the issue that request.data is sometimes a dict (especially when testing) and sometimes a QueryDict instance (when using curl).
This is especially a problem because apparently there is a big difference between calling a view using curl like so:
curl -X POST --data "some_float=1.23456789012123123" "http://localhost:8000/myview"
Or using the django_webtest client like so:
class APIViewTest(WebTest):
    def test_testsomething(self):
        self.app.post(url=url, params=json.dumps({'some_float': 1.26356756467}))
And then casting that QueryDict to a dict like so
new_dict = dict(**request.data)
my_float = float(new_dict['some_float'])
Everything works fine in the tests, as there request.data is a dict, but in production the view crashes because new_dict['some_float'] is actually a list with one element, not a float as expected.
I have considered fixing the issue like so:
if type(request.data) is dict:
    new_dict = dict(**request.data)
else:
    new_dict = dict(**request.data.dict())
which feels very wrong as the tests would only test line 2, and (some? all?) production code would run line 4.
So while I am wondering why QueryDict behaves this way, I would rather know why and when request.data is a QueryDict in the first place, and how I can use Django tests to simulate this behavior. Having different conditions for production and testing systems is always troublesome and sometimes unavoidable, but in this case I feel like it could be fixed. Or is this a specific issue related to django_webtest?
Your test isn't a reflection of your actual curl call.
In your test, you post JSON, which is then available as a dict from request.data. But your curl call posts standard form data, which is available as a QueryDict. This behaviour is managed by the parsers attribute of your view or the DEFAULT_PARSER_CLASSES setting - and note that this functionality is provided specifically by django-rest-framework, not Django itself.
Really, your test should exercise the same thing you actually do in production: either send JSON from curl or get your test to post form data, as sketched below.
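For example, either of these would make the test and the curl call consistent (values carried over from the question):
curl -X POST -H "Content-Type: application/json" --data '{"some_float": 1.23456789012123123}' "http://localhost:8000/myview"
Or have the test post form data, the way curl does by default:
self.app.post(url=url, params={'some_float': '1.26356756467'})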
When your request's content type is "application/x-www-form-urlencoded", request.data becomes a QueryDict.
See the FormParser class:
https://github.com/encode/django-rest-framework/blob/master/rest_framework/parsers.py
Also note that QueryDict has a getlist method, but it can't return nested dict values. The html-json-forms package converts field names like these into nested structures:
<input name="items[name]" value="Example">
<input name="items[count]" value="5">
https://pypi.org/project/html-json-forms/
Then define a custom form parser:
from django.conf import settings
from django.http import QueryDict
from html_json_forms import parse_json_form
from rest_framework.parsers import FormParser


class CustomFormParser(FormParser):
    """
    Parser for form data.
    """
    media_type = 'application/x-www-form-urlencoded'

    def parse(self, stream, media_type=None, parser_context=None):
        """
        Parses the incoming bytestream as a URL-encoded form,
        and returns the resulting data.
        """
        parser_context = parser_context or {}
        encoding = parser_context.get('encoding', settings.DEFAULT_CHARSET)
        data = QueryDict(stream.read(), encoding=encoding)
        return parse_json_form(data.dict())  # return dict
Then override DEFAULT_PARSER_CLASSES:
https://www.django-rest-framework.org/api-guide/settings/#default_parser_classes
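For example, in settings.py (the module path for the parser above is hypothetical):
REST_FRAMEWORK = {
    'DEFAULT_PARSER_CLASSES': [
        'rest_framework.parsers.JSONParser',
        'myapp.parsers.CustomFormParser',  # wherever CustomFormParser lives
    ]
}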

How do I use Python requests to download a processed file?

I'm using Django 1.8.1 with Python 3.4 and I'm trying to use requests to download a processed file. The following code works perfectly for a normal requests.get call that downloads the exact file at the server location, i.e. an unprocessed file.
The file needs to be processed based on the passed data (shown below as "data"). This data needs to be passed into the Django backend, which, based on that text, should pass variables to run an internal program on the server and output a .gcode file instead of the .stl filetype.
The Python file:
import requests, os, json

SERVER = 'http://localhost:8000'
authuser = 'admin@google.com'
authpass = 'passwords'

# data not implemented
##############################################
data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}
############################################

category = requests.get(SERVER + '/media/uploads/9128342/141303729.stl', auth=(authuser, authpass))

# download to path file
path = "/home/bradman/Downloads/requestdata/newfile.stl"
if category.status_code == 200:
    with open(path, 'wb') as f:
        for chunk in category:
            f.write(chunk)
I'm very confused about this, but I think the best course of action is to pass the data along with requests.get, and somehow make some function to grab it inside my views.py for Django. Anyone have any ideas?
To send data with a request you can do
get(..., params=data)
(and the data is passed as query parameters in the URL)
or
post(..., data=data)
(and the data is sent in the request body, like an HTML form).
BTW, some APIs need both params= and data= in a single GET or POST request to send all the needed information.
Read the requests documentation.
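A minimal sketch using the names from the question (the processing endpoint is hypothetical; your Django view would read the values from request.GET or request.POST):
import requests

SERVER = 'http://localhost:8000'
authuser = 'admin@google.com'
authpass = 'passwords'
data = {'FirstName': 'Steve', 'Lastname': 'Escovar'}

# GET: the data is encoded into the URL as a query string
r = requests.get(SERVER + '/media/uploads/9128342/141303729.stl',
                 params=data, auth=(authuser, authpass))

# POST: the data is sent in the request body, like an HTML form
r = requests.post(SERVER + '/process/',  # hypothetical endpoint
                  data=data, auth=(authuser, authpass))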

How to get HTML from URL that returns "junk" data?

I want to get the HTML source code of a given URL. I tried this:
import urllib2
url = 'http://mp3.zing.vn' # write the url here
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
print data
But the returned data is not in HTML format for some pages. I tried another link, http://phuctrancs.info, and it works (as that page is plain HTML). I have also tried using the BeautifulSoup library, but it didn't work either. Any suggestions?
You're getting the HTML you expect, but it's compressed. I tried this URL by hand and got back a binary mess with this in the headers:
Content-Encoding: gzip
I saved the response body to a file and was able to gunzip it on the command line. You should also be able to decompress it in your program with the functions in the standard library's zlib module.
Update for anyone having trouble with zlib.decompress...
The compressed data you will get (or at least that I got in Python 2.6) apparently has a "gzip header and trailer" like you'd expect in *.gz files, while zlib.decompress expects a "zlib wrapper"... probably. I kept getting an unhelpful zlib.error exception:
Traceback (most recent call last):
File "./fixme.py", line 32, in <module>
text = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
The solution is entirely undocumented in the Python standard library, but can be found in Greg Hewgill's answer to a question about gzip streams: You have to feed zlib.decompress a wbits argument, created by adding a magic number to an undocumented module-level constant <grumble, mutter...>:
text = zlib.decompress(data, 16 + zlib.MAX_WBITS)
If you feel this isn't obfuscated enough, note that a 32 here would be every bit as magical as the 16.
The only hint of this is buried in the original zlib's manual, under the deflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 16 to windowBits to write a simple gzip header and trailer around the compressed data instead of a zlib wrapper.
...and the inflateInit2 function:
windowBits can also be greater than 15 for optional gzip decoding. Add 32 to windowBits to enable zlib and gzip decoding with automatic header detection, or add 16 to decode only the gzip format [...]
Note that the zlib.decompress docs explicitly tell you that you can't do this:
The default value is therefore the highest value, 15.
But this is... the opposite of true.
<fume, curse, rant...>
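Putting it together with the original code (Python 2; the Content-Encoding check is a reasonable guard, since not every page is compressed):
import urllib2
import zlib

usock = urllib2.urlopen('http://mp3.zing.vn')
data = usock.read()
encoding = usock.info().getheader('Content-Encoding')
usock.close()

# 16 + MAX_WBITS decodes the gzip format; 32 + MAX_WBITS would auto-detect
# either the gzip or zlib wrapper, as described above
if encoding == 'gzip':
    data = zlib.decompress(data, 16 + zlib.MAX_WBITS)
print data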
Have you looked into the response code? urllib2 may need you to handle the response, such as a 301 redirect and so on.
You should print the response code like:
data = usock.read()
if usock.getcode() != 200:
    print "something unexpected"
Updated:
If the response contains non-localized or non-readable text, then you might need to specify the request character set in the request headers.
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
# addheaders expects a list of (name, value) tuples
opener.addheaders = [('Content-Type', 'text/html; charset=UTF-8')]
urllib2.install_opener(opener)
PS: untested.
Use Beautiful Soup with requests:
import requests
from bs4 import BeautifulSoup

url = 'http://www.google.com'
r = requests.get(url)
b = BeautifulSoup(r.text, 'html.parser')
b will contain all the HTML tags and also provides iterators to traverse elements/tags. Note that requests transparently decompresses gzip-encoded responses before exposing r.text, which sidesteps the original problem. To learn more, see https://pypi.python.org/pypi/beautifulsoup4/4.3.2

ValueError: need more than 1 value to unpack, PoolManager request

The following code in utils.py
manager = PoolManager()
data = json.dumps(dict) #takes in a python dictionary of json
manager.request("POST", "https://myurlthattakesjson", data)
Gives me ValueError: need more than 1 value to unpack when the server is run. Does this most likely mean that the JSON is incorrect or something else?
Your JSON data needs to be URL-encoded for it to be POST (or GET) safe. Note also that the third positional argument of manager.request() is fields; passing a plain string there is what triggers the "need more than 1 value to unpack" error, so pass the encoded string as body= instead:
# import the URL-encoding helper
import urllib.parse

manager = PoolManager()

# URL-encode the dictionary itself (urlencode expects a mapping, not a JSON string)
encdata = urllib.parse.urlencode(my_dict)  # my_dict is your python dictionary

manager.request("POST", "https://myurlthattakesjson", body=encdata)
I believe in Python 3 they made some changes so that the data needs to be binary. See "unable to Post data to a login form using urllib python v3.2.1".
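Alternatively, if the endpoint really expects JSON, a sketch that sends it directly as the request body (payload contents assumed):
import json
from urllib3 import PoolManager

manager = PoolManager()
payload = {'some_key': 'some_value'}  # your python dictionary

# Send the JSON string as the body, with a matching Content-Type header
r = manager.request(
    "POST",
    "https://myurlthattakesjson",
    body=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
print(r.status)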
