Scrapy Middleware return Response - python

When using a Scrapy downloader middleware and you don't find what you need in the response, do you build a Response object and return that, or do you return the response variable passed in to process_response?
I tried the latter but kept getting "response has no attribute selector" when used with the FilesPipeline.
class CaptchaMiddleware(object):
    def process_response(self, request, response, spider):
        download_path = spider.settings['CAPTCHA_STORE']
        # 1
        captcha_images = parse_xpath(response, CAPTCHA_PATTERN, 'image')
        if captcha_images:
            for url in captcha_images:
                url = response.urljoin(url)
                print("Downloading %s" % url)
                download_file(url, os.path.join(download_path, url.split('/')[-1]))
            for image in os.listdir(download_path):
                Image.open(image)
        # 2
        return response
If I return at #1, the FilesPipeline runs properly and downloads the files, but if I return at #2, I get the error "response has no attribute selector".

From the docs:
process_response(request, response, spider)
process_response() should either: return a Response object, return a Request object or raise an IgnoreRequest exception.
If it returns a Response (it could be the same given response, or a
brand-new one), that response will continue to be processed with the
process_response() of the next middleware in the chain.
If it returns a Request object, the middleware chain is halted and the
returned request is rescheduled to be downloaded in the future. This
is the same behavior as if a request is returned from
process_request().
If it raises an IgnoreRequest exception, the errback function of the
request (Request.errback) is called. If no code handles the raised
exception, it is ignored and not logged (unlike other exceptions).
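To make that contract concrete, here is a minimal sketch of a hypothetical middleware exercising all three options (the 503 and captcha checks are made up purely for illustration):
from scrapy.exceptions import IgnoreRequest

class ExampleMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 503:
            # Returning a Request halts the chain and reschedules the download
            return request.copy()
        if b'captcha' in response.body:
            # Raising IgnoreRequest triggers Request.errback, if one is set
            raise IgnoreRequest('captcha page')
        # Returning a Response passes it on to the next middleware
        return response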

From the docs at https://doc.scrapy.org/en/latest/topics/request-response.html#textresponse-objects:
TextResponse objects adds encoding capabilities to the base Response
class, which is meant to be used only for binary data, such as images,
sounds or any media file.
Bare Response objects do not have a selector attribute; TextResponse and its subclasses do:
In [1]: from scrapy.http import Response, TextResponse
In [2]: Response('http://example.org', body=b'<html><body><div>Something</div></body></html>').selector
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-6fdd116632d2> in <module>
----> 1 Response('http://example.org', body=b'<html><body><div>Something</div></body></html>').selector
AttributeError: 'Response' object has no attribute 'selector'
In [3]: TextResponse('http://example.org', body=b'<html><body><div>Something</div></body></html>').selector
Out[3]: <Selector xpath=None data='<html><body><div>Something</div></body><'>
I don't see a new response being created in the code, but from the beginning of the question ("Do you build a Response object and return that (...)") I suspect the snippet might be incomplete, and the response returned at #2 could be a manually created Response.
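If a new response does need to be built in the middleware, one option (a sketch, not the asker's actual code; modified_body is a hypothetical stand-in) is to use response.replace(), which preserves the concrete response class, or to construct an HtmlResponse explicitly so that .selector keeps working downstream:
from scrapy.http import HtmlResponse

def process_response(self, request, response, spider):
    # modified_body is a hypothetical bytes value produced elsewhere.
    # replace() keeps the original response class (e.g. HtmlResponse),
    # so .selector stays available in later middlewares and pipelines
    return response.replace(body=modified_body)
    # ...or, equivalently, build a text-aware response explicitly
    # instead of a bare Response:
    # return HtmlResponse(url=response.url, body=modified_body,
    #                     encoding='utf-8', request=request)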

Related

Stop urllib.request from raising exceptions on HTTP errors

Python's urllib.request.urlopen() will raise an exception if the HTTP status code of the response is not OK (e.g., 404).
This is because the default opener uses the HTTPDefaultErrorHandler class:
A class which defines a default handler for HTTP error responses; all responses are turned into HTTPError exceptions.
Even if you build your own opener, it (un)helpfully includes the HTTPDefaultErrorHandler for you implicitly.
If, however, you don't want Python to raise an exception if you get a non-OK response, it's unclear how to disable this behavior.
If you build your own opener with build_opener(), the documentation notes, emphasis added,
Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ... HTTPDefaultErrorHandler ...
Therefore, we need to subclass HTTPErrorProcessor (the handler that routes non-2xx responses into that error machinery) so that it does not raise an exception and simply passes the response through the pipeline unmodified. Then build_opener() will use our processor instead of the default one.
import urllib.request

class NonRaisingHTTPErrorProcessor(urllib.request.HTTPErrorProcessor):
    # Pass every response through untouched instead of funnelling
    # non-2xx responses into the error-handler chain
    http_response = https_response = lambda self, request, response: response

opener = urllib.request.build_opener(NonRaisingHTTPErrorProcessor)
response = opener.open('http://example.com/doesnt-exist')
print(response.status)  # prints 404
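A note on the design: build_opener() accepts handler classes as well as instances (classes are instantiated for you), and because NonRaisingHTTPErrorProcessor is a subclass of HTTPErrorProcessor it replaces the default processor rather than being added alongside it.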
This answer (including the code sample) was not written by ChatGPT, but it did point out the solution.

How to mock requests methods called dynamically using getattr

I have a class which calls requests methods using getattr, like this:
import requests

class CustomRequests(object):
    def __init__(self):
        pass

    def _do_requests(self, method='GET', url='', expected_status=200):
        make_request = getattr(requests, method.lower())
        url = url if url else 'http://example.com'
        try:
            response = make_request(method, url=url)
        except response.exceptions.RequestException as exception:
            raise exception
        if response.status_code != expected_status:
            raise ValueError

    def get(self, *args, **kwargs):
        self._do_requests(method='GET', *args, **kwargs)
I am trying to test the API using the mock and responses libraries, like this:
import responses

@responses.activate
def test_get_method(self):
    responses.add('GET', url='http://test_this_api.com', status=200)
    custom_request = CustomRequest()
    response_data = custom_request.get(method='GET')
    AssertIsNotNone(response_data)
Is there any better or right way to test this method?
Getting this error:
message = message.format(**values)
KeyError: 'method'
There's no need to use getattr. requests.get, requests.post, etc. are just convenience methods for requests.request, which lets you pass the HTTP method as a parameter:
requests.request('GET', url) # equivalent to requests.get(url)
Also:
Your try/except is pointless, since all you do is re-raise the exception.
It doesn't make sense to raise a ValueError when the response status doesn't match what you expected. ValueError is for "when a built-in operation or function receives an argument that has the right type but an inappropriate value." Create your own exception class, e.g. UnexpectedHTTPStatusError.
Since the whole point of your CustomRequests class seems to be to raise an exception when the status code of the response doesn't match what the user expected, your tests should assert that an exception was actually raised with assertRaises().
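Putting those points together, here is a minimal sketch of how the refactored class and its test might look (UnexpectedHTTPStatusError is the name suggested above; the rest of the structure is illustrative, not taken from the original code):
import unittest

import requests
import responses

class UnexpectedHTTPStatusError(Exception):
    pass

class CustomRequests(object):
    def _do_request(self, method='GET', url='http://example.com', expected_status=200):
        # requests.request() takes the HTTP method as a plain argument,
        # so no getattr() is needed
        response = requests.request(method, url)
        if response.status_code != expected_status:
            raise UnexpectedHTTPStatusError(
                'expected %s, got %s' % (expected_status, response.status_code))
        return response

class TestCustomRequests(unittest.TestCase):
    @responses.activate
    def test_raises_on_unexpected_status(self):
        # responses intercepts the outgoing call and returns a canned 500
        responses.add(responses.GET, 'http://example.com', status=500)
        with self.assertRaises(UnexpectedHTTPStatusError):
            CustomRequests()._do_request('GET', 'http://example.com')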

Test Requests for Django Rest Framework aren't parsable by its own Request class

I'm writing an endpoint to receive and parse GitHub Webhook payloads using Django Rest Framework 3. In order to match the payload specification, I'm writing a payload request factory and testing that it's generating valid requests.
However, the problem comes when trying to test the request generated with DRF's Request class. Here's the smallest failing test I could come up with - the problem is that a request generated with DRF's APIRequestFactory seems to not be parsable by DRF's Request class. Is that expected behaviour?
from rest_framework.request import Request
from rest_framework.parsers import JSONParser
from rest_framework.test import APIRequestFactory, APITestCase

class TestRoundtrip(APITestCase):
    def test_round_trip(self):
        """
        A DRF Request can be loaded into a DRF Request object
        """
        request_factory = APIRequestFactory()
        request = request_factory.post(
            '/',
            data={'hello': 'world'},
            format='json',
        )
        result = Request(request, parsers=(JSONParser,))
        self.assertEqual(result.data['hello'], 'world')
And the stack trace is:
E
======================================================================
ERROR: A DRF Request can be loaded into a DRF Request object
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/james/active/prlint/venv/lib/python3.4/site-packages/rest_framework/request.py", line 380, in __getattribute__
return getattr(self._request, attr)
AttributeError: 'WSGIRequest' object has no attribute 'data'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/james/active/prlint/prlint/github/tests/test_payload_factories/test_roundtrip.py", line 22, in test_round_trip
self.assertEqual(result.data['hello'], 'world')
File "/home/james/active/prlint/venv/lib/python3.4/site-packages/rest_framework/request.py", line 382, in __getattribute__
six.reraise(info[0], info[1], info[2].tb_next)
File "/home/james/active/prlint/venv/lib/python3.4/site-packages/django/utils/six.py", line 685, in reraise
raise value.with_traceback(tb)
File "/home/james/active/prlint/venv/lib/python3.4/site-packages/rest_framework/request.py", line 186, in data
self._load_data_and_files()
File "/home/james/active/prlint/venv/lib/python3.4/site-packages/rest_framework/request.py", line 246, in _load_data_and_files
self._data, self._files = self._parse()
File "/home/james/active/prlint/venv/lib/python3.4/site-packages/rest_framework/request.py", line 312, in _parse
parsed = parser.parse(stream, media_type, self.parser_context)
File "/home/james/active/prlint/venv/lib/python3.4/site-packages/rest_framework/parsers.py", line 64, in parse
data = stream.read().decode(encoding)
AttributeError: 'str' object has no attribute 'read'
----------------------------------------------------------------------
I'm obviously doing something stupid - I've messed around with encodings... realised that I needed to pass the parsers list to the Request to avoid the UnsupportedMediaType error, and now I'm stuck here.
Should I do something different? Maybe avoid using APIRequestFactory? Or test my built GitHub requests a different way?
More info
GitHub sends a request out to registered webhooks that has a X-GitHub-Event header and therefore in order to test my webhook DRF code I need to be able to emulate this header at test time.
My path to succeeding with this has been to build a custom Request and load a payload using a factory into it. This is my factory code:
def PayloadRequestFactory():
    """
    Build a Request, configure it to look like a webhook payload from GitHub.
    """
    request_factory = APIRequestFactory()
    request = request_factory.post(url, data=PingPayloadFactory())
    request.META['HTTP_X_GITHUB_EVENT'] = 'ping'
    return request
The issue has arisen because I want to assert that PayloadRequestFactory is generating valid requests for various passed arguments - so I'm trying to parse them and assert their validity but DRF's Request class doesn't seem to be able to achieve this - hence my question with a failing test.
So really my question is - how should I test this PayloadRequestFactory is generating the kind of request that I need?
"Yo dawg, I heard you like Request, cos' you put a Request inside a Request" XD
I'd do it like this:
from rest_framework.test import APIClient
client = APIClient()
response = client.post('/', {'github': 'payload'}, format='json')
self.assertEqual(response.data, {'github': 'payload'})
# ...or assert something was called, etc.
Hope this helps
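One advantage of this approach: APIClient pushes the payload through the full routing, view and parser stack, the same path a real request takes, so you never need to wrap a WSGIRequest in a DRF Request yourself.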
Looking at the tests for APIRequestFactory in DRF, stub views are created and the factory's requests are run through those views; the output is inspected for expected results.
Therefore a reasonable, but slightly long, solution is to copy this strategy to assert that the PayloadRequestFactory is building valid requests, before pointing it at a full view.
The test above becomes:
from django.conf.urls import url
from django.test import TestCase, override_settings
from rest_framework.decorators import api_view
from rest_framework.response import Response
from rest_framework.test import APIRequestFactory

@api_view(['POST'])
def view(request):
    """
    Testing stub view to return Request's data and GitHub event header.
    """
    return Response({
        'header_github_event': request.META.get('HTTP_X_GITHUB_EVENT', ''),
        'request_data': request.data,
    })

urlpatterns = [
    url(r'^view/$', view),
]

@override_settings(ROOT_URLCONF='github.tests.test_payload_factories.test_roundtrip')
class TestRoundtrip(TestCase):
    def test_round_trip(self):
        """
        A DRF Request can be loaded via stub view
        """
        request_factory = APIRequestFactory()
        request = request_factory.post(
            '/view/',
            data={'hello': 'world'},
            format='json',
        )
        result = view(request)
        self.assertEqual(result.data['request_data'], {'hello': 'world'})
        self.assertEqual(result.data['header_github_event'], '')
Which passes :D

Downloader Middleware to ignore all requests to a certain URL in scrapy

I am trying to define a custom downloader middleware in Scrapy to ignore all requests to a particular URL (these requests are redirected from other URLs, so I can't filter them out when I generate the requests in the first place).
I have the following code. The idea is to catch this at the response-processing stage (as I'm not exactly sure how requests redirecting to other requests work): check the URL, and if it matches the one I'm trying to filter out, return an IgnoreRequest exception; if not, return the response as usual so that it can continue to be processed.
from scrapy.exceptions import IgnoreRequest
from scrapy import log

class CustomDownloaderMiddleware:
    def process_response(request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            return IgnoreRequest()
        else:
            return response
and I add this to the dict of middlewares:
DOWNLOADER_MIDDLEWARES = {
    'acny.middlewares.CustomDownloaderMiddleware': 650
}
with a value of 650, which should - I think - make it run directly after the RedirectMiddleware.
However, when I run the crawler, I get an error saying:
ERROR: Error downloading <GET http://www.achurchnearyou.com/venue.php?V=00001>: process_response() got multiple values for keyword argument 'request'
This error occurs on the very first page crawled, and I can't work out why - I think I've followed what the manual said to do. What am I doing wrong?
I've found the solution to my own problem - it was a silly mistake with creating the class and method in Python. The code above needs to be:
from scrapy.exceptions import IgnoreRequest
from scrapy import log

class CustomDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        log.msg("In Middleware " + response.url, level=log.WARNING)
        if response.url == "http://www.achurchnearyou.com//":
            raise IgnoreRequest()
        else:
            return response
That is, the method needs self as its first parameter, and the class needs to inherit from object. (Note that the fixed code also raises the IgnoreRequest instead of returning it.)
If you know which requests are redirected to the problematic ones, how about something like:
def parse_requests(self, response):
    ....
    meta = {'handle_httpstatus_list': [301, 302]}
    callback = 'process_redirects'
    yield Request(url, callback=callback, meta=meta, ...)

def process_redirects(self, response):
    url = response.headers['location']
    if url is no good:
        return
    else:
        ...
This way you avoid downloading useless responses.
And you can always define your own custom redirect middleware.
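For instance, here is a minimal sketch of such a middleware (the blocked URL and the 650 priority come from the question; header decoding details may vary across Scrapy versions):
from scrapy.exceptions import IgnoreRequest

class DropBadRedirectsMiddleware(object):
    BLOCKED_URL = "http://www.achurchnearyou.com//"

    def process_response(self, request, response, spider):
        if response.status in (301, 302):
            # Inspect the redirect target before RedirectMiddleware follows it
            location = response.headers.get('Location', b'').decode('latin1')
            if response.urljoin(location) == self.BLOCKED_URL:
                raise IgnoreRequest('redirect to blocked URL')
        return response

Registered with a priority above 600 (such as the question's 650), its process_response() is called before RedirectMiddleware's, so the offending redirect is dropped before it is ever followed.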

What does urllib2.Request(<url>) do and how do I print/view it

I'm trying to learn how urllib2 works and how it encapsulates its various components before sending out an actual request or response.
So far I have:
theurl = "www.example.com"
That obviously specifies the URL to look at.
req = urllib2.Request(theurl)
Don't know what this does, hence the question.
handle = urllib2.urlopen(req)
This one gets the page and does all the requests and responses required.
So my question is, what does urllib2.Request actually do?
To try and look at it to get an idea I tried
print req
and just got
<urllib2.Request instance at 0x123456789>
I also tried
print req.read()
and got:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib64/python2.4/urllib2.py", line 207, in __getattr__
    raise AttributeError, attr
AttributeError: read
So I'm obviously doing something wrong. If anyone can help with one or both of my questions, that would be great.
The class "Request" you're asking about:
http://docs.python.org/library/urllib2.html#urllib2.Request
class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
This class is an abstraction of a URL request.
The function you actually use to make a request (it accepts either a Request object or a plain URL string, around which it constructs a Request object): http://docs.python.org/library/urllib2.html#urllib2.urlopen
urllib2.urlopen(url[, data][,timeout])
Open the URL url, which can be either a string or a Request object.
Example:
theurl = "www.example.com"
try:
    resp = urllib2.urlopen(theurl)
    print resp.read()
except IOError as e:
    print "Error: ", e
Example 2 (with Request):
theurl = "www.example.com"
try:
    req = urllib2.Request(theurl)
    print req.get_full_url()
    print req.get_method()
    print dir(req)  # list lots of other stuff in Request
    resp = urllib2.urlopen(req)
    print resp.read()
except IOError as e:
    print "Error: ", e
urllib2.Request() looks like a function call, but isn't - it's an object constructor. It creates an object of type Request from the urllib2 module, documented here.
As such, it probably doesn't do anything except initialise itself. You can verify this by looking at the source code, which should be in your Python installation's lib directory (urllib2.py, at least in Python 2.x).
If you want to get the constructed URL from a Request object, call get_full_url() on the instance:
print req.get_full_url()
