Is it possible to modify scrapy Response object in middleware? - python

I scrape some data from several EU sites and find that sometimes my calls to response.xpath() brokes text. For instance, I found that html entities like "& amp;" &#164 and other similar translated into broken bytes like \x92 or \xc3 etc.
I found some working solution - escape html entities before calls to xpath method (using lxml lib). Looks Like this:
body_str = str(response.body, response._body_declared_encoding())
unescaped_body = html.unescape(body_str)
response = response.replace(body=unescaped_body)
It seems to work fine for me if such code called immediately at start of callback for processing response.
What I'm trying to do now is to move this code into Spider Middleware, to use this approach for each request or in another spider etc. But problem is that this code doesn't modify response object inside
def process_spider_input(self, response, spider):
Seems that response = response.replace(...) creates new local variable response, which isn't used elsewhere.
And my question is in title: can I modify response object inside spider middleware or not?

I would say it is better to use a Downloader Middleware with the process_response method and return a Response object.
...
def process_response(self, request, response, spider):
...
body_str = str(response.body, response._body_declared_encoding())
unescaped_body = html.unescape(body_str)
new_response = response.replace(body=unescaped_body)
return new_response

Related

Flask Middleware with both Request and Response

I want to create a middleware function in Flask that logs details from the request and the response. The middleware should run after the Response is created, but before it is sent back. I want to log:
The request's HTTP method (GET, POST, or PUT)
The request endpoint
The response HTTP status code, including 500 responses. So, if an exception is raised in the view function, I want to record the resulting 500 Response before the Flask internals send it off.
Some options I've found (that don't quite work for me):
The before_request and after_request decorators. If I could access the request data in after_request, my problems still won't be solved, because according to the documentation
If a function raises an exception, any remaining after_request functions will not be called.
Deferred Request Callbacks - there is an after_this_request decorator described on this page, which decorates an arbitrary function (defined inside the current view function) and registers it to run after the current request. Since the arbitrary function can have info from both the request and response in it, it partially solves my problem. The catch is that I would have to add such a decorated function to every view function; a situation I would very much like to avoid.
#app.route('/')
def index():
#after_this_request
def add_header(response):
response.headers['X-Foo'] = 'Parachute'
return response
return 'Hello World!'
Any suggestions?
My first answer is very hacky. There's actually a much better way to achieve the same result by making use of the g object in Flask. It is useful for storing information globally during a single request. From the documentation:
The g name stands for “global”, but that is referring to the data being global within a context. The data on g is lost after the context ends, and it is not an appropriate place to store data between requests. Use the session or a database to store data across requests.
This is how you would use it:
#app.before_request
def gather_request_data():
g.method = request.method
g.url = request.url
#app.after_request
def log_details(response: Response):
g.status = response.status
logger.info(f'method: {g.method}\n url: {g.url}\n status: {g.status}')
return response
Gather whatever request information you want in the function decorated with #app.before_request and store it in the g object.
Access whatever you want from the response in the function decorated with #app.after_request. You can still refer to the information you stored in the g object from step 1. Note that you'll have to return the response at the end of this function.
you can use flask-http-middleware for it link
from flask import Flask
from flask_http_middleware import MiddlewareManager, BaseHTTPMiddleware
app = Flask(__name__)
class MetricsMiddleware(BaseHTTPMiddleware):
def __init__(self):
super().__init__()
def dispatch(self, request, call_next):
url = request.url
response = call_next(request)
response.headers.add("x-url", url)
return response
app.wsgi_app = MiddlewareManager(app)
app.wsgi_app.add_middleware(MetricsMiddleware)
#app.get("/health")
def health():
return {"message":"I'm healthy"}
if __name__ == "__main__":
app.run()
Every time you make request, it will pass the middleware
Okay, so the answer was staring me in the face the whole time, on the page on Deferred Request Callbacks.
The trick is to register a function to run after the current request using after_this_request from inside the before_request callback. This is the code snippet of what worked for me:
#app.before_request
def log_details():
method = request.method
url = request.url
#after_this_request
def log_details_callback(response: Response):
logger.info(f'method: {method}\n url: {url}\n status: {response.status}')
These are the steps:
Get the required details from the response in the before_request callback and store them in some variables.
Then access what you want of the response in the function you decorate with after_this_request, along with the variables you stored the request details in earlier.

Scrapy : Sending information to prior function

I am using scrapy 1.1 to scrape a website. The site requires periodic relogin. I can tell when this is needed because when login is required a 302 redirection occurs. Based on # http://sangaline.com/post/advanced-web-scraping-tutorial/ , I have subclassed the RedirectMiddleware, making the location http header available in the spider under:
request.meta['redirect_urls']
My problem is that after logging in , I have set up a function to loop through 100 pages to scrape . Lets say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']) . My code looks like:
def test1(self, response):
......
for row in empties: # 100 records
d = object_as_dict(row)
AA
yield Request(url=myurl,headers=self.headers, callback=self.parse_lookup, meta={d':d}, dont_filter=True)
def parse_lookup(self, response):
if 'redirect_urls' in response.meta:
print str(response.meta['redirect_urls'])
BB
d = response.meta['d']
So as you can see, I get 'notified' of the need to relogin in parse_lookup at BB , but need to feed this information back to cancel the loop creating requests in test1 (AA). How can I make the information in parse lookup available in the prior callback function?
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request
class CustomMiddleware():
def process_response(self, request, response, spider):
if 'redirect_urls' in response.meta:
# assuming your spider has a method for handling the login
original_url = response.meta["redirect_urls"][0]
return Request(url="login_url",
callback=spider.login,
meta={"original_url": original_url})
return response
So you "intercept" the response before it goes to the parse_lookup and relogin/fix what is wrong and yield new requests...
Like Tomáš Linhart said the requests are asynchronous so I don't know if you could run into problems by "reloging in" several times in a row, as multiple requests might be redirected at the same time.
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
'myproject.middlewares.CustomDownloaderMiddleware': 543,
}
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory you could use approach partially suggested in comment by #Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to code your spider middleware and handle this exception in process_spider_exception method to log back in and retry failed requests.
But I think better and simpler approach would be to do the same once you detect the need to login, i.e. in parse_lookup. Not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting this to 1 might let you process one request at time and so there should be no failing requests as you always log back in when you need to.
Don't iterate over the 100 items and create requests for all of them. Instead, just create a request for the first item, process it in your callback function, yield the item, and only after that's done create the request for the second item and yield it. With this approach, you can check for the location header in your callback and either make the request for the next item or login and repeat the current item request.
For example:
def parse_lookup(self, response):
if 'redirect_urls' in response.meta:
# It's a redirect
yield Request(url=your_login_url, callback=self.parse_login_response, meta={'current_item_url': response.request.url}
else:
# It's a normal response
item = YourItem()
... # Extract your item fields from the response
yield item
next_item_url = ... # Extract the next page URL from the response
yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item page, otherwise just put the list of URLs in the first request's META dict and pass it along.
I think it should be better not to try all 100 requests all at once, instead you should try to "serialize" the requests, for example you could add all your empties in the request's meta and pop them out as necessary, or put the empties as a field of your spider.
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you should probably extend your middleware to perform the login.

Post print dictionary/json returns error to client

I am sending post request in the body of some json data, to process on server and I want the results back to client(c++ app on phone) in the form of json data and hence parse on mobile.
I have the following code inside handler:
class ServerHandler(tornado.web.RequestHandler):
def post(self):
data = tornado.escape.json_decode(self.request.body)
id = data.get('id',None)
#process data from db (take a while) and pack in result which is dictinary
result = process_data(id)# returns dictionary from db= takes time
print 'END OF HANDLER'
print json.dumps(result)
#before this code below I have tried also
#return result
#return self.write(result)
#return self.write(json.dumps(result))
#return json.dumps(result)
self.set_header('Content-Type', 'application/json')
json_ = tornado.escape.json_encode(result)
self.write(json_)
self.finish()
#return json.dumps(result)
I always get printed 'END OF HANDLER' and valid dictinary/json below on console but when I read at client mobile I always get
<html><title>405: Method Not Allowed</title><body>405: Method Not Allowed</body></html>
Does anyone have any idea what is the bug ?
(I am using CIwGameHttpRequest for sending request and it works when file is static =>name.json but now same content is giving error in post request. )
The error (HTTP 405 Method Not Allowed) means that you have made a request to a valid URL, but you are using an HTTP verb (e.g. GET, POST, PUT, DELETE) that cannot be used with that URL.
Your web service code appears to handle the POST verb, as evidenced by the post method name, and also by the fact that incoming requests appear to have a request body. You haven't shown us your C++ client code, so all I can do is to speculate that it is making a GET request. Does your C++ code call Request->setPOST();? (I haven't worked with CIwGameHttpRequest before, but Googling for it I found this page from which I took that line of code.)
I've not worked with Tornado before, but I imagine that there is some mechanism somewhere that allows you to connect a URL to a RequestHandler. Given that you have a 405 Method Not Allowed error rather than 404 Not Found, it seems that however this is done you've done it correctly. You issue a GET request to Tornado for the URL, it determines that it should call your handler, and only when it tries to use your handler it realises that it can't handle GET requests, concludes that your handler (and hence its URL) doesn't support GETs and returns a 405 error.

Downloader Middleware to ignore all requests to a certain URL in scrapy

I am trying to define a custom downloader middleware in Scrapy to ignore all requests to a particular URL (these requests are redirected from other URLs, so I can't filter them out when I generate the requests in the first place).
I have the following code, the idea of which is to catch this at the response processing stage (as I'm not exactly sure how requests redirecting to other requests works), check the URL, and if it matches the one I'm trying to filter out then return an IgnoreRequest exception, if not, return the response as usual so that it can continue to be processed.
from scrapy.exceptions import IgnoreRequest
from scrapy import log
class CustomDownloaderMiddleware:
def process_response(request, response, spider):
log.msg("In Middleware " + response.url, level=log.WARNING)
if response.url == "http://www.achurchnearyou.com//":
return IgnoreRequest()
else:
return response
and I add this to the dict of middlewares:
DOWNLOADER_MIDDLEWARES = {
'acny.middlewares.CustomDownloaderMiddleware': 650
}
with a value of 650, which should - I think - make it run directly after the RedirectMiddleware.
However, when I run the crawler, I get an error saying:
ERROR: Error downloading <GET http://www.achurchnearyou.com/venue.php?V=00001>: process_response() got multiple values for keyword argument 'request'
This error is occurring on the very first page crawled, and I can't work out why it is occurring - I think I've followed what the manual said to do. What am I doing wrong?
I've found the solution to my own problem - it was a silly mistake with creating the class and method in Python. The code above needs to be:
from scrapy.exceptions import IgnoreRequest
from scrapy import log
class CustomDownloaderMiddleware(object):
def process_response(self, request, response, spider):
log.msg("In Middleware " + response.url, level=log.WARNING)
if response.url == "http://www.achurchnearyou.com//":
raise IgnoreRequest()
else:
return response
That is, there needs to be a self parameter for the method as the first parameter, and the class needs to inherit from object.
If you know which requests are redirected to the problematic ones, how about something like:
def parse_requests(self, response):
....
meta = {'handle_httpstatus_list': [301, 302]}
callback = 'process_redirects'
yield Request(url, callback=callback, meta=meta, ...)
def process_redirects(self, response):
url = response.headers['location']
if url is no good:
return
else:
...
This way you avoid downloading useless responses.
And you can always define your own custom redirect middleware.

python catch web redirection

my problem is about Web redirect ,, i'm using urllib>getcode() to know what status codes return
so here is my code
import urllib
a = urllib.urlopen("http://www.site.com/incorrect-tDirectory")
a.getcode()
a.getcode() return 200 but actually it's redirect to main page and i've check references that says redirect should return as i remember 300 or 301 but it's not 200 hopefully you got me
so my question how to catch the redirection
urllib2.urlopen() doc page says:
This function returns a file-like object with two additional methods:
geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of an mimetools.Message instance (see Quick Reference to HTTP Headers)
urllib.urlopen() actually implements geturl(), too, but it's not put as explicitly in the documentation.

Categories