Scrapy request chaining not working with Spider Middleware - python

Similar to what is done in the link:
How can i use multiple requests and pass items in between them in scrapy python
I am trying to chain requests from spiders as in Dave McLain's answer. Returning a request object from the parse function works fine, allowing the spider to continue with the next request.
def parse(self, response):
    # Some operations
    self.url_index += 1
    if self.url_index < len(self.urls):
        return scrapy.Request(url=self.urls[self.url_index], callback=self.parse)
    return items
However, I have the default Spider Middleware, where I do some caching and logging operations in process_spider_output. A request object returned from the parse function first goes through the middleware, so the middleware has to return the request object as well.
def process_spider_output(self, response, result, spider):
    # Called with the results returned from the Spider, after
    # it has processed the response.
    # Must return an iterable of Request, or item objects.
    if hasattr(spider, 'multiple_urls'):
        if spider.url_index + 1 < len(spider.urls):
            return [result]
            # return [scrapy.Request(url=spider.urls[spider.url_index], callback=spider.parse)]
    # Some operations ...
According to the documentation, it must return an iterable of Request or item objects. However, when I return the result (which is a Request object), or construct a new request object (as in the comment), the spider just terminates (emitting the spider finished signal) without making a new request.
Documentation link: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#writing-your-own-spider-middleware
I am not sure if there is an issue with the documentation or with the way I interpret it. But returning request objects from the middleware doesn't make a new request; instead it terminates the flow.

It was quite simple yet frustrating to solve the problem. The middleware is supposed to return an iterable of request objects. However, putting the request object into a list (which is an iterable) doesn't seem to work. Using yield result in the process_spider_output middleware function works instead.
Since the main issue is resolved, I'll leave this answer as a reference. Better explanations of why this is the case are appreciated.
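For reference, here is a minimal sketch of a generator-based process_spider_output, which is one way to interpret the fix described above (the actual caching and logging steps are omitted):

def process_spider_output(self, response, result, spider):
    # result is itself an iterable of the Requests/items produced by the spider,
    # so wrapping it in another list (e.g. [result]) hands Scrapy a list of
    # iterables rather than a list of Requests. Yielding each element keeps the
    # middleware a generator and lets Scrapy schedule the chained Request.
    for element in result:
        # caching / logging operations would go here
        yield element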

Related

How can I apply iter() to a pagination api?

I watched Raymond Hettinger's Idiomatic Python talk, and learned about the sentinel argument to iter().
I'd like to try to apply it to a piece of code I'm working on that iterates over a paginated API (it's Twilio, but that's not relevant to my question).
I have an API that returns a list of data and a next-page URL. When the pagination is exhausted, the next-page URL comes back as an empty string. I wrote the fetching function as a generator, and it looks roughly like this:
def fetch(url):
    while url:
        data = requests.get(url).json()
        url = data['next_page_uri']
        for row in data[resource]:
            yield row
This code works fine, but I'd like to try to remove the while loop and replace it with a call to iter() using the next_page_uri value as the sentinel argument.
Alternatively, could this be written with yield from?
I think this might be what you mean… but as stated in the comments, it doesn't help much:
def fetch_paged(url):
    while url:
        res = requests.get(url)
        res.raise_for_status()
        data = res.json()
        yield data
        url = data['next_page_uri']

def fetch(url):
    for data in fetch_paged(url):
        yield from data[resource]
(I've taken the opportunity to put in a call to raise_for_status(), which will raise for non-successful responses, i.e. those with res.status_code >= 400)
Not sure if it's any "better", but it may pay off if you're going to be reusing the fetch_paged functionality a lot.
Note: lots of other APIs put this next_page_uri into the response headers in standard ways which the requests library knows how to deal with and exposes via the res.links attribute
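For APIs that do use the standard Link header, a sketch of the same loop built on res.links could look like this (the resource key and URLs are placeholders, not Twilio specifics):

import requests

def fetch_linked(url, resource):
    # requests parses RFC 5988 Link headers into the res.links dict,
    # e.g. {'next': {'url': 'https://...', 'rel': 'next'}, ...}
    while url:
        res = requests.get(url)
        res.raise_for_status()
        for row in res.json()[resource]:
            yield row
        url = res.links.get('next', {}).get('url')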

Scrapy contracts with multiple parse methods

What's the best approach to write contracts for Scrapy spiders that have more than one method to parse the response?
I saw this answer but it didn't sound very clear to me.
My current example: I have a method called parse_product that extracts the information on a page, but I have more data that I need to extract for the same product on another page, so I yield a new request at the end of this method and let the new callback extract these fields and return the item.
The problem is that if I write a contract for the second method, it will fail because it doesn't have the meta attribute (containing the item with most of the fields). If I write a contract for the first method, I can't check whether it returns the fields, because it returns a new request instead of the item.
def parse_product(self, response):
    il = ItemLoader(item=ProductItem(), response=response)
    # populate the item in here

    # yield the new request sending the ItemLoader to another callback
    yield scrapy.Request(new_url, callback=self.parse_images, meta={'item': il})

def parse_images(self, response):
    """
    @url http://foo.bar
    @returns items 1 1
    @scrapes field1 field2 field3
    """
    il = response.request.meta['item']
    # extract the new fields and add them to the item in here
    yield il.load_item()
In the example, I put the contract in the second method, but it gave me a KeyError exception on response.request.meta['item']; also, the fields field1 and field2 are populated in the first method.
Hope it's clear enough.
Frankly, I don't use Scrapy contracts, and I don't really recommend anyone use them either. They have many issues and may someday be removed from Scrapy.
In practice, I haven't had much luck using unit tests for spiders.
For testing spiders during development, I'd enable the cache and then re-run the spider as many times as needed to get the scraping right.
For regression bugs, I had better luck using item pipelines (or spider middlewares) that do validation on-the-fly (there is only so much you can catch in early testing anyway). It's also a good idea to have some strategies for recovering.
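As a rough illustration of that kind of on-the-fly validation, a minimal pipeline might look like this (a sketch, not the author's actual code; the required field names are made up):

from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    # hypothetical required fields; adjust to your item definition
    required_fields = ('field1', 'field2', 'field3')

    def process_item(self, item, spider):
        missing = [f for f in self.required_fields if not item.get(f)]
        if missing:
            raise DropItem("Missing fields: %s" % missing)
        return item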
And for maintaining a healthy codebase, I'd be constantly moving library-like code out from the spider itself to make it more testable.
Sorry if this isn't the answer you're looking for.

Can I specify any method as the callback when constructing a Scrapy Request object?

I'm trying to create a request, and previously I passed a function in my spider class as the callback. However, I've since moved that function to an Item subclass, because I'd like to have different types of Items and the callback may differ for each sort of item (e.g. at the moment I'm going to raise DropItem if the content type isn't as expected, and each type of Item has its own set of valid MIME types). So, what I was wondering was: can I pass a function from my Item subclass as the callback parameter? Basically like so:
item = MyCustomItem() # Extends scrapy.item.Item
# bunch of code here...
req = Request(urlparse.urljoin(response.url, url), method="HEAD", callback=item.parse_resource_metadata)
At the moment item.parse_resource_metadata isn't getting called. Printing req.callback gives
<bound method ZipResource.parse_resource_metadata of {correct data for this Item object}>
so it at least constructs the request as I had hoped it would.
[edit] Mea culpa, the callback wasn't called because the start page wasn't being crawled (I had to override parse_start_url()). But it turns out I was doing things wrong, so it's a good thing I asked!
Theoretically, it is doable, since callback is just a callable that takes a response as its argument.
Though, Items are just containers for the fields; they are meant for storing data, and you should not put logic there.
It's better to create a method in the spider and pass the item instance inside meta:
def parse(self, response):
    ...
    item = MyCustomItem()
    ...
    yield Request(urlparse.urljoin(response.url, url),
                  method="HEAD",
                  meta={'item': item},
                  callback=self.my_callback)

def my_callback(self, response):
    item = response.meta['item']
    ...
I'm not completely sure what you are trying to achieve, but you might also take a closer look at Item Loaders and Input and Output Processors.
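If you go that route, a minimal ItemLoader sketch with input/output processors could look like this (the field name, selector, and whitespace-stripping are illustrative assumptions; in older Scrapy versions the processors are imported from scrapy.loader.processors instead):

from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # input processor: strip whitespace from every value added to 'name'
    name_in = MapCompose(str.strip)

# inside the spider:
def parse(self, response):
    loader = ProductLoader(item=MyCustomItem(), response=response)
    loader.add_css('name', 'h1::text')
    yield loader.load_item()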

Flask: How do I get the returned value from a function that uses the @app.route decorator?

So I'm pretty new to Flask and I'm trying to wrap my mind around one thing. If I understand correctly, when you write a function within a Flask app and decorate it with @app.route, it only runs when you hit that path/URL.
I have a small OAuth app written in Flask that goes through the whole authorization flow and then returns the token.
My question is: how do I get that token out of the decorated function? For example, let's say I have something like this:
@app.route('/token/')
def getToken(code):  # code from the callback url
    # stuff to get the token
    # **********************
    return token
If I hit the /token/ URL path, the function returns the token. But now I need to get that token and use it in another function to read from and write to the API I just got the token from. My initial thought was to do this:
token = getToken(code)
But if I do that, I get this error:
RuntimeError: working outside of request context
So again, my question is: how do I get the token so I can pass it as a parameter to other functions?
Extract the token generation code into a separate function, so that you can call it from anywhere, including the view function. It's a good practice to keep the application logic away from the view, and it also helps with unit testing.
I assume your route includes a placeholder for code, which you skipped:
def generateToken(code):
    # stuff to get the token
    # **********************
    return token

@app.route('/token/<string:code>')
def getToken(code):
    return generateToken(code)
Just keep in mind that generateToken shouldn't depend on the request object. If you need any request data (e.g. HTTP header), you should pass it explicitly in arguments. Otherwise you will get the "working outside of request context" exception you mentioned.
It is possible to call request-dependent views directly, but you need to mock the request object, which is a bit tricky. Read the request context documentation to learn more.
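For completeness, here's a sketch of calling a request-dependent view inside a manually pushed request context using Flask's test_request_context (the URL and code value are illustrative):

with app.test_request_context('/token/some-code'):
    # inside this block, code that touches flask.request works,
    # so the view can be called directly
    token = getToken('some-code')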
Not sure what the context is; you could just call the method.
from yourmodule import get_token

def yourmethod():
    token = get_token()
Otherwise, you could use the requests library in order to retrieve the data from the route:
>>> import requests
>>> response = requests.get('http://www.yoursite.com/yourroute/')
>>> print(response.text)
If you're looking to write unit tests, Flask comes with a test client:
def test_get_token():
    resp = self.app.get('/yourroute')
    # do something with resp.data

Passing custom parameter to scrapy request

I want to set a custom parameter in my request so I can retrieve it when I process it in parse_item. This is my code:
def start_requests(self):
    yield Request("site_url", meta={'test_meta_key': 'test_meta_value'})

def parse_item(self, response):
    print(response.meta)
parse_item will be called according to the following rules:
self.rules = (
    Rule(SgmlLinkExtractor(deny=tuple(self.deny_keywords), allow=tuple(self.client_keywords)), callback='parse_item'),
    Rule(SgmlLinkExtractor(deny=tuple(self.deny_keywords), allow=('', ))),
)
According to the Scrapy docs:
the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
But I don't see the custom meta in parse_item. Is there any way to fix this? Is meta the right way to go?
When you generate a new Request, you need to specify the callback function; otherwise the response will be passed to CrawlSpider's parse method by default.
I ran into a similar problem and it took me a while to debug.
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (string) – the HTTP method of this request. Defaults to 'GET'.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
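A sketch of that advice applied to the example above (the URL is a placeholder; note that when a request carries an explicit callback, CrawlSpider's rules are not applied to its response):

def start_requests(self):
    # pass the callback explicitly so the meta ends up in parse_item
    yield Request(
        "http://example.com",
        meta={'test_meta_key': 'test_meta_value'},
        callback=self.parse_item,
    )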
To pass an extra parameter you must use cb_kwargs, and then accept that parameter in the parse method.
You can refer to this part of the documentation.
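A minimal sketch of that approach (requires Scrapy 1.7+, where cb_kwargs was introduced; the URL and names are illustrative):

def start_requests(self):
    yield Request(
        "http://example.com",
        callback=self.parse_item,
        cb_kwargs={'test_key': 'test_value'},
    )

def parse_item(self, response, test_key):
    # entries in cb_kwargs arrive as extra keyword arguments on the callback
    self.logger.info("got %s", test_key)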
I believe what you want is response.meta['test_meta_key']. That's the way to access meta parameters.
