Passing custom parameter to scrapy request - python

I want to set a custom parameter on my request so I can retrieve it when I process it in parse_item. This is my code:
def start_requests(self):
    yield Request("site_url", meta={'test_meta_key': 'test_meta_value'})

def parse_item(self, response):
    print response.meta
parse_item will be called according to the following rules:
self.rules = (
    Rule(SgmlLinkExtractor(deny=tuple(self.deny_keywords), allow=tuple(self.client_keywords)), callback='parse_item'),
    Rule(SgmlLinkExtractor(deny=tuple(self.deny_keywords), allow=('', ))),
)
According to the Scrapy docs:
the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
But I don't see the custom meta in parse_item. Is there any way to fix this? Is meta the right way to go?

When you generate a new Request, you need to specify the callback function; otherwise the response will be passed to CrawlSpider's default parse method.
I ran into a similar problem and it took me a while to debug.
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (string) – the HTTP method of this request. Defaults to 'GET'.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
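A minimal sketch combining both points; it uses a plain Spider, a placeholder URL, and the parse_item name from your rules for simplicity:

import scrapy

class MetaExampleSpider(scrapy.Spider):
    name = 'meta_example'

    def start_requests(self):
        # set the custom value in meta and point the request at parse_item explicitly
        yield scrapy.Request(
            'http://example.com',  # placeholder URL
            meta={'test_meta_key': 'test_meta_value'},
            callback=self.parse_item,
        )

    def parse_item(self, response):
        # the meta dict set on the request is available on the response
        self.logger.info(response.meta.get('test_meta_key'))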

To pass an extra parameter you must use cb_kwargs, then accept the parameter in the parse method.
You can refer to this part of the documentation.
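A minimal sketch, assuming Scrapy 1.7+ (which introduced cb_kwargs); the URL and key names are placeholders:

import scrapy

class CbKwargsExampleSpider(scrapy.Spider):
    name = 'cb_kwargs_example'

    def start_requests(self):
        # pass the extra value through cb_kwargs instead of meta
        yield scrapy.Request(
            'http://example.com',  # placeholder URL
            callback=self.parse_item,
            cb_kwargs={'test_key': 'test_value'},
        )

    def parse_item(self, response, test_key):
        # cb_kwargs entries arrive as keyword arguments of the callback
        self.logger.info(test_key)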

I believe what you want is response.meta['test_meta_key']. That's the way to access meta parameters.

Related

Scrapy request chaining not working with Spider Middleware

Similar to what is done in the link:
How can i use multiple requests and pass items in between them in scrapy python
I am trying to chain requests from spiders like in Dave McLain's answer. Returning a request object from parse function works fine, allowing the spider to continue with the next request.
def parse(self, response):
    # Some operations
    self.url_index += 1
    if self.url_index < len(self.urls):
        return scrapy.Request(url=self.urls[self.url_index], callback=self.parse)
    return items
However, I have the default Spider Middleware, where I do some caching and logging operations in process_spider_output. A request object returned from the parse function goes through this middleware first, so the middleware has to return the request object as well.
def process_spider_output(self, response, result, spider):
    # Called with the results returned from the Spider, after
    # it has processed the response.
    # Must return an iterable of Request, or item objects.
    if hasattr(spider, 'multiple_urls'):
        if spider.url_index + 1 < len(spider.urls):
            return [result]
            # return [scrapy.Request(url=spider.urls[spider.url_index], callback=spider.parse)]
    # Some operations ...
According to the documentation, it must return an iterable of Request or item objects. However, when I return the result (which is a Request object), or construct a new request object (as in the comment), the spider just terminates (with a spider finished signal) without making a new request.
Documentation link: https://docs.scrapy.org/en/latest/topics/spider-middleware.html#writing-your-own-spider-middleware
I am not sure if there is an issue with the documentation or with the way I interpret it, but returning request objects from the middleware doesn't make a new request; instead it terminates the flow.
It was quite simple yet frustrating to solve. The middleware is supposed to return an iterable of request objects. However, putting the request object into a list (which is an iterable) doesn't seem to work. Using yield result in the process_spider_output middleware function instead works.
Since the main issue is resolved, I'll leave this answer as a reference. Better explanations of why this is the case are appreciated.
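For reference, a minimal sketch of the generator-based variant described above (the caching and logging details are omitted):

def process_spider_output(self, response, result, spider):
    # result is an iterable of the requests/items produced by the spider;
    # yielding each entry keeps the generator contract Scrapy expects.
    for entry in result:
        # caching / logging operations would go here
        yield entry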

Django URL passing parameter

Would anyone please explain to me how to pass a parameter in a Django URL?
I created a view in views.py, def hours_ahead(request, offset), to calculate the time ahead. Then in urls.py I call it with re_path(r'^time/plus/\d{1,2}/$', hours_ahead).
But I get this error message when I open http://127.0.0.1:8000/time/plus/2/:
TypeError at /time/plus/2/
hours_ahead() missing 1 required positional argument: 'offset'
offset should be the data captured in the URL (identified by the parentheses in your URL pattern):
re_path(r'^time/plus/(\d{1,2})/$', hours_ahead)
You configured the regular expression in your URL but didn't specify the parameter name that will be sent to your view function.
The following URL definition should work. Notice the ?P<offset> part, which defines the named parameter that will capture the value matching the regular expression.
re_path(r'^time/plus/(?P<offset>\d{1,2})/$', hours_ahead)
Note 1: if you're not really strict about the number of hours that can be added, you can use a simpler URL definition that just captures an integer:
path('time/plus/<int:offset>/', hours_ahead)
But that will allow a caller of your URL to add a huge number of hours, as long as it fits in an integer.
Note 2: Not really your question but if you think about correct API / interface design this can be important: what you're doing is an action, adding hours. That action is now triggered by an HTTP GET request. Actions should always be triggered by other HTTP methods such as POST, PATCH, etc.
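For completeness, a minimal sketch of a matching view; the body is an assumption based on the question's description of "calculating the time ahead":

import datetime
from django.http import HttpResponse

def hours_ahead(request, offset):
    # offset arrives as a string from a regex capture group,
    # or as an int when the <int:offset> path converter is used
    offset = int(offset)
    dt = datetime.datetime.now() + datetime.timedelta(hours=offset)
    return HttpResponse("In %s hour(s), it will be %s." % (offset, dt))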

how to change flask default http method from GET to POST

Currently Flask uses GET as the default HTTP method. Is there any flexible way to change this default method to POST for all app.route definitions?
The only way I can think of to do this would be to run your own version of Flask that changes this code to default to POST:
# if the methods are not given and the view_func object knows its
# methods we can use that instead. If neither exists, we go with
# a tuple of only ``GET`` as default.
if methods is None:
    methods = getattr(view_func, 'methods', None) or ('GET',)
becomes...
# if the methods are not given and the view_func object knows its
# methods we can use that instead. If neither exists, we go with
# a tuple of only ``GET`` as default.
if methods is None:
    methods = getattr(view_func, 'methods', None) or ('POST',)
Code: Lines 1184-1188
Though at this point it's probably just simpler to add a method declaration of POST to each route definition.
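For example (route and view names are placeholders):

from flask import Flask, request

app = Flask(__name__)

@app.route('/submit', methods=['POST'])  # declare POST explicitly per route
def submit():
    # handle the POSTed form data
    return request.form.get('value', 'ok')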

Scrapy: overwrite DEPTH_LIMIT variable based on value read from custom config

I am using InitSpider and read a custom json configuration within the def __init__(self, *a, **kw): method.
The json config file contains a directive with which I can control the crawling depth. I can already successfully read this configuration file and extract the value. The main problem is how to tell scrapy to use this value.
Note: I don't want to use a command-line argument such as -s DEPTH_LIMIT=3; I actually want to parse it from my custom configuration.
DEPTH_LIMIT is used in scrapy.spidermiddlewares.depth.DepthMiddleware. If you have a quick look at the code, you'll see that the DEPTH_LIMIT value is read only when initializing that middleware.
I think this might be a good solution for you:
1. In the __init__ method of your spider, set a spider attribute max_depth with your custom value (sketched below).
2. Override scrapy.spidermiddlewares.depth.DepthMiddleware and have it check the max_depth attribute.
3. Disable the default DepthMiddleware and enable your own one in the settings (see the settings sketch after the middleware example).
See also http://doc.scrapy.org/en/latest/topics/spider-middleware.html
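A quick sketch of step #1, assuming the older Scrapy API the question uses; the config file name and JSON key are assumptions:

import json

from scrapy.spiders.init import InitSpider

class MySpider(InitSpider):
    name = 'my_spider'

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        # read the crawl depth from a custom JSON config file (hypothetical name and key)
        with open('spider_config.json') as f:
            config = json.load(f)
        self.max_depth = config.get('depth_limit', 3)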
A quick example of the overridden middleware described in step #2:
from scrapy.spidermiddlewares.depth import DepthMiddleware

class MyDepthMiddleware(DepthMiddleware):

    def process_spider_output(self, response, result, spider):
        if hasattr(spider, 'max_depth'):
            self.maxdepth = getattr(spider, 'max_depth')
        return super(MyDepthMiddleware, self).process_spider_output(response, result, spider)
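And a sketch of the settings for step #3, assuming the overridden middleware lives in myproject.middlewares:

# settings.py
SPIDER_MIDDLEWARES = {
    # disable the built-in depth middleware
    'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
    # enable the overridden version (module path is an assumption)
    'myproject.middlewares.MyDepthMiddleware': 900,
}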

Can I specify any method as the callback when constructing a Scrapy Request object?

I'm trying to create a request and have been previously passing a function in my spider class as the callback. However, I've since moved that function to an Item subclass, because I'd like to have different types of Items and the callback may be different for each sort of item (e.g. at the moment I'm going to raise DropItem if the content type isn't as expected, and have a different set of valid MIME types for each type of Item). So, what I was wondering was can I pass a function from my Item subclass as the callback parameter? Basically like so:
item = MyCustomItem() # Extends scrapy.item.Item
# bunch of code here...
req = Request(urlparse.urljoin(response.url, url), method="HEAD", callback=item.parse_resource_metadata)
At the moment item.parse_resource_metadata isn't getting called. Printing req.callback gives
<bound method ZipResource.parse_resource_metadata of {correct data for this Item object}>
so it at least constructs the request as I had hoped it would.
[edit] Mea culpa, the callback wasn't called because the start page wasn't being crawled (I had to override parse_start_url()). But it turns out I was doing things wrong, so it's a good thing I asked!
Theoretically, it is doable, since a callback is just a callable that takes a response as its argument.
However, Items are just containers for fields; they are meant for storing data, and you should not put logic there.
Better create a method in the spider and pass the item instance inside meta:
def parse(self, response):
    ...
    item = MyCustomItem()
    ...
    yield Request(urlparse.urljoin(response.url, url),
                  method="HEAD",
                  meta={'item': item},
                  callback=self.my_callback)

def my_callback(self, response):
    item = response.meta['item']
    ...
I'm not completely sure what you are trying to achieve, but you might also take a closer look at Item Loaders and Input and Output Processors.
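If you go that route, a minimal Item Loader sketch; the field name, selector, and import paths assume a recent Scrapy version:

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class MyCustomItem(scrapy.Item):
    name = scrapy.Field()  # hypothetical field

class MyCustomItemLoader(ItemLoader):
    default_item_class = MyCustomItem
    default_output_processor = TakeFirst()
    name_in = MapCompose(str.strip)  # input processor for the 'name' field

# inside a spider callback:
def parse_item(self, response):
    loader = MyCustomItemLoader(response=response)
    loader.add_css('name', 'h1::text')  # selector is an assumption
    yield loader.load_item()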
