Scrapy: overwrite DEPTH_LIMIT variable based on value read from custom config - python

I am using InitSpider and read a custom JSON configuration within the def __init__(self, *a, **kw): method.
The JSON config file contains a directive with which I can control the crawling depth. I can already read this configuration file and extract the value successfully. The main problem is how to tell Scrapy to use this value.
Note: I don't want to use a command line argument such as -s DEPTH_LIMIT=3; I actually want to parse it from my custom configuration.

DEPTH_LIMIT is used in scrapy.spidermiddlewares.depth.DepthMiddleware. If you have a quick look at that code, you'll see that the DEPTH_LIMIT value is read only when the middleware is initialized.
I think this might be a good solution for you:

1. In the __init__ method of your spider, set a spider attribute max_depth with your custom value.
2. Override scrapy.spidermiddlewares.depth.DepthMiddleware and have it check the max_depth attribute.
3. Disable the default DepthMiddleware and enable your own one in the settings (see the settings sketch after the middleware example below).

See also http://doc.scrapy.org/en/latest/topics/spider-middleware.html
A quick example of the overridden middleware described in step #2:
from scrapy.spidermiddlewares.depth import DepthMiddleware


class MyDepthMiddleware(DepthMiddleware):
    def process_spider_output(self, response, result, spider):
        # If the spider carries a max_depth attribute, use it as the depth limit
        if hasattr(spider, 'max_depth'):
            self.maxdepth = getattr(spider, 'max_depth')
        return super(MyDepthMiddleware, self).process_spider_output(response, result, spider)
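For step #3, the settings change might look like the sketch below. The module path myproject.middlewares is an assumption; adjust it to wherever you put MyDepthMiddleware. The order value 900 matches the position the stock DepthMiddleware occupies in SPIDER_MIDDLEWARES_BASE.

# settings.py -- "myproject.middlewares" is an assumed module path for illustration
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.depth.DepthMiddleware': None,  # disable the built-in middleware
    'myproject.middlewares.MyDepthMiddleware': 900,          # enable the override in its place
}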


How to build a wrapper pytest plugin?

I want to wrap the pytest-html plugin in the following way:
Add an option X
Given the option X, delete data from the report
I was able to add the option by implementing the pytest_addoption(parser) function, but got stuck on the second part.
What I was able to do is implement a hook from pytest-html. However, I have to access my option X in order to know what to do. The problem is that pytest-html's hook does not receive the "request" object as a parameter, so I can't access the option value.
Can I pass additional arguments to a hook, or something like that?
You can attach additional data to the report object, for example via a custom wrapper around the pytest_runtest_makereport hook:
import pytest


@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_makereport(item, call):
    # Attach the session config to every report object
    outcome = yield
    report = outcome.get_result()
    report.config = item.config
Now the config object will be accessible via report.config in all reporting hooks, including the ones of pytest-html:
def pytest_html_report_title(report):
    """Called before adding the title to the report."""
    assert report.config is not None
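Building on that, here is a sketch of how the custom option from the question could then be consulted inside a pytest-html hook. The option name --keep-sensitive and the idea of clearing the row's extra HTML are only placeholders for whatever "delete data from the report" means in your case:

# conftest.py -- "--keep-sensitive" is a made-up option name for illustration
def pytest_addoption(parser):
    parser.addoption("--keep-sensitive", action="store_true",
                     help="keep sensitive data in the HTML report")


def pytest_html_results_table_html(report, data):
    # report.config was attached in the pytest_runtest_makereport wrapper above
    if not report.config.getoption("--keep-sensitive"):
        del data[:]  # drop the extra HTML captured for this row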

How to set IMAGES_STORE folder per Item in Scrapy 1.5

Scrapy 1.5 allows setting an IMAGES_STORE setting for storing all downloaded media, as explained in the documentation.
I would like to be able to specify a custom folder per Item based on some values in the Item. Not knowing much about internals of Scrapy, I am not sure exactly which methods to override to accomplish this.
I thought about overriding from_settings(cls, settings) but there I do not have access to Item yet.
Any ideas?
I solved the issue by overriding the file_path method: IMAGES_STORE holds the base path, and I control the variable part from file_path, something like below. Note that I initially had a typo and Scrapy silently ignored it without printing any errors, even in debug (I don't know why), so it is best to start with a simple string for testing.
import hashlib
from scrapy.utils.python import to_bytes

def file_path(self, request, response=None, info=None):
    url = request.url
    image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
    return '%s/full/%s.jpg' % ('my_custom_path', image_guid)
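If the folder has to depend on values from the Item itself, one approach (a sketch only; the category field and pipeline name are assumptions) is to pass the value through the request meta in get_media_requests and read it back in file_path:

import hashlib

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class PerItemImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # "category" is an assumed item field, used only to pick the folder
        for url in item.get('image_urls', []):
            yield Request(url, meta={'folder': item.get('category', 'default')})

    def file_path(self, request, response=None, info=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return '%s/full/%s.jpg' % (request.meta['folder'], image_guid)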

Can't call a decorator within the imported sub-class of a cherrpy application (site tree)

I am using CherryPy as a web server, and I want to check a user's logged-in status before returning the page. This works on methods in the main Application class (in site.py), but gives an error when I call the same decorated function on a method in a class that is one layer deeper in the webpage tree (in a separate file).
validate_user() is the function used as a decorator. It either passes a user to the page or sends them to a 401 restricted page. It is registered as a cherrypy.Tool like this:
from user import validate_user
cherrypy.tools.validate_user = cherrypy.Tool('before_handler', validate_user)
I attach different sections of the site to the main site.py file's Application class by assigning instances of the sub-classes as variables accordingly:
from user import UserAuthentication


class Root:
    user = UserAuthentication()  # maps user/login, user/register, user/logout, etc.
    admin = Admin()
    api = Api()

    @cherrypy.expose
    @cherrypy.tools.validate_user()
    def how_to(self, **kw):
        from other_stuff import how_to_page
        return how_to_page(kw)
This, however, does not work when I try to use the validate_user() inside the Admin or Api or Analysis sections. These are in separate files.
import cherrypy


class Analyze:
    @cherrypy.expose
    @cherrypy.tools.validate_user()  #### THIS LINE GIVES ERROR ####
    def explore(self, *args, **kw):  # @addkw(fetch=['uid'])
        import explore
        kw['uid'] = cherrypy.session.get('uid', -1)
        return explore.explorer(args, kw)
The error is that cherrypy.tools doesn't have a validate_user function or method, even though other things I assign in site.py do appear in cherrypy here. Why can't I use this tool in a separate file that is part of my overall site map?
If this is relevant: the validate_user() function simply looks at cherrypy.request.cookie, finds the 'session_token' value, compares it to our database, and passes the user along if the ID matches.
Sorry, I don't know whether the Analyze(), Api(), and User() pages are subclasses, nested classes, extended methods, or something else, so I can't give this a precise title. Do I need to pass the parent class in to them somehow?
The issue here is that Python processes everything except function/method bodies at import time. So in site.py, when you import user (or from user import <anything>), all of the user module is processed before the Python interpreter has gotten to the definition of the validate_user tool, including the decorator, which attempts to access that tool by value (rather than by reference).
CherryPy has another mechanism for decorating functions with config that will enable tools on those handlers. Instead of @cherrypy.tools.validate_user, use:
@cherrypy.config(**{"tools.validate_user.on": True})
This decorator works because instead of needing to access validate_user from cherrypy.tools to install itself on the handler, it instead configures CherryPy to install that tool on the handler later, when the handler is invoked.
If that tool is needed for all methods on that class, you can use that config decorator on the class itself.
Alternatively, you could enable that tool for given endpoints in the server config, as mentioned in the other question.
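For the class-wide case, here is a minimal sketch of what that might look like (the class and handler names are placeholders; the config key must match the name the tool was registered under):

import cherrypy


@cherrypy.config(**{"tools.validate_user.on": True})
class Analyze:
    # Every exposed handler in this class now runs the validate_user tool first
    @cherrypy.expose
    def explore(self, *args, **kw):
        return "only reachable by validated users"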

Passing custom parameter to scrapy request

I want to set a custom parameter in my request so I can retrieve it when I process it in parse_item. This is my code:
def start_requests(self):
    yield Request("site_url", meta={'test_meta_key': 'test_meta_value'})

def parse_item(self, response):
    print(response.meta)
parse_item will be called according to the following rules:
self.rules = (
    Rule(SgmlLinkExtractor(deny=tuple(self.deny_keywords), allow=tuple(self.client_keywords)), callback='parse_item'),
    Rule(SgmlLinkExtractor(deny=tuple(self.deny_keywords), allow=('', ))),
)
According to the Scrapy docs:
the Response.meta attribute is propagated along redirects and retries, so you will get the original Request.meta sent from your spider.
But I don't see the custom meta in parse_item. Is there any way to fix this? Is meta the right way to go?
When you generate a new Request, you need to specify the callback function; otherwise the response will be passed to CrawlSpider's parse method by default.
I ran into a similar problem and it took me a while to debug.
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
method (string) – the HTTP method of this request. Defaults to 'GET'.
meta (dict) – the initial values for the Request.meta attribute. If given, the dict passed in this parameter will be shallow copied.
To pass an extra parameter you must use cb_kwargs, then read the parameter in the parse method. You can refer to this part of the documentation.
I believe what you want is response.meta['test_meta_key']. That's the way to access meta parameters.
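Pulling the answers together, here is a sketch (the spider name and URLs are placeholders) that sets the callback explicitly and reads the value back via response.meta, plus the cb_kwargs alternative available on Scrapy 1.7+:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'  # placeholder spider name

    def start_requests(self):
        # meta travels with the request (and survives redirects/retries)
        yield scrapy.Request('https://example.com', callback=self.parse_item,
                             meta={'test_meta_key': 'test_meta_value'})
        # cb_kwargs (Scrapy >= 1.7) delivers values as callback keyword arguments
        yield scrapy.Request('https://example.com/other', callback=self.parse_other,
                             cb_kwargs={'test_key': 'test_value'})

    def parse_item(self, response):
        print(response.meta['test_meta_key'])

    def parse_other(self, response, test_key):
        print(test_key)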

How to access scrapy settings from item Pipeline

How do I access the Scrapy settings in settings.py from the item pipeline? The documentation mentions settings can be accessed through the crawler in extensions, but I don't see how to access the crawler in the pipelines.
UPDATE (2021-05-04)
Please note that this answer is now ~7 years old, so its validity can no longer be ensured. In addition, it uses Python 2.
The way to access your Scrapy settings (as defined in settings.py) from within your_spider.py is simple. All other answers are way too complicated. The reason for this is the very poor maintenance of the Scrapy documentation, combined with many recent updates and changes. Neither in the "Settings" documentation section "How to access settings", nor in the "Settings API", have they bothered giving a workable example. Here's an example of how to get your current USER_AGENT string.
Just add the following lines to your_spider.py:
# To get your settings from (settings.py):
from scrapy.utils.project import get_project_settings
...

class YourSpider(BaseSpider):
    ...

    def parse(self, response):
        ...
        settings = get_project_settings()
        print "Your USER_AGENT is:\n%s" % (settings.get('USER_AGENT'))
        ...
As you can see, there's no need to use @classmethod or re-define the from_crawler() or __init__() functions. Hope this helps.
PS. I'm still not sure why using from scrapy.settings import Settings doesn't work the same way, since it would be the more obvious choice of import?
OK, so the documentation at http://doc.scrapy.org/en/latest/topics/extensions.html says:
"The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method which receives a Crawler instance which is the main object controlling the Scrapy crawler. Through that object you can access settings, signals, stats, and also control the crawler behaviour, if your extension needs such a thing."
So then you can have a function to get the settings.
@classmethod
def from_crawler(cls, crawler):
    settings = crawler.settings
    my_setting = settings.get("MY_SETTING")
    return cls(my_setting)
The crawler engine then calls the pipeline's __init__ method with my_setting, like so:
def __init__(self, my_setting):
    self.my_setting = my_setting
And other functions can access it with self.my_setting, as expected.
Alternatively, in the from_crawler() function you can pass the crawler.settings object to __init__(), and then access settings from the pipeline as needed instead of pulling them all out in the constructor.
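Putting those pieces together, here is a minimal sketch of such a pipeline (MY_SETTING is just a placeholder setting name):

# pipelines.py -- MY_SETTING is a placeholder name used only for illustration
class MyPipeline(object):

    def __init__(self, my_setting):
        self.my_setting = my_setting

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py (plus overrides)
        return cls(my_setting=crawler.settings.get('MY_SETTING'))

    def process_item(self, item, spider):
        # self.my_setting is available wherever the pipeline needs it
        return item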
The correct answer is: it depends where in the pipeline you wish to access the settings.
avaleske has answered as if you wanted access to the settings outside of your pipeline's process_item method, but it's very likely that this is where you'll want the setting, in which case there is a much easier way, since the Spider instance itself gets passed in as an argument.
class PipelineX(object):

    def process_item(self, item, spider):
        wanted_setting = spider.settings.get('WANTED_SETTING')
The project structure is quite flat, so why not:
# pipeline.py
from myproject import settings
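With that direct import, settings are read as plain module attributes, as in the sketch below (WANTED_SETTING is a placeholder); note that this reads settings.py directly and therefore ignores per-spider or command-line overrides.

# pipeline.py -- WANTED_SETTING is a placeholder; direct import bypasses Scrapy's settings overrides
from myproject import settings


class PipelineX(object):
    def process_item(self, item, spider):
        wanted_setting = settings.WANTED_SETTING
        return item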
