How to access Scrapy settings from an Item Pipeline - python

How do I access the Scrapy settings in settings.py from the item pipeline? The documentation mentions that settings can be accessed through the crawler in extensions, but I don't see how to access the crawler in the pipelines.

UPDATE (2021-05-04)
Please note that this answer is now ~7 years old, so its validity can no longer be ensured. In addition, it uses Python 2.
Accessing your Scrapy settings (as defined in settings.py) from within your_spider.py is simple. All the other answers are far too complicated. The reason for this is the poor maintenance of the Scrapy documentation, combined with many recent updates and changes: neither the "How to access settings" part of the Settings documentation nor the Settings API bothers to give a workable example. Here's an example of how to get your current USER_AGENT string.
Just add the following lines to your_spider.py:
# To get your settings from (settings.py):
from scrapy.utils.project import get_project_settings
...

class YourSpider(BaseSpider):
    ...

    def parse(self, response):
        ...
        settings = get_project_settings()
        print "Your USER_AGENT is:\n%s" % (settings.get('USER_AGENT'))
        ...
As you can see, there's no need to use @classmethod or re-define the from_crawler() or __init__() functions. Hope this helps.
PS. I'm still not sure why using from scrapy.settings import Settings doesn't work the same way, since it would be the more obvious choice of import?
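As for that PS: as far as I can tell, instantiating Settings() directly only gives you Scrapy's built-in defaults and never reads your project's settings.py, whereas get_project_settings() does. A minimal sketch of the difference, assuming a project whose settings.py overrides USER_AGENT:
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings

defaults = Settings()              # built-in defaults only, ignores settings.py
project = get_project_settings()   # reads scrapy.cfg / SCRAPY_SETTINGS_MODULE

print(defaults.get('USER_AGENT'))  # Scrapy's stock user agent
print(project.get('USER_AGENT'))   # whatever your settings.py sets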

Ok, so the documentation at http://doc.scrapy.org/en/latest/topics/extensions.html says that:
The main entry point for a Scrapy extension (this also includes middlewares and pipelines) is the from_crawler class method which receives a Crawler instance which is the main object controlling the Scrapy crawler. Through that object you can access settings, signals, stats, and also control the crawler behaviour, if your extension needs to such thing.
So then you can have a function to get the settings.
@classmethod
def from_crawler(cls, crawler):
    settings = crawler.settings
    my_setting = settings.get("MY_SETTING")
    return cls(my_setting)
The crawler engine then calls the pipeline's __init__ function with my_setting, like so:
def __init__(self, my_setting):
    self.my_setting = my_setting
And other functions can access it with self.my_setting, as expected.
Alternatively, in the from_crawler() function you can pass the crawler.settings object to __init__(), and then access settings from the pipeline as needed instead of pulling them all out in the constructor.
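For completeness, here is a minimal sketch of a pipeline wired up that way; the class and setting names (MyPipeline, MY_SETTING) are placeholders:
class MyPipeline(object):

    def __init__(self, settings):
        # keep the whole Settings object around instead of individual values
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_item(self, item, spider):
        my_setting = self.settings.get('MY_SETTING')  # placeholder setting name
        return item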

The correct answer is: it depends where in the pipeline you wish to access the settings.
avaleske has answered as if you wanted access to the settings outside of your pipeline's process_item method, but it's very likely that this is where you'll want the setting, and in that case there is a much easier way, since the Spider instance itself gets passed in as an argument.
class PipelineX(object):

    def process_item(self, item, spider):
        # spider.settings gives you the running crawler's settings
        wanted_setting = spider.settings.get('WANTED_SETTING')
        return item
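Note that spider.settings here is the same Settings object the crawler holds, so as far as I can tell, per-spider overrides made via custom_settings are reflected in it as well.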

The project structure is quite flat, so why not:
# pipeline.py
from myproject import settings
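If you go down that route, you read plain attributes off the imported module rather than going through a Settings object; a rough sketch continuing from the import above (MY_SETTING and the class name are placeholders):
class MyFlatPipeline(object):

    def process_item(self, item, spider):
        # plain attribute access on the settings module, not settings.get(...)
        my_setting = settings.MY_SETTING  # placeholder setting name
        return item
The trade-off is that this only sees what is literally written in settings.py; overrides coming from the command line or a spider's custom_settings won't be picked up.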

Related

Python class variable usage across threads has different values

I'm getting confused about class variable access across multiple threads. I'm in a Django application and I initialize a class at server startup to inject some configuration. Then, on user requests, I call a method on that class, but the config is None.
class SharedService:
    _queue_name = None

    @staticmethod
    def config(queue_name=None):
        if queue_name:
            SharedService._queue_name = queue_name
        print(SharedService._queue_name)  # this prints "alert"

    @staticmethod
    def send_message(data):
        print(SharedService._queue_name)  # this prints None, but should print "alert"
In case it's useful: in the Django settings class loaded at startup, I do:
SharedService.config(queue_name="alert")
and in the user endpoint I just call:
SharedService.send_message("blabla")
I'm sure it worked previously, but we recently updated to Python 3.10 and Django 3.2, which might be related (or not!).
Ok, that was completely out of left field!
The shared service comes from an external library and can actually be imported from two paths because of errors in our Python path.
In the settings file, I was doing
from lib.services import SharedService
And from the request handling code
from services import SharedService
Which obviously creates two different class objects, and thus the second usage isn't initialized.
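If you ever need to confirm this kind of duplicate-import problem, a quick sketch (using the two import paths from this question) is to compare the class objects directly:
from lib.services import SharedService as ServiceA
from services import SharedService as ServiceB

print(ServiceA.__module__, ServiceB.__module__)  # two different module paths
print(ServiceA is ServiceB)                      # False: two distinct classes, two _queue_name values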

How to set IMAGES_STORE folder per Item in Scrapy 1.5

Scrapy 1.5 allows setting an IMAGES_STORE setting for storing all downloaded media, as explained in the documentation.
I would like to be able to specify a custom folder per Item based on some values in the Item. Not knowing much about internals of Scrapy, I am not sure exactly which methods to override to accomplish this.
I thought about overriding from_settings(cls, settings) but there I do not have access to Item yet.
Any ideas?
I solved the issue by overriding the file_path method. IMAGES_STORE holds the base path, and I control the variable part from file_path, something like below. However, I had a typo at first and Scrapy silently ignored it without printing any errors, even in debug... I don't know why. So it is best to start with a simple string for testing.
# requires: import hashlib and from scrapy.utils.python import to_bytes
def file_path(self, request, response=None, info=None):
    url = request.url
    image_guid = hashlib.sha1(to_bytes(url)).hexdigest()
    return '%s/full/%s.jpg' % ('my_custom_path', image_guid)
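If the variable part of the path has to come from the Item itself, one approach (a sketch, not what the answer above did) is to pass the item along via request.meta in get_media_requests and read it back in file_path; folder_name here is an assumed item field:
import hashlib

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class PerItemImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for url in item.get('image_urls', []):
            # carry the item on the request so file_path can see it
            yield scrapy.Request(url, meta={'item': item})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        folder = item.get('folder_name', 'default')  # assumed item field
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return '%s/full/%s.jpg' % (folder, image_guid)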

Can't call a decorator within the imported sub-class of a cherrpy application (site tree)

I am using cherrypy as a web server, and I want to check a user's logged-in status before returning the page. This works on methods in the main Application class (in site.py) but gives an error when I call the same decorated function on a method in a class that is one layer deeper in the webpage tree (in a separate file).
validate_user() is the function used as a decorator. It either passes a user to the page or sends them to a 401 restricted page, as a cherrypy.Tool, like this:
from user import validate_user
cherrypy.tools.validate_user = cherrypy.Tool('before_handler', validate_user)
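For reference, a 'before_handler' tool callable of that kind usually takes no arguments and raises on failure; a hypothetical sketch of what validate_user might look like, since the question only describes it (lookup_session is an assumed helper):
# user.py (hypothetical sketch, reconstructed from the description below)
import cherrypy


def validate_user():
    cookie = cherrypy.request.cookie.get('session_token')
    user = lookup_session(cookie.value) if cookie else None  # lookup_session is assumed
    if user is None:
        raise cherrypy.HTTPError(401, 'Not authorized')
    cherrypy.request.user = user  # pass the user along to the handler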
I attach different sections of the site to the main site.py file's Application class by assigning instances of the sub-classes as variables accordingly:
from user import UserAuthentication

class Root:
    user = UserAuthentication()  # maps user/login, user/register, user/logout, etc.
    admin = Admin()
    api = Api()

    @cherrypy.expose
    @cherrypy.tools.validate_user()
    def how_to(self, **kw):
        from other_stuff import how_to_page
        return how_to_page(kw)
This, however, does not work when I try to use the validate_user() inside the Admin or Api or Analysis sections. These are in separate files.
import cherrypy

class Analyze:

    @cherrypy.expose
    @cherrypy.tools.validate_user()  #### THIS LINE GIVES ERROR ####
    def explore(self, *args, **kw):  # @addkw(fetch=['uid'])
        import explore
        kw['uid'] = cherrypy.session.get('uid', -1)
        return explore.explorer(args, kw)
The error is that cherrypy.tools doesn't have a validate_user function or method, yet other things I assign in site.py do appear in cherrypy here. Why can't I use this tool in a separate file that is part of my overall site map?
If this is relevant, the validate_user() function simply looks at the cherrypy.request.cookie, finds the 'session_token' value, and compares it to our database and passes it along if the ID matches.
Sorry, I don't know if the Analyze() and Api() and User() pages are subclasses, nested classes, extended methods, or something else, so I can't give this a precise title. Do I need to pass the parent class in to them somehow?
The issue here is that Python processes everything except function/method bodies at import time. So in site.py, when you import user (or from user import <anything>), all of the user module is processed, including the decorator, before the Python interpreter has reached the definition of the validate_user tool; and that decorator is attempting to access the tool by value rather than by reference.
CherryPy has another mechanism for decorating functions with config that will enable tools on those handlers. Instead of @cherrypy.tools.validate_user, use:
@cherrypy.config(**{"tools.validate_user.on": True})
This decorator works because, instead of needing to access validate_user from cherrypy.tools to install itself on the handler, it configures CherryPy to install that tool on the handler later, when the handler is invoked.
If that tool is needed for all methods on that class, you can use that config decorator on the class itself.
Alternatively, you could enable that tool for given endpoints in the server config, as mentioned in the other question.
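A short sketch of both variants, assuming the tool is registered as cherrypy.tools.validate_user as shown in the question:
import cherrypy


# variant 1: the config decorator on the class enables the tool for every handler in it
@cherrypy.config(**{'tools.validate_user.on': True})
class Analyze:

    @cherrypy.expose
    def explore(self, *args, **kw):
        return 'only reachable when validate_user passes'


# variant 2: enable it per path in the app config instead
app_config = {
    '/analyze': {'tools.validate_user.on': True},
}
# cherrypy.quickstart(Root(), '/', app_config)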

Scrapy: overwrite DEPTH_LIMIT variable based on value read from custom config

I am using InitSpider and read a custom JSON configuration within the def __init__(self, *a, **kw) method.
The JSON config file contains a directive with which I can control the crawling depth. I can already read this configuration file and extract the value successfully. The main problem is how to tell Scrapy to use this value.
Note: I don't want to use a command-line argument such as -s DEPTH_LIMIT=3; I actually want to parse it from my custom configuration.
DEPTH_LIMIT is used in scrapy.spidermiddlewares.depth.DepthMiddleware. If you have a quick look at the code, you'll see that the DEPTH_LIMIT value is read only when that middleware is initialized.
I think this might be a good solution for you:
1. In the __init__ method of your spider, set a spider attribute max_depth with your custom value (a sketch follows the link below).
2. Override scrapy.spidermiddlewares.depth.DepthMiddleware and have it check the max_depth attribute.
3. Disable the default DepthMiddleware and enable your own one in the settings (a settings sketch follows the middleware example).
See also http://doc.scrapy.org/en/latest/topics/spider-middleware.html
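A sketch of step 1, assuming the custom JSON config lives in a my_config.json file with a max_depth key (both names are placeholders; the InitSpider import path is the Scrapy 1.x one):
import json

from scrapy.spiders.init import InitSpider


class MySpider(InitSpider):
    name = 'myspider'

    def __init__(self, *a, **kw):
        super(MySpider, self).__init__(*a, **kw)
        with open('my_config.json') as f:            # placeholder config path
            config = json.load(f)
        # step 1: expose the configured depth as a spider attribute
        self.max_depth = config.get('max_depth', 0)  # placeholder key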
A quick example of the overridden middleware described in step #2:
from scrapy.spidermiddlewares.depth import DepthMiddleware


class MyDepthMiddleware(DepthMiddleware):

    def process_spider_output(self, response, result, spider):
        if hasattr(spider, 'max_depth'):
            self.maxdepth = getattr(spider, 'max_depth')
        return super(MyDepthMiddleware, self).process_spider_output(response, result, spider)
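And a sketch of step 3 in settings.py, disabling the stock middleware and enabling the override (myproject.middlewares is an assumed module path; 900 mirrors the default priority):
# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.depth.DepthMiddleware': None,  # disable the default
    'myproject.middlewares.MyDepthMiddleware': 900,          # enable the override (assumed path)
}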

Call method dynamically if it exists in Django

I'm creating a website based on Django (I know it's pure Python, so maybe it could also be answered by people who know Python well) and I need to call some methods dynamically.
For example, I have a few applications (modules) in my website, each with a method "do_search()" in its views.py. Then I have one module called, for example, "search", and there I want to have an action which can call all the existing "do_search()" methods in the other applications. Of course, I'd rather not import each application and call it directly; I need some better way to do it dynamically.
Can I read the INSTALLED_APPS variable from settings and somehow run through all of the installed apps and look for the specific method? A piece of code would help a lot here :)
Thanks in advance!
Ignas
I'm not sure if I truly understand the question, but please clarify in a comment to my answer if I'm off.
# search.py
searchables = []

def search(search_string):
    return [s.do_search(search_string) for s in searchables]

def register_search_engine(searchable):
    if hasattr(searchable, 'do_search'):
        # you want to see if this is callable also
        searchables.append(searchable)
    else:
        # raise some error perhaps
        raise TypeError('%r has no do_search() method' % searchable)

# views.py
def do_search(search_string):
    # search somehow, and return result
    pass

# models.py
# You need to ensure this registration runs before any attempt at searching can begin;
# models.py is a good place if this app is within INSTALLED_APPS, the reason being that
# this module may not have been imported before the call to search.
import search
from views import do_search

search.register_search_engine(do_search)
As for where to register your search engine, there is some sort of helpful documentation in the signals docs for django which relates to this.
You can put signal handling and registration code anywhere you like. However, you'll need to make sure that the module it's in gets imported early on so that the signal handling gets registered before any signals need to be sent. This makes your app's models.py a good place to put registration of signal handlers.
So your models.py file should be a good place to register your search engine.
Alternative answer that I just thought of:
In your settings.py, you can have a setting that declares all your search functions. Like so:
# settings.py
SEARCH_ENGINES = ('app1.views.do_search', 'app2.views.do_search')
# search.py
from django.conf import settings
from django.utils import importlib

def search(search_string):
    search_results = []
    for engine in settings.SEARCH_ENGINES:
        i = engine.rfind('.')
        module, attr = engine[:i], engine[i + 1:]
        mod = importlib.import_module(module)
        do_search = getattr(mod, attr)
        search_results.append(do_search(search_string))
    return search_results
This works somewhat similarly to registering MIDDLEWARE_CLASSES and TEMPLATE_CONTEXT_PROCESSORS. The above is all untested code, but if you look around the Django source, you should be able to flesh this out and remove any errors.
If you can import the other applications through
import other_app
then it should be possible to perform
method = getattr(other_app, 'do_' + method_name)
result = method()
However, your approach is questionable.
