I currently have DEPTH_LIMIT set in the settings module for the scraper I am building. I would like to be able to pass a depth limit in as a command line argument. I have tried the following as a constructor for the crawler (and variations of it):
def __init__(self, max_depth=3, *args, **kwargs):
    super(MySpider, self).__init__(*args, **kwargs)
    self.settings['DEPTH_LIMIT'] = int(max_depth)
but I get an error, and the stack dump ends with:
File "/usr/local/lib/python2.7/dist-packages/scrapy/spider.py", line 41, in crawler
assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
AssertionError: Spider not bounded to any crawler
Even trying to print self.settings['DEPTH_LIMIT'] in the constructor causes an error. How can I set DEPTH_LIMIT in a crawler from a command line argument?
Thanks!
You may try this approach:
def __init__(self, *args, **kwargs):
    self.settings['DEPTH_LIMIT'] = int(kwargs.pop('max_depth', 3))
    super(MySpider, self).__init__(*args, **kwargs)
For details on pop, see the official Python documentation.
If this does not work, please add more code showing how you created the crawler object (e.g. the class definition, and where you define the settings attribute).
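For what it's worth, the assertion in your traceback fires because self.settings is a property that delegates to the crawler, and the spider is only bound to a crawler after __init__ has finished, which is why even printing self.settings['DEPTH_LIMIT'] in the constructor fails. If the goal is just to override the setting per run, Scrapy can also set any setting directly from the command line with the -s flag, which sidesteps the problem entirely:

scrapy crawl myspider -s DEPTH_LIMIT=2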
I'm building a custom class to add features to selenium.webdriver.Chrome on Python 3.6.2.
from selenium import webdriver

class MyChrome:
    def __init__(self):
        self.mydriver = webdriver.Chrome()
So far, besides some custom methods I wrote myself, I have overridden some very standard selenium.webdriver.Chrome methods like this:
def get(self, url):
    self.mydriver.get(url)
Since I don't want to waste time rewriting methods like get, find_element_by_xpath, etc. that already work fine for me, I tried the following, as suggested here and here:
def __getattr__(self, name, *args, **kwargs):
    return getattr(self.mydriver, name)(*args, **kwargs)
But when I run the following code
from selenium import webdriver

class MyChrome:
    def __init__(self):
        self.mydriver = webdriver.Chrome()

    def __getattr__(self, name, *args, **kwargs):
        return getattr(self.mydriver, name)(*args, **kwargs)

chrome = MyChrome()
chrome.get('https://stackoverflow.com/')
I encounter the error
Traceback (most recent call last):
  File "MyChrome.py", line 11, in <module>
    chrome.get('https://stackoverflow.com/')
  File "MyChrome.py", line 8, in __getattr__
    return getattr(self.mydriver, name)(*args, **kwargs)
TypeError: get() missing 1 required positional argument: 'url'
How do I redirect calls to unknown methods on my object chrome to its instance variable self.mydriver?
I created a custom WebDriver once to add features to the Selenium Chrome WebDriver, and I did that by subclassing it. This way you inherit all the WebDriver methods without a __getattr__:
from selenium.webdriver import Chrome

class MyChrome(Chrome):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # add all your custom features
It doesn't answer your question directly, but it's something you can leverage.
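As an aside, the TypeError in the question happens because __getattr__ only ever receives the attribute name: when you write chrome.get('https://...'), Python first looks up chrome.get (which calls __getattr__('get') with an empty *args) and only afterwards calls the returned value with the URL, so calling the looked-up method inside __getattr__ invokes it with no arguments. A minimal delegation sketch that returns the attribute instead of calling it:

def __getattr__(self, name):
    # Return the wrapped driver's attribute as-is; the caller
    # supplies the arguments when it invokes the bound method.
    return getattr(self.mydriver, name)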
I am editing Python scripts for the first time and it is driving me crazy :(
I have a directory with *.py files that I added to PyCharm as interpreter paths, for correct auto-completion.
So, I have some class:
class Sim():
    def __init__(self, *args, **kwargs):
        self._sim_info = None
I am a Java and C++ programmer and I am used to variables having a class type.
I know that at runtime this variable will hold a value of type SimInfo.
But when PyCharm's indexer indexed those *.py files, it only learned that Sim._sim_info has the value None. How can I specify that in the code
s1 = Sim()
i = s1.sim_info
the variable i has the type SimInfo?
Maybe I should use something like "editor hints" to force auto-completion for i.is_ghost?
For example, in the code
from simulation.sims.sim import Sim
from simulation.sims.sim_info import SimInfo
from simulation.sims.pregnancy.pregnancy_tracker import PregnancyOffspringData

s1 = Sim()
i = s1.sim_info
i.is_ghos
i.is_ghos should be auto-completed to i.is_ghost().
How do I specify variable types in this case (maybe via something like editor hints)?
Thank you very much!
Python 3.6:

from simulation.sims.sim_info import SimInfo

class Sim():
    def __init__(self, *args, **kwargs):
        self._sim_info: SimInfo = None

Other Python versions (SimInfo still has to be importable for the hint to resolve):

class Sim():
    def __init__(self, *args, **kwargs):
        self._sim_info = None  # type: SimInfo
This is called "type hints".
You can use type hinting with PyCharm using the #type docstring:
def __init__(self, *args, **kwargs):
    # n.b., in your usage code you use `.sim_info`
    # but in your constructor you used `._sim_info`
    # since I didn’t see a prop get function, I assumed
    # that you meant `.sim_info`
    self.sim_info = None
    """#type: SimInfo"""
or
def __init__(self, *args, **kwargs):
    self.sim_info = None
    """:type SimInfo"""
You can also specify the full path to the class, including the module, if just the class name does not work. You can also use PEP 484 syntax in PyCharm to specify the type of member variables:
def __init__(self):
    self.sim_info = None  # type: SimInfo
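Putting the pieces together with the names from the question (the simulation.sims import path comes from the question's own imports, so this only resolves inside that script environment; the property getter is my assumption, added to bridge the ._sim_info / .sim_info mismatch noted above):

from simulation.sims.sim_info import SimInfo

class Sim():
    def __init__(self, *args, **kwargs):
        self._sim_info = None  # type: SimInfo

    @property
    def sim_info(self):
        return self._sim_info

s1 = Sim()
i = s1.sim_info
# Typing "i.is_gho" in PyCharm now offers is_ghost,
# because i is inferred as SimInfo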
I have two similar management commands with a lot of common code. I want to put the common code in a MyClass that extends NoArgsCommand, and then create commands, let us say CommandA and CommandB, that extend MyClass. I have a handle method in CommandA and CommandB and am trying to call super.handle. I am getting the error type object 'super' has no attribute 'handle'.
Valid Python syntax for calling super is:
def handle(self, *args, **options):
    super(CommandA, self).handle(*args, **options)
If you use Python 3, you can omit the super() arguments:
def handle(self, *args, **options):
    super().handle(*args, **options)
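For context, a minimal sketch of the layout the question describes (the class names MyClass and CommandA and the NoArgsCommand base come from the question; NoArgsCommand was later deprecated in favor of BaseCommand, but the super mechanics are the same):

from django.core.management.base import NoArgsCommand

class MyClass(NoArgsCommand):
    def handle(self, *args, **options):
        # common code shared by CommandA and CommandB
        self.stdout.write("common setup")

class CommandA(MyClass):
    def handle(self, *args, **options):
        # super(...) takes (class, instance); bare super.handle fails
        super(CommandA, self).handle(*args, **options)
        self.stdout.write("CommandA-specific work")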
I am searching for a way to run a module while replacing imports. This would be the missing magic to implement run_patched in the following pseudocode.
from argparse import ArgumentParser

class ArgumentCounter(ArgumentParser):
    def __init__(self, *args, **kwargs):
        # set the counter before super().__init__, which itself
        # calls add_argument (for the default -h option)
        self.arg_counter = 0
        super().__init__(*args, **kwargs)

    def add_argument(self, *args, **kwargs):
        self.arg_counter += 1
        return super().add_argument(*args, **kwargs)

    def parse_args(self, *args, **kwargs):
        result = super().parse_args(*args, **kwargs)
        print(self.arg_counter)
        return result

# run_patched is the missing piece this question asks about
run_patched('test.test_argparse', ArgumentParser=ArgumentCounter)
I know that single methods can be replaced by assignment, for example stating ArgumentParser.parse_args = print, so I was tempted to mess with globals like sys.modules and then execute the module via runpy.run_module.
Unfortunately, the whole strategy has to work in a multithreaded scenario, so the change should only affect the module being executed, while other parts of the program continue to use the unpatched module(s) as if they were never touched.
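For illustration, one possible sketch of run_patched built from the ingredients mentioned above, using unittest.mock.patch for the temporary assignment plus runpy.run_module. Note the caveat: mock.patch swaps the attribute on the argparse module process-wide for the duration of the run, so it does not achieve the per-thread isolation asked for here:

import runpy
from unittest import mock

def run_patched(module_name, **replacements):
    # Patch each named attribute on the argparse module, run the
    # target module as __main__, then restore the originals.
    patchers = [mock.patch('argparse.%s' % name, replacement)
                for name, replacement in replacements.items()]
    for patcher in patchers:
        patcher.start()
    try:
        return runpy.run_module(module_name, run_name='__main__')
    finally:
        for patcher in patchers:
            patcher.stop()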
I am setting up logging for Scrapy by doing this:
from scrapy import log
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"

    def __init__(self, name=None, **kwargs):
        LOG_FILE = "logs/spider.log"
        log.log.defaultObserver = log.log.DefaultObserver()
        log.log.defaultObserver.start()
        log.started = False
        log.start(LOG_FILE, loglevel=log.INFO)
        super(MySpider, self).__init__(name, **kwargs)

    def parse(self, response):
        ....
        raise Exception("Something went wrong!")
        log.msg('Something went wrong!', log.ERROR)
        # Somehow write to a separate error log here.
Then I run the spider like this:
scrapy crawl myspider
This would store all the log.INFO data as well as log.ERROR into spider.log.
If an error occurs, I would also like to store those details in a separate log file called spider_errors.log. It would make it easier to search for errors that occurred rather than trying to scan through the entire spider.log file (which could be huge).
Is there a way to do this?
EDIT:
Trying with PythonLoggingObserver:
def __init__(self, name=None, **kwargs):
    LOG_FILE = 'logs/spider.log'
    ERR_FILE = 'logs/spider_error.log'
    observer = log.log.PythonLoggingObserver()
    observer.start()
    log.started = False
    log.start(LOG_FILE, loglevel=log.INFO)
    log.start(ERR_FILE, loglevel=log.ERROR)
But I get ERROR: No handlers could be found for logger "twisted"
Just let logging do the job. Try to use PythonLoggingObserver instead of DefaultObserver:
Configure two loggers (one for INFO and one for ERROR messages) directly in Python, or via fileConfig, or via dictConfig (see the docs); a minimal sketch is shown below.
Start the observer in the spider's __init__:
def __init__(self, name=None, **kwargs):
    # TODO: configure logging, e.g. logging.config.fileConfig("logging.conf")
    observer = log.PythonLoggingObserver()
    observer.start()
Let me know if you need help with configuring loggers.
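For illustration, a minimal sketch of that two-logger setup done directly in Python (the file names are assumptions matching your paths; this routes INFO and above to spider.log and additionally ERROR and above to spider_error.log):

import logging

# Everything at INFO and above goes to spider.log
logging.basicConfig(
    filename='logs/spider.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s')

# Errors additionally go to spider_error.log
err_handler = logging.FileHandler('logs/spider_error.log')
err_handler.setLevel(logging.ERROR)
logging.getLogger().addHandler(err_handler)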
EDIT:
Another option is to start two file log observers in the spider's __init__ (the INFO observer captures everything at INFO level and above, including errors, while the ERROR observer captures only the errors):
import logging

from scrapy.log import ScrapyFileLogObserver
from scrapy import log
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = "myspider"

    def __init__(self, name=None, **kwargs):
        ScrapyFileLogObserver(open("spider.log", 'w'), level=logging.INFO).start()
        ScrapyFileLogObserver(open("spider_error.log", 'w'), level=logging.ERROR).start()
        super(MySpider, self).__init__(name, **kwargs)

    ...