I have a Flask project running with a subprocess call to a Scrapy spider:
import os
import subprocess
import uuid

class Utilities(object):
    @staticmethod
    def scrape(inputs):
        job_id = str(uuid.uuid4())
        project_folder = os.path.abspath(os.path.dirname(__file__))
        subprocess.call(['scrapy', 'crawl', "ExampleCrawler", "-a", "inputs=" + str(inputs), "-s", "JOB_ID=" + job_id],
                        cwd="%s/scraper" % project_folder)
        return job_id
Even though I have 'Attach to subprocess automatically while debugging' enabled in the project's Python debugger settings, breakpoints inside the spider are never hit. The first breakpoint that works again is the one on return job_id.
Here is a part of the code from the spider where I expect the breakpoints to work:
from scrapy.http import FormRequest
from scrapy.spiders import Spider
from scrapy.loader import ItemLoader
from Handelsregister_Scraper.scraper.items import Product
import re

class ExampleCrawler(Spider):
    name = "ExampleCrawler"

    def __init__(self, inputs='', *args, **kwargs):
        super(ExampleCrawler, self).__init__(*args, **kwargs)
        self.start_urls = ['https://www.example-link.com']
        self.inputs = inputs

    def parse(self, response):
        yield FormRequest(self.start_urls[0], callback=self.parse_list_elements, formdata=self.inputs)
I can't find any solution to this other than enabling said option, which I already did.
Any suggestions on how to get breakpoints inside the spider to work?
The debugger doesn't work because this is not a subprocess in the debugger's sense, but an external call. See this answer for possible workarounds.
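One common workaround is to run the spider in-process with CrawlerProcess instead of shelling out to the scrapy command, so the IDE debugger stays attached. A minimal sketch, assuming the Scrapy project settings are importable from the Flask app (scrape_in_process is a hypothetical replacement for the method above, not the original code):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def scrape_in_process(inputs, job_id):
    # build the project settings and apply the same override the CLI passed with -s
    settings = get_project_settings()
    settings.set("JOB_ID", job_id)
    process = CrawlerProcess(settings)
    # equivalent of `scrapy crawl ExampleCrawler -a inputs=...`
    process.crawl("ExampleCrawler", inputs=str(inputs))
    process.start()  # blocks until the crawl finishes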
Try setting up a dedicated Python run configuration for the spider in "Edit Configurations", and then run that configuration in debug mode.
Is it possible to override Scrapy settings after the init function of a spider?
For example, I want to get settings from a db, and I pass my query parameters as arguments from the cmdline:
def __init__(self, spider_id, **kwargs):
    self.spider_id = spider_id
    self.set_params(spider_id)
    super(Base_Crawler, self).__init__(**kwargs)

def set_params(self, spider_id):
    # TODO:
    # make a query in the db
    # get variables from the query result
    # override settings
    pass
Technically you can "override" settings after the spider is initialized, but it would affect nothing, because most of them have already been applied by then.
What you can actually do is pass parameters to the spider as command-line options using -a, and override project settings using -s, for example:
Spider:
class TheSpider(scrapy.Spider):
    name = 'thespider'

    def __init__(self, *args, **kwargs):
        self.spider_id = kwargs.pop('spider_id', None)
        super(TheSpider, self).__init__(*args, **kwargs)
CLI:
scrapy crawl thespider -a spider_id=XXX -s SETTING_TO_OVERRIDE=YYY
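To illustrate how both values can then be read inside the spider, here is a sketch assuming Scrapy >= 1.0, where the running spider exposes self.settings and self.logger (the setting name is just the placeholder from the command above):

import scrapy

class TheSpider(scrapy.Spider):
    name = 'thespider'

    def __init__(self, *args, **kwargs):
        self.spider_id = kwargs.pop('spider_id', None)
        super(TheSpider, self).__init__(*args, **kwargs)

    def parse(self, response):
        # -a arguments arrive as keyword arguments (popped into self.spider_id above)
        self.logger.info("spider_id: %s", self.spider_id)
        # -s overrides are visible through the spider's settings object
        self.logger.info("override: %s", self.settings.get('SETTING_TO_OVERRIDE'))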
If you need something more advanced, consider writing a custom runner that wraps your spider. Below is an example from the docs:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
Just replace get_project_settings with your own routine that returns a Settings instance.
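For instance, a minimal sketch of such a routine (db_lookup is a hypothetical stand-in for your actual db query):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def db_lookup(spider_id):
    # placeholder for your real db query; should return {setting_name: value}
    return {}

def settings_from_db(spider_id):
    settings = get_project_settings()
    for key, value in db_lookup(spider_id).items():
        settings.set(key, value, priority='cmdline')
    return settings

process = CrawlerProcess(settings_from_db('XXX'))
process.crawl('followall', domain='scrapinghub.com')
process.start()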
In any case, avoid overloading the spider's code with non-scraping logic, to keep it clean and reusable.
I'm using Scrapy. In my Scrapy project I created several spider classes, and, as the official documentation suggests, I used this approach to specify the log file name:
def logging_to_file(file_name):
    """
    :rtype: logging
    :type file_name: str
    :param file_name:
    :return:
    """
    import logging
    from scrapy.utils.log import configure_logging
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename=file_name + '.txt',
        filemode='a',
        format='%(levelname)s: %(message)s',
        level=logging.DEBUG,
    )
class Spider_One(scrapy.Spider):
    name = 'xxx1'
    logging_to_file(name)
    ...

class Spider_Two(scrapy.Spider):
    name = 'xxx2'
    logging_to_file(name)
    ...
Now, if I start Spider_One, everything is correct! But if I start Spider_Two, the log file of Spider_Two will also be named with the name of Spider_One!
I have searched through many answers on Google and Stack Overflow, but unfortunately none worked.
I am using Python 2.7 and Scrapy 1.1.
I hope someone can help me!
It's because logging_to_file runs when your package is loaded. You are calling it at class level here, where you should be calling it per instance.
When Python loads your package or module, it evaluates every class body, and so on.
class MyClass:
    # everything you do here is executed every time the package is loaded
    name = 'something'

    def __init__(self):
        # everything you do here is executed ONLY when an object
        # of this class is created
        pass
To resolve your issue, just move the logging_to_file call into your spider's __init__() method:
class MyClass(Spider):
    name = 'xx1'

    def __init__(self):
        super(MyClass, self).__init__()
        logging_to_file(self.name)
One option is to define the LOG_FILE setting in Spider.custom_settings:
from scrapy import Spider

class MySpider(Spider):
    name = "myspider"
    start_urls = ["https://toscrape.com"]

    custom_settings = {
        "LOG_FILE": f"{name}.log",
    }

    def parse(self, response):
        pass
However, because logging starts before the spider custom settings are read, the first few lines of the log will still go into the standard error output.
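If those first lines matter to you, an alternative sketch (my own suggestion, not from the answers above) is to attach a per-spider FileHandler to the root logger in __init__, so the file name can be built from self.name; the file name pattern here is just an example:

import logging

from scrapy import Spider

class MySpider(Spider):
    name = "myspider"
    start_urls = ["https://toscrape.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        handler = logging.FileHandler(f"{self.name}.log")
        handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
        # attach to the root logger so Scrapy's own messages end up in the file too
        logging.getLogger().addHandler(handler)

    def parse(self, response):
        pass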
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
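A minimal sketch of what such a view might look like (the view name, request handling, and the tasks import path are assumptions, not the original code):

from django.http import HttpResponse

from .tasks import crawl  # assumed location of the shared task shown below

def start_crawl(request):
    domain = request.POST.get('domain', '')
    crawl.delay(domain)
    return HttpResponse("Crawl queued for %s" % domain)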
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task

@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
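Roughly, run_spider follows that answer's pattern. Here is a sketch using the old Scrapy 0.x API this question is based on; the wrapper class from the linked answer is not reproduced here, and the import path of WebSpider is an assumption:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

from .web_spider import WebSpider  # assumed module name

def run_spider(domain):
    spider = WebSpider(domain=domain)
    crawler = Crawler(get_project_settings())
    # stop the reactor once the spider has finished
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    reactor.run()  # blocks until the crawl is done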
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider

class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is that the crawler only crawls the start_urls and does not follow links (or call the callback function) when run through this scenario with Celery. However, calling run_spider through python manage.py shell works just fine!
Another problem is that item pipelines and logging are not working with Celery, which makes debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ method of the parent CrawlSpider calls the _compile_rules function, which, in short, copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider actually uses when it checks the rules. Calling CrawlSpider's initialization before defining the rules therefore led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
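In other words, the reordered __init__ looks like this (the same code as above, with only the super() call moved to the end):

    def __init__(self, **kw):
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]
        # now CrawlSpider._compile_rules() sees self.rules and fills self._rules
        super(WebSpider, self).__init__(**kw)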
Update: There is a small mistake in the code from the previously mentioned SO answer: it causes the reactor to hang after the second call. The fix is simple: in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got the pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read; it was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your Django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.
My scraper works fine when I run it from the command line, but when I try to run it from within a Python script (with the method outlined here, using Twisted) it does not output the two CSV files that it normally does. I have a pipeline that creates and populates these files, one using CsvItemExporter() and the other using writeCsvFile(). Here is the code:
from os import getcwd

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter  # import path for this old (0.x) Scrapy version

# writeCsvFile is a project helper (presumably from spiders/myfuncs.py in the tree below)

class CsvExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        nodes = open('%s_nodes.csv' % spider.name, 'w+b')
        self.files[spider] = nodes
        self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url', 'name', 'screenshot'])
        self.exporter1.start_exporting()
        self.edges = []
        self.edges.append(['Source', 'Target', 'Type', 'ID', 'Label', 'Weight'])
        self.num = 1

    def spider_closed(self, spider):
        self.exporter1.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        writeCsvFile(getcwd() + r'\edges.csv', self.edges)

    def process_item(self, item, spider):
        self.exporter1.export_item(item)
        for url in item['links']:
            self.edges.append([item['url'], url, 'Directed', self.num, '', 1])
            self.num += 1
        return item
Here is my file structure:
SiteCrawler/                  # the CSVs are normally created in this folder
    runspider.py              # this is the script that runs the scraper
    scrapy.cfg
    SiteCrawler/
        __init__.py
        items.py
        pipelines.py
        screenshooter.py
        settings.py
        spiders/
            __init__.py
            myfuncs.py
            sitecrawler_spider.py
The scraper appears to function normally in all other ways. The output at the end of the command-line run suggests that the expected number of pages were crawled, and the spider appears to have finished normally. I am not getting any error messages.
---- EDIT : ----
Inserting print statements and syntax errors into the pipeline has no effect, so it appears that the pipeline is being ignored. Why might this be?
Here is the code for the script that runs the scraper (runspider.py):
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
import logging
from SiteCrawler.spiders.sitecrawler_spider import MySpider
def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('Running reactor...')
reactor.run() # the script will block here until the spider is closed
log.msg('Reactor stopped.')
Replacing "from scrapy.settings import Settings" with "from scrapy.utils.project import get_project_settings as Settings" fixed the problem.
The solution was found here. No explanation of the solution was provided.
alecxe has provided an example of how to run Scrapy from inside a Python script.
EDIT:
Having read through alecxe's post in more detail, I can now see the difference between "from scrapy.settings import Settings" and "from scrapy.utils.project import get_project_settings as Settings". The latter allows you to use your project's settings file, as opposed to a default settings file. Read alecxe's post (linked above) for more detail.
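For completeness, the corrected runspider.py then looks roughly like this (a sketch: the same script as above, with only the settings import changed):

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings  # instead of scrapy.settings.Settings
from scrapy.xlib.pydispatch import dispatcher
import logging

from SiteCrawler.spiders.sitecrawler_spider import MySpider

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(get_project_settings())  # now loads SiteCrawler/settings.py, so the pipeline is enabled
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
reactor.run()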
In my project I call the Scrapy code inside another Python script using os.system:
import os
os.chdir('/home/admin/source/scrapy_test')
command = "scrapy crawl test_spider -s FEED_URI='file:///home/admin/scrapy/data.csv' -s LOG_FILE='/home/admin/scrapy/scrapy_test.log'"
return_code = os.system(command)
print 'done'
I have multiple spiders in one project. The problem is that right now I am defining LOG_FILE in settings.py like this:
LOG_FILE = "scrapy_%s.log" % datetime.now()
What I want is scrapy_SPIDERNAME_DATETIME,
but I am unable to provide the spider name in the log file name.
I found
scrapy.log.start(logfile=None, loglevel=None, logstdout=None)
and called it in each spider's __init__ method, but it's not working.
Any help would be appreciated.
The spider's __init__() is not early enough to call log.start() by itself since the log observer is already started at this point; therefore, you need to reinitialize the logging state to trick Scrapy into (re)starting it.
In your spider class file:
from datetime import datetime
from scrapy import log
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def __init__(self, name=None, **kwargs):
        LOG_FILE = "scrapy_%s_%s.log" % (self.name, datetime.now())
        # remove the current log
        # log.log.removeObserver(log.log.theLogPublisher.observers[0])
        # re-create the default Twisted observer which Scrapy checks
        log.log.defaultObserver = log.log.DefaultObserver()
        # start the default observer so it can be stopped
        log.log.defaultObserver.start()
        # trick Scrapy into thinking logging has not started
        log.started = False
        # start the new log file observer
        log.start(LOG_FILE)
        # continue with the normal spider init
        super(ExampleSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        ...
And the output file might look like:
scrapy_example_2012-08-25 12:34:48.823896.log
There should be a BOT_NAME in your settings.py. This is the project (bot) name. So in your case, this would be:
LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())
This is pretty much the same as what Scrapy does internally.
But why not use log.msg? The docs clearly state that it is for spider-specific logging, so it might be easier to use it and just extract/grep the messages of the different spiders from one big log file.
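For illustration, a sketch using the old scrapy.log API this question is based on (the spider= argument tags the message with the spider's name, so it can be filtered later; the URLs are just examples):

from scrapy import log
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # the message carries this spider's name in the shared log file
        log.msg("parsed %s" % response.url, level=log.INFO, spider=self)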
A more complicated approach would be to get the spider locations from the SPIDER_MODULES list and load all the spiders inside those packages.
You can use Scrapy's storage URI parameters in your settings.py file for the feed URI:
%(name)s
%(time)s
For example: /tmp/crawled/%(name)s/%(time)s.log
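A sketch of how that could be set (FEED_URI and FEED_FORMAT are the standard feed export settings; the path and csv format here are just examples):

# settings.py
FEED_URI = 'file:///tmp/crawled/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'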