I have next model:
Command 'collect' (collect_positions.py) -> Celery task (tasks.py) -> ScrappySpider (MySpider) ...
collect_positions.py:
from django.core.management.base import BaseCommand
from tracker.models import Keyword
from tracker.tasks import positions
class Command(BaseCommand):
help = 'collect_positions'
def handle(self, *args, **options):
def chunks(l, n):
"""Yield successive n-sized chunks from l."""
for i in range(0, len(l), n):
yield l[i:i + n]
chunk_size = 1
keywords = Keyword.objects.filter(product=product).values_list('id', flat=True)
chunks_list = list(chunks(keywords, chunk_size))
positions.chunks(chunks_list, 1).apply_async(queue='collect_positions')
return 0
tasks.py:
from app_name.celery import app
from scrapy.settings import Settings
from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider
from tracker.models import Keyword
from scrapy.crawler import CrawlerProcess
#app.task
def positions(*args):
s = Settings()
s.setmodule(scrapy_settings)
keywords = Keyword.objects.filter(id__in=list(args))
process = CrawlerProcess(s)
process.crawl(MySpider, keywords_chunk=keywords)
process.start()
return 1
I run the command through the command line, which creates tasks for parsing. The first queue completes successfully, but other returned an error:
twisted.internet.error.ReactorNotRestartable
Please tell me how can I fix this error?
I can provide any data if there is a need...
UPDATE 1
Thanks for the answer, #Chiefir! I managed to run all queues, but only the start_requests() function is started, and parse() does not run.
The main functions of the scrappy spider:
def start_requests(self):
print('STEP1')
yield scrapy.Request(
url='exmaple.com',
callback=self.parse,
errback=self.error_callback,
dont_filter=True
)
def error_callback(self, failure):
print(failure)
# log all errback failures,
# in case you want to do something special for some errors,
# you may need the failure's type
print(repr(failure))
# if isinstance(failure.value, HttpError):
if failure.check(HttpError):
# you can get the response
response = failure.value.response
print('HttpError on %s', response.url)
# elif isinstance(failure.value, DNSLookupError):
elif failure.check(DNSLookupError):
# this is the original request
request = failure.request
print('DNSLookupError on %s', request.url)
# elif isinstance(failure.value, TimeoutError):
elif failure.check(TimeoutError):
request = failure.request
print('TimeoutError on %s', request.url)
def parse(self, response):
print('STEP2', response)
In the console I get:
STEP1
What could be the reason?
This is old question as a world:
This is what helped for me to win the battle against ReactorNotRestartable error: last answer from the author of the question
0) pip install crochet
1) import from crochet import setup
2) setup() - at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error, and spend 4+ hours to solve this problem, read all questions here about it. Finally found that one - and share it. That is how i solved this. The only meaningful lines from Scrapy docs left are 2 last lines in this my code:
#some more imports
from crochet import setup
setup()
def run_spider(spiderName):
module_name="first_scrapy.spiders.{}".format(spiderName)
scrapy_var = import_module(module_name) #do some dynamic import of selected spider
spiderObj=scrapy_var.mySpider() #get mySpider-object from spider module
crawler = CrawlerRunner(get_project_settings()) #from Scrapy docs
crawler.crawl(spiderObj) #from Scrapy docs
This code allows me to select what spider to run just with its name passed to run_spider function and after scrapping finishes - select another spider and run it again.
In your case you need in separate file create separate function which runs your spiders and run it from your task. Usually I do in this way :)
P.S. And really there is no way to restart the TwistedReactor.
UPDATE 1
I don't know if you need to call a start_requests() method. For me it usually works just with this code:
class mySpider(scrapy.Spider):
name = "somname"
allowed_domains = ["somesite.com"]
start_urls = ["https://somesite.com"]
def parse(self, response):
pass
def parse_dir_contents(self, response): #for crawling additional links
pass
You can fix this by setting the parameter stop_after_crawl to False on the start method of CrawlerProcess:
stop_after_crawl (bool) – stop or not the reactor when all crawlers have finished
#shared_task
def crawl(m_id, *args, **kwargs):
process = CrawlerProcess(get_project_settings(), install_root_handler=False)
process.crawl(SpiderClass, m_id=m_id)
process.start(stop_after_crawl=False)
Related
I was trying to use the following function to wait for a crawler to finish and return all results. However, this function always returns immediately when called while the crawler is still running. What am I missing here? Aren't join() supposed to wait?
def spider_results():
runner = CrawlerRunner(get_project_settings())
results = []
def crawler_results(signal, sender, item, response, spider):
results.append(item)
dispatcher.connect(crawler_results, signal=signals.item_passed)
runner.crawl(QuotesSpider)
runner.join()
return results
Accordig to scrapy docs (common practices section)
CrawlerProcess class is recommended to use in this cases.
I want to run multiple spiders, so i try to use CrawlerProcess. But i find the method open_spider will run two times at the beginning and the end with process_item method.
It causes when the spider open , i remove my collection and save the data into mongodb completed. It will remove my collection again finally.
How do i fix the issue and why the method open_spider run two times ?
I tyep scrapy crawl movies run the project:
Here is my movies.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import time
# scrapy api imports
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from Tainan.FirstSpider import FirstSpider
class MoviesSpider(scrapy.Spider):
name = 'movies'
allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']
process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.start()
It's my FirstSpider.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
class FirstSpider(scrapy.Spider):
name = 'first'
allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']
def parse(self, response):
movieHrefs = response.xpath('//*[#class="release_movie_name"]/a/#href').extract()
for movieHref in movieHrefs:
yield Request(movieHref, callback=self.parse_page)
def parse_page(self, response):
print 'FirstSpider => parse_page'
movieImage = response.xpath('//*[#class="foto"]/img/#src').extract()
cnName = response.xpath('//*[#class="movie_intro_info_r"]/h1/text()').extract()
enName = response.xpath('//*[#class="movie_intro_info_r"]/h3/text()').extract()
movieDate = response.xpath('//*[#class="movie_intro_info_r"]/span/text()')[0].extract()
movieTime = response.xpath('//*[#class="movie_intro_info_r"]/span/text()')[1].extract()
imdbScore = response.xpath('//*[#class="movie_intro_info_r"]/span/text()')[3].extract()
movieContent = response.xpath('//*[#class="gray_infobox_inner"]/span/text()').extract_first().strip()
yield {'image': movieImage, 'cnName': cnName, 'enName': enName, 'movieDate': movieDate, 'movieTime': movieTime, 'imdbScore': imdbScore, 'movieContent': movieContent}
It's my pipelines.py:
from pymongo import MongoClient
from scrapy.conf import settings
class MongoDBPipeline(object):
global open_count
open_count = 1
global process_count
process_count = 1
def __init__(self):
connection = MongoClient(
settings['MONGODB_SERVER'],
settings['MONGODB_PORT'])
db = connection[settings['MONGODB_DB']]
self.collection = db[settings['MONGODB_COLLECTION']]
# My issue is here it will print open_spider count = 2 finally.
def open_spider(self, spider):
global open_count
print 'Pipelines => open_spider count =>'
print open_count
open_count += 1
self.collection.remove({})
# open_spider method call first time and process_item save data to my mongodb.
# but when process_item completed, open_spider method run again...it cause my data that i have saved it has been removed.
def process_item(self, item, spider):
global process_count
print 'Pipelines => process_item count =>'
print process_count
process_count += 1
self.collection.insert(dict(item))
return item
I can't figure it out, some one can help me out that would be appreciated. Thanks in advance.
How do i fix the issue and why the method open_spider run two times ?
The open_spider method runs once per spider, and you're running two spiders.
I tyep scrapy crawl movies run the project
The crawl command will run the spider named movies (MoviesSpider).
To do this, it has to import the movies module, which will cause it to run your FirstSpider as well.
Now, how to fix this depends on what you want to do.
Maybe you should only run a single spider, or have separate settings per spider, or maybe something entirely different.
I have a problem. I need to stop the execution of a function for a while, but not stop the implementation of parsing as a whole. That is, I need a non-blocking pause.
It's looks like:
class ScrapySpider(Spider):
name = 'live_function'
def start_requests(self):
yield Request('some url', callback=self.non_stop_function)
def non_stop_function(self, response):
for url in ['url1', 'url2', 'url3', 'more urls']:
yield Request(url, callback=self.second_parse_function)
# Here I need some function for sleep only this function like time.sleep(10)
yield Request('some url', callback=self.non_stop_function) # Call itself
def second_parse_function(self, response):
pass
Function non_stop_function needs to be stopped for a while, but it should not block the rest of the output.
If I insert time.sleep() - it will stop the whole parser, but I don't need it. Is it possible to stop one function using twisted or something else?
Reason: I need to create a non-blocking function that will parse the page of the website every n seconds. There she will get urls and fill for 10 seconds. URLs that have been obtained will continue to work, but the main feature needs to sleep.
UPDATE:
Thanks to TkTech and viach. One answer helped me to understand how to make a pending Request, and the second is how to activate it. Both answers complement each other and I made an excellent non-blocking pause for Scrapy:
def call_after_pause(self, response):
d = Deferred()
reactor.callLater(10.0, d.callback, Request(
'https://example.com/',
callback=self.non_stop_function,
dont_filter=True))
return d
And use this function for my request:
yield Request('https://example.com/', callback=self.call_after_pause, dont_filter=True)
Request object has callback parameter, try to use that one for the purpose.
I mean, create a Deferred which wraps self.second_parse_function and pause.
Here is my dirty and not tested example, changed lines are marked.
class ScrapySpider(Spider):
name = 'live_function'
def start_requests(self):
yield Request('some url', callback=self.non_stop_function)
def non_stop_function(self, response):
parse_and_pause = Deferred() # changed
parse_and_pause.addCallback(self.second_parse_function) # changed
parse_and_pause.addCallback(pause, seconds=10) # changed
for url in ['url1', 'url2', 'url3', 'more urls']:
yield Request(url, callback=parse_and_pause) # changed
yield Request('some url', callback=self.non_stop_function) # Call itself
def second_parse_function(self, response):
pass
If the approach works for you then you can create a function which constructs a Deferred object according to the rule. It could be implemented in the way like the following:
def get_perform_and_pause_deferred(seconds, fn, *args, **kwargs):
d = Deferred()
d.addCallback(fn, *args, **kwargs)
d.addCallback(pause, seconds=seconds)
return d
And here is possible usage:
class ScrapySpider(Spider):
name = 'live_function'
def start_requests(self):
yield Request('some url', callback=self.non_stop_function)
def non_stop_function(self, response):
for url in ['url1', 'url2', 'url3', 'more urls']:
# changed
yield Request(url, callback=get_perform_and_pause_deferred(10, self.second_parse_function))
yield Request('some url', callback=self.non_stop_function) # Call itself
def second_parse_function(self, response):
pass
If you're attempting to use this for rate limiting, you probably just want to use DOWNLOAD_DELAY instead.
Scrapy is just a framework on top of Twisted. For the most part, you can treat it the same as any other twisted app. Instead of calling sleep, just return the next request to make and tell twisted to wait a bit. Ex:
from twisted.internet import reactor, defer
def non_stop_function(self, response)
d = defer.Deferred()
reactor.callLater(10.0, d.callback, Request(
'some url',
callback=self.non_stop_function
))
return d
The asker already provides an answer in the question's update, but I want to give a slightly better version so it's reusable for any request.
# removed...
from twisted.internet import reactor, defer
class MySpider(scrapy.Spider):
# removed...
def request_with_pause(self, response):
d = defer.Deferred()
reactor.callLater(response.meta['time'], d.callback, scrapy.Request(
response.url,
callback=response.meta['callback'],
dont_filter=True, meta={'dont_proxy':response.meta['dont_proxy']}))
return d
def parse(self, response):
# removed....
yield scrapy.Request(the_url, meta={
'time': 86400,
'callback': self.the_parse,
'dont_proxy': True
}, callback=self.request_with_pause)
For explanation, Scrapy use Twisted to manage the request asynchronously, so we need Twisted's tool to do a delayed request too.
I'm a writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine . I'am using Django, Scrapy, and Celery for achieving this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task
#shared_task
def crawl(domain):
return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider
class WebSpider(CrawlSpider):
name = "web"
def __init__(self, **kw):
super(WebSpider, self).__init__(**kw)
url = kw.get('domain') or kw.get('url')
if not (url.startswith('http://') or url.startswith('https://')):
url = "http://%s/" % url
self.url = url
self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
self.start_urls = [url]
self.rules = [
Rule(SgmlLinkExtractor(
allow_domains=self.allowed_domains,
unique=True), callback='parse_page', follow=True)
]
def parse_start_url(self, response):
return self.parse_page(response)
def parse_page(self, response):
sel = Selector(response)
item = HTMLPageItem()
item['url'] = response.request.url
item['title'] = sel.xpath('//title/text()').extract()[0]
item.save()
return item
The problem is the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function which in short copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules . Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
Update: There is a little mistake in code from the previously mentioned SO answer. It causes the reactor to hang after second call. The fix is simple, in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.
My scraper works fine when I run it from the command line, but when I try to run it from within a python script (with the method outlined here using Twisted) it does not output the two CSV files that it normally does. I have a pipeline that creates and populates these files, one of them using CsvItemExporter() and the other using writeCsvFile(). Here is the code:
class CsvExportPipeline(object):
def __init__(self):
self.files = {}
#classmethod
def from_crawler(cls, crawler):
pipeline = cls()
crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
return pipeline
def spider_opened(self, spider):
nodes = open('%s_nodes.csv' % spider.name, 'w+b')
self.files[spider] = nodes
self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url','name','screenshot'])
self.exporter1.start_exporting()
self.edges = []
self.edges.append(['Source','Target','Type','ID','Label','Weight'])
self.num = 1
def spider_closed(self, spider):
self.exporter1.finish_exporting()
file = self.files.pop(spider)
file.close()
writeCsvFile(getcwd()+r'\edges.csv', self.edges)
def process_item(self, item, spider):
self.exporter1.export_item(item)
for url in item['links']:
self.edges.append([item['url'],url,'Directed',self.num,'',1])
self.num += 1
return item
Here is my file structure:
SiteCrawler/ # the CSVs are normally created in this folder
runspider.py # this is the script that runs the scraper
scrapy.cfg
SiteCrawler/
__init__.py
items.py
pipelines.py
screenshooter.py
settings.py
spiders/
__init__.py
myfuncs.py
sitecrawler_spider.py
The scraper appears to function normally in all other ways. The output at the end in the command line suggests that the expected number of pages were crawled and the spider appears to have finished normally. I am not getting any error messages.
---- EDIT : ----
Inserting print statements and syntax errors into the pipeline has no effect, so it appears that the pipeline is being ignored. Why might this be?
Here is the code for the script that runs the scraper (runspider.py):
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
import logging
from SiteCrawler.spiders.sitecrawler_spider import MySpider
def stop_reactor():
reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('Running reactor...')
reactor.run() # the script will block here until the spider is closed
log.msg('Reactor stopped.')
Replacing "from scrapy.settings import Settings" with "from scrapy.utils.project import get_project_settings as Settings" fixed the problem.
The solution was found here. No explanation of the solution was provided.
alecxe has provided an example of how to run Scrapy from inside a Python script.
EDIT:
Having read through alecxe's post in more detail, I can now see the difference between "from scrapy.settings import Settings" and "from scrapy.utils.project import get_project_settings as Settings". The latter allows you to use your project's settings file, as opposed to a defualt settings file. Read alecxe's post (linked to above) for more detail.
In my project i call the scrapy code inside another python script using os.system
import os
os.chdir('/home/admin/source/scrapy_test')
command = "scrapy crawl test_spider -s FEED_URI='file:///home/admin/scrapy/data.csv' -s LOG_FILE='/home/admin/scrapy/scrapy_test.log'"
return_code = os.system(command)
print 'done'