Using Scrapy, I ran into a problem with JavaScript-rendered pages. For the Forum Franchise site, for example the link http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69, when scraping the source HTML I couldn't retrieve any posts, because they seem to be "appended" after the page is rendered (probably through JavaScript).
So I looked around the net for a solution to this problem, and I came across https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/ .
I am completely new to PyQt, but was hoping to take a shortcut and copy-paste some code.
This worked perfectly when I tried to scrape a single page. But when I implemented it in Scrapy, I got the following error:
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
If I scrape a single page, no error occurs, but when I set the crawler to recursive mode, right at the second link I get a message that python.exe stopped working, along with the error above.
I have been searching for what this could be, and somewhere I read that a QApplication object should only be instantiated once.
Could someone please tell me what the proper implementation should be?
The Spider
# -*- coding: utf-8 -*-
import scrapy
import sys, traceback
from bs4 import BeautifulSoup as bs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import ThreadItem, PostItem
from crawler.utils import utils


class IdeefranchiseSpider(CrawlSpider):
    name = "ideefranchise"
    allowed_domains = ["idee-franchise.com"]
    start_urls = (
        'http://www.idee-franchise.com/forum/',
        # 'http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69',
    )

    rules = [
        Rule(LinkExtractor(allow='/forum/'), callback='parse_thread', follow=True)
    ]

    def parse_thread(self, response):
        print "Parsing Thread", response.url
        thread = ThreadItem()
        thread['url'] = response.url
        thread['domain'] = self.allowed_domains[0]
        thread['title'] = self.get_thread_title(response)
        thread['forumname'] = self.get_thread_forum_name(response)
        thread['posts'] = self.get_thread_posts(response)
        yield thread

        # paginate if possible
        next_page = response.css('fieldset.display-options > a::attr("href")')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_thread)

    def get_thread_posts(self, response):
        # using PYQTPageRenderor to reload the page. I think this is where the
        # problem occurs, when I instantiate the PYQTPageRenderor object.
        soup = bs(unicode(utils.PYQTPageRenderor(response.url).get_html()))
        # sleep so that PyQt can render the page
        # time.sleep(5)
        # comments
        posts = []
        for item in soup.select("div.post.bg2") + soup.select("div.post.bg1"):
            try:
                post = PostItem()
                post['profile'] = item.select("p.author > strong > a")[0].get_text()
                details = item.select('dl.postprofile > dd')
                post['date'] = details[2].get_text()
                post['content'] = item.select('div.content')[0].get_text()
                # appending the comment
                posts.append(post)
            except:
                e = sys.exc_info()[0]
                self.logger.critical("ERROR GET_THREAD_POSTS %s", e)
                traceback.print_exc(file=sys.stdout)
        return posts
The PyQt implementation
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()


class PYQTPageRenderor(object):
    def __init__(self, url):
        self.url = url

    def get_html(self):
        r = Render(self.url)
        return unicode(r.frame.toHtml())
The proper implementation, if you want to do it yourself, would be to create a downloader middleware that uses PyQt to process requests. It will be instantiated only once by Scrapy.
It should not be that complicated; a rough sketch is given after these steps.
Create a QTDownloader class in the middleware.py file of your project.
The constructor should create the QApplication object.
The process_request method should do the URL loading and HTML fetching. Note that you return a Response object containing the HTML string.
You might do appropriate clean-up in a _cleanup method of your class.
Finally, activate your middleware by adding it to the DOWNLOADER_MIDDLEWARES variable in the settings.py file of your project.
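For illustration, here is a rough, untested sketch of what such a QTDownloader middleware could look like, reusing the Render logic from the question above (the class name, module path and priority number are assumptions, not a definitive implementation):

import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from scrapy.http import HtmlResponse


class QTDownloader(object):

    def __init__(self):
        # Created only once, when Scrapy instantiates the middleware.
        self.app = QApplication(sys.argv)

    def process_request(self, request, spider):
        # Load the page in a QWebPage and block until loadFinished fires.
        page = QWebPage()
        page.loadFinished.connect(self.app.quit)
        page.mainFrame().load(QUrl(request.url))
        self.app.exec_()

        body = unicode(page.mainFrame().toHtml()).encode('utf-8')
        # Returning a Response here short-circuits the normal download
        # for this request.
        return HtmlResponse(url=request.url, body=body,
                            encoding='utf-8', request=request)

    def _cleanup(self):
        # Any tear-down you need (e.g. releasing Qt resources) goes here.
        pass

And in settings.py (the priority value is arbitrary):

DOWNLOADER_MIDDLEWARES = {
    'crawler.middleware.QTDownloader': 543,
}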
If you don't want to write your own solution, you could use an existing middleware that uses Selenium to do the downloading, like scrapy-webdriver. If you don't want to have a visible browser, you can instruct it to use PhantomJS.
EDIT1:
So the proper way to do it, as pointed out by Rejected, is to use a download handler. The idea is similar, but the downloading should happen in a download_request method, and the handler should be enabled by adding it to DOWNLOAD_HANDLERS. Take a look at the WebdriverDownloadHandler for an example.
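For reference, a bare-bones skeleton of that approach might look like this (illustrative only, not tested; the class name and module path are assumptions, and the actual page loading still has to be filled in):

class QTDownloadHandler(object):

    def __init__(self, settings):
        self.settings = settings

    def download_request(self, request, spider):
        # Do the actual page loading here and return a deferred that
        # fires with a Response (see WebdriverDownloadHandler for how
        # scrapy-webdriver does this).
        raise NotImplementedError

It is then enabled per URL scheme in settings.py:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.handlers.QTDownloadHandler',
    'https': 'crawler.handlers.QTDownloadHandler',
}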
Related
I'm trying to capture "finish_reason" in Scrapy after each crawl and insert this info into a database. The crawl instance is created in a pipeline before the first item is collected.
It seems like I have to use the "engine_stopped" signal, but I couldn't find an example of how or where I should put my code to do this.
One possible option is to override scrapy.statscollectors.MemoryStatsCollector (docs, code) and its close_spider method:
middlewares.py:
import pprint

from scrapy.statscollectors import MemoryStatsCollector, logger


class MemoryStatsCollectorSender(MemoryStatsCollector):

    # Override close_spider method
    def close_spider(self, spider, reason):
        # finish_reason is in the reason variable
        # add your data sending code here
        if self._dump:
            logger.info("Dumping Scrapy stats:\n" + pprint.pformat(self._stats),
                        extra={'spider': spider})
        self._persist_stats(self._stats, spider)
Add the newly created stats collector class to settings.py:
STATS_CLASS = 'project.middlewares.MemoryStatsCollectorSender'
#STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
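For illustration, the "data sending code" placeholder above could be as simple as writing the reason into a table; a minimal, hypothetical sqlite sketch (the database file and table names are made up):

import sqlite3

def send_finish_reason(spider_name, reason):
    # Store the finish reason of each crawl in a local sqlite database.
    conn = sqlite3.connect('crawl_stats.db')
    conn.execute('CREATE TABLE IF NOT EXISTS crawl_runs (spider TEXT, finish_reason TEXT)')
    conn.execute('INSERT INTO crawl_runs VALUES (?, ?)', (spider_name, reason))
    conn.commit()
    conn.close()

You would then call send_finish_reason(spider.name, reason) at the top of the overridden close_spider.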
I have the following model:
Command 'collect' (collect_positions.py) -> Celery task (tasks.py) -> Scrapy spider (MySpider) ...
collect_positions.py:
from django.core.management.base import BaseCommand

from tracker.models import Keyword
from tracker.tasks import positions


class Command(BaseCommand):
    help = 'collect_positions'

    def handle(self, *args, **options):
        def chunks(l, n):
            """Yield successive n-sized chunks from l."""
            for i in range(0, len(l), n):
                yield l[i:i + n]

        chunk_size = 1
        keywords = Keyword.objects.filter(product=product).values_list('id', flat=True)
        chunks_list = list(chunks(keywords, chunk_size))
        positions.chunks(chunks_list, 1).apply_async(queue='collect_positions')
        return 0
tasks.py:
from app_name.celery import app
from scrapy.settings import Settings
from scrapy_app import settings as scrapy_settings
from scrapy_app.spiders.my_spider import MySpider
from tracker.models import Keyword
from scrapy.crawler import CrawlerProcess


@app.task
def positions(*args):
    s = Settings()
    s.setmodule(scrapy_settings)
    keywords = Keyword.objects.filter(id__in=list(args))
    process = CrawlerProcess(s)
    process.crawl(MySpider, keywords_chunk=keywords)
    process.start()
    return 1
I run the command through the command line, which creates tasks for parsing. The first queue completes successfully, but the others return an error:
twisted.internet.error.ReactorNotRestartable
Please tell me how I can fix this error.
I can provide any data if there is a need...
UPDATE 1
Thanks for the answer, @Chiefir! I managed to run all the queues, but only the start_requests() function is started, and parse() does not run.
The main functions of the Scrapy spider:
def start_requests(self):
    print('STEP1')
    yield scrapy.Request(
        url='exmaple.com',
        callback=self.parse,
        errback=self.error_callback,
        dont_filter=True
    )

def error_callback(self, failure):
    print(failure)
    # log all errback failures,
    # in case you want to do something special for some errors,
    # you may need the failure's type
    print(repr(failure))

    # if isinstance(failure.value, HttpError):
    if failure.check(HttpError):
        # you can get the response
        response = failure.value.response
        print('HttpError on %s', response.url)

    # elif isinstance(failure.value, DNSLookupError):
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        print('DNSLookupError on %s', request.url)

    # elif isinstance(failure.value, TimeoutError):
    elif failure.check(TimeoutError):
        request = failure.request
        print('TimeoutError on %s', request.url)

def parse(self, response):
    print('STEP2', response)
In the console I get:
STEP1
What could be the reason?
This question is as old as the world:
This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question.
0) pip install crochet
1) from crochet import setup
2) setup() - at the top of the file
3) remove these 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()
I had the same problem with this error, spent 4+ hours solving it, and read all the questions here about it. Finally I found that one, and I'm sharing it. That is how I solved this. The only meaningful lines left from the Scrapy docs are the last 2 lines in my code:
# some more imports
from importlib import import_module

from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from crochet import setup
setup()


def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)  # do some dynamic import of selected spider
    spiderObj = scrapy_var.mySpider()  # get mySpider-object from spider module
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderObj)  # from Scrapy docs
This code allows me to select which spider to run just by passing its name to the run_spider function, and after scraping finishes, to select another spider and run it again.
In your case you need to create a separate function in a separate file that runs your spiders, and call it from your task. Usually I do it this way :)
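For example, a rough sketch of the task side, assuming run_spider lives in its own module (module paths and the spider name are illustrative, based on the code above):

# tasks.py
from app_name.celery import app

from scrapy_app.run_spiders import run_spider  # the separate file containing run_spider


@app.task
def positions(*args):
    run_spider('my_spider')
    return 1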
P.S. And really, there is no way to restart the Twisted reactor.
UPDATE 1
I don't know if you need to call a start_requests() method. For me it usually works just with this code:
class mySpider(scrapy.Spider):
    name = "somname"
    allowed_domains = ["somesite.com"]
    start_urls = ["https://somesite.com"]

    def parse(self, response):
        pass

    def parse_dir_contents(self, response):  # for crawling additional links
        pass
You can fix this by setting the parameter stop_after_crawl to False on the start method of CrawlerProcess:
stop_after_crawl (bool) – stop or not the reactor when all crawlers have finished
@shared_task
def crawl(m_id, *args, **kwargs):
    process = CrawlerProcess(get_project_settings(), install_root_handler=False)
    process.crawl(SpiderClass, m_id=m_id)
    process.start(stop_after_crawl=False)
Aim: Trigger the execution of an XMLFeedSpider by passing the response as an argument (i.e. no need for start_urls).
Example Command:
scrapy crawl spider_name -a response_as_string="<xml><sometag>abc123</sometag></xml>"
Example Spider:
class ExampleXmlSpider(XMLFeedSpider):

    name = "spider_name"
    itertag = 'sometag'

    def parse_node(self, response, node):
        response2 = XmlResponse(url="Some URL", body=self.response_as_string)
        ProcessResponse().get_data(response2)

    def __init__(self, response_as_string=''):
        self.response_as_string = response_as_string
Problem: Terminal complains that there is no start_urls. I can only get the above to work if I include a dummy.xml within start_urls.
E.g.
start_urls = ['file:///home/user/dummy.xml']
Question: Is there any way to have an XMLFeedSpider that is purely driven by a response provided as an argument (as per the original command)? In that case I would need to suppress the XMLFeedSpider's need to seek out a start_url in order to execute a request.
Thanks Paul, you were spot on. Updated example code is below. I stopped referring to the class as an XMLFeedSpider; the Python script was updated to be a plain "object" class that accepts the URL and body as arguments.
from scrapy.http import XmlResponse


class ExampleXmlSpider(object):

    def __init__(self, response_url='', response_body=''):
        self.response_url = response_url
        self.response_body = response_body

    def run(self):
        response = XmlResponse(url=self.response_url, body=self.response_body)
        print response.url
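For illustration, the class can then be driven directly from a small wrapper script; the URL and body below are placeholder values:

spider = ExampleXmlSpider(
    response_url='http://example.com/some.xml',
    response_body='<xml><sometag>abc123</sometag></xml>')
spider.run()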
I want to parse a sitemap, find all the URLs in it, append some word to each URL, and then check the response code of each modified URL.
For this task I decided to use Scrapy, because it has built-in support for crawling sitemaps, as described in Scrapy's documentation.
With the help of this documentation I created my spider, but I want to change the URLs before they are sent for fetching. For this I tried to take help from this link, which suggested using rules and implementing process_requests(), but I am not able to make use of these. I tried a little bit, which I have commented out below. Could anyone help me write the exact code for the commented lines, or suggest any other way to do this task in Scrapy?
from scrapy.contrib.spiders import SitemapSpider


class MySpider(SitemapSpider):

    sitemap_urls = ['http://www.example.com/sitemap.xml']

    # sitemap_rules = [some_rules, process_request='process_request')]

    # def process_request(self, request, spider):
    #     modified_url = orginal_url_from_sitemap + 'myword'
    #     return request.replace(url=modified_url)

    def parse(self, response):
        print response.status, response.url
You can attach a function to the request_scheduled signal and do what you want in that function. For example:
from scrapy import signals


class MySpider(SitemapSpider):

    @classmethod
    def from_crawler(cls, crawler):
        spider = cls()
        crawler.signals.connect(spider.request_scheduled, signals.request_scheduled)
        return spider

    def request_scheduled(self, request, spider):
        modified_url = orginal_url_from_sitemap + 'myword'
        request.url = modified_url
SitemapSpider has a sitemap_filter method.
You can override it to implement the required functionality.
class MySpider(SitemapSpider):
    ...

    def sitemap_filter(self, entries):
        for entry in entries:
            entry["loc"] = entry["loc"] + myword
            yield entry
Each of those entry objects is a dict with a structure like this:
<class 'dict'>:
{'loc': 'https://example.com/',
 'lastmod': '2019-01-04T08:09:23+00:00',
 'changefreq': 'weekly',
 'priority': '0.8'}
Important note: the SitemapSpider.sitemap_filter method appeared in Scrapy 1.6.0, released in January 2019 (see the 1.6.0 release notes, "new extensibility features" section).
I've just faced this. Apparently you can't really use process_requests, because sitemap rules in SitemapSpider are different from the Rule objects in CrawlSpider; only the latter can take this argument.
After examining the code, it looks like this can be avoided by manually overriding part of the SitemapSpider implementation:
class MySpider(SitemapSpider):

    sitemap_urls = ['...']
    sitemap_rules = [('/', 'parse')]

    def start_requests(self):
        # override to call custom_parse_sitemap instead of _parse_sitemap
        for url in self.sitemap_urls:
            yield Request(url, self.custom_parse_sitemap)

    def custom_parse_sitemap(self, response):
        # modify requests marked to be called with parse callback
        for request in super()._parse_sitemap(response):
            if request.callback == self.parse:
                yield self.modify_request(request)
            else:
                yield request

    def modify_request(self, request):
        return request.replace(
            # ...
        )

    def parse(self, response):
        ...
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task


@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider


class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is that the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However, calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function, which in short copies the rules from self.rules into self._rules while making some changes. self._rules is what the spider uses when it checks for rules. Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
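For clarity, here is a minimal sketch of the corrected __init__, based on the WebSpider code shown above; only the position of the super() call changes:

def __init__(self, **kw):
    url = kw.get('domain') or kw.get('url')
    if not (url.startswith('http://') or url.startswith('https://')):
        url = "http://%s/" % url
    self.url = url
    self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
    self.start_urls = [url]
    self.rules = [
        Rule(SgmlLinkExtractor(
            allow_domains=self.allowed_domains,
            unique=True), callback='parse_page', follow=True)
    ]
    # CrawlSpider.__init__ calls _compile_rules, so it must run only
    # after self.rules has been defined:
    super(WebSpider, self).__init__(**kw)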
Update: There is a little mistake in the code from the previously mentioned SO answer. It causes the reactor to hang after the second call. The fix is simple: in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your Django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.