I'm a writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine . I'am using Django, Scrapy, and Celery for achieving this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task
#shared_task
def crawl(domain):
return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider
class WebSpider(CrawlSpider):
name = "web"
def __init__(self, **kw):
super(WebSpider, self).__init__(**kw)
url = kw.get('domain') or kw.get('url')
if not (url.startswith('http://') or url.startswith('https://')):
url = "http://%s/" % url
self.url = url
self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
self.start_urls = [url]
self.rules = [
Rule(SgmlLinkExtractor(
allow_domains=self.allowed_domains,
unique=True), callback='parse_page', follow=True)
]
def parse_start_url(self, response):
return self.parse_page(response)
def parse_page(self, response):
sel = Selector(response)
item = HTMLPageItem()
item['url'] = response.request.url
item['title'] = sel.xpath('//title/text()').extract()[0]
item.save()
return item
The problem is the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function which in short copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules . Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
Update: There is a little mistake in code from the previously mentioned SO answer. It causes the reactor to hang after second call. The fix is simple, in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.
Related
i'm just getting started using scrapy and i'd like to do the following
Have a list of n domains
i=0
loop for i to n
Use a (mostly) generic CrawlSpider to get all links (a href) of domain[i]
Save results as json lines
to do this, the Spider needs to receive the domain it has to crawl as an argument.
I already successfully created the CrawlSpider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess
class MyItem(Item):
#MyItem Fields
class SubsiteSpider(CrawlSpider):
name = "subsites"
start_urls = []
allowed_domains = []
rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)
def __init__(self, starturl, allowed, *args, **kwargs):
print(args)
self.start_urls.append(starturl)
self.allowed_domains.append(allowed)
super().__init__(**kwargs)
def parse_obj(self, response):
item = MyItem()
#fill Item Fields
return item
process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(SubsiteSpider)
process.start()
If i call it with scrapy crawl subsites -a starturl=http://example.com -a allowed=example.com -o output.jl
the result is exactly as i want it, so this part is fine already.
What i fail to do is create multiple instances of SubsiteSpider, each with a different domain as argument.
I tried (in SpiderRunner.py)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('subsites', ['https://example.com', 'example.com'])
process.start()
Variant:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
allowed = ["example.com"]
start = ["https://example.com"]
process.crawl('subsites', start, allowed)
process.start()
But i get an error that occurs, i presume, because the argument is not properly passed to __init__, for example TypeError: __init__() missing 1 required positional argument: 'allowed' or TypeError: __init__() missing 2 required positional arguments: 'starturl' and 'allowed'
(Loop is yet to be implemented)
So, here are my questions:
1) What is the proper way to pass arguments to init, if i do not start crawling via scrapy shell, but from within python code?
2) How can i also pass the -o output.jl argument? (or maybe, use allowed argument as filename?)
3) I am fine with this running each spider after another - would it still be considered best / good practice to do it that way? Could you point to a more extensive tutorial about "running the same spider again and again, with different arguments(=target domains), optionally parallel", if there is one?
Thank you all very much in advance!
If there are any spelling mistakes (not an english native speaker), or if question / details are not precise enough, please tell me how to correct them.
There are a few problems with your code:
start_urls and allowed_domains are class attributes which you modify in __init__(), making them shared across all instances of your class.
What you should do instead is make them instance attributes:
class SubsiteSpider(CrawlSpider):
name = "subsites"
rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)
def __init__(self, starturl, allowed, *args, **kwargs):
self.start_urls = [starturl]
self.allowed_domains = [allowed]
super().__init__(*args, **kwargs)
Those last 3 lines should not be in the file with you spider class, since you probably don't want to run that code each time your spider is imported.
Your calling of CrawlProcess.crawl() is slightly wrong. You can use it like this, passing the arguments in the same manner you'd pass them to the spider class' __init__().
process = CrawlerProcess(get_project_settings())
process.crawl('subsites', 'https://example.com', 'example.com')
process.start()
How can i also pass the -o output.jl argument? (or maybe, use allowed argument as filename?
You can achieve the same effect using custom_settings, giving each instance a different FEED_URI setting.
I have found plenty of information on calling a function when a Scrapy spider quits (viz: Call a function in Settings from spider Scrapy) but I'm looking for how to call a function -- just once -- when the spider opens. Cannot find this in the Scrapy documentation.
I've got a project of multiple spiders that scrape event information and post them to different Google Calendars. The event information is updated often, so before the spider runs, I need to clear out the existing Google Calendar information in order to refresh it entirely. I've got a working function that accomplishes this when passed a calendar ID. Each spider posts to a different Google Calendar, so I need to be able to pass the calendar ID from within the spider to the function that clears the calendar.
I've defined a base spider in init.py that looks like this:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
## import other stuff I need for the clear_calendar() function
class BaseSpider(CrawlSpider):
def clear_calendar(self, CalId):
## working code to clear the calendar
Now I can call that function from within parse_item like:
from myproject import BaseSpider
class ExampleSpider(BaseSpider):
def parse_item(self, response):
calendarID = 'MycalendarID'
self.clear_calendar(MycalendarID)
## other stuff to do
And of course that calls the function every single time an item is scraped, which is ridiculous. But if I move the function call outside of def parse_item, I get the error "self is not defined", or, if I remove "self", "clear_calendar is not defined."
How can I call a function that requires an argument just once from within a Scrapy spider? Or, is there a better way to go about this?
There is totally a better way, with the spider_opened signal.
I think on newer versions of scrapy, there is a spider_opened method ready for you to use inside the spider:
class MySpider(Spider):
...
calendar_id = 'something'
#classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
return spider
def spider_opened(self):
calendar_id = self.calendar_id
# use my calendar_id
Using scrapy I faced a problem of javascript rendered pages. For the site Forum Franchise for example the link http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69, trying to scrap the source html I couldn't retrieve any posts because they seem to be "appended" after the page is being rendered (Probably through javascript).
So i was looking on the net for a solution to this problem, and i came across https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/ .
I am completely new to PYPQ, but was hoping to take a shortcut and copy paste some code.
This worked perfectly for when i tried to scrap a single page. But then when i implemented this in scrapy i get the following error :
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
If i scrap a single page, then no error occurs, but when i set crawler to recursive mode, then right at the second link i get an error that python.exe stopped working and the above error.
I will searching for what this could be, and somewhere i read a QApplication object should only be initiated once.
Could someone please tell me what should be the proper implementation?
The Spider
# -*- coding: utf-8 -*-
import scrapy
import sys, traceback
from bs4 import BeautifulSoup as bs
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from crawler.items import ThreadItem, PostItem
from crawler.utils import utils
class IdeefranchiseSpider(CrawlSpider):
name = "ideefranchise"
allowed_domains = ["idee-franchise.com"]
start_urls = (
'http://www.idee-franchise.com/forum/',
# 'http://www.idee-franchise.com/forum/viewtopic.php?f=3&t=69',
)
rules = [
Rule(LinkExtractor(allow='/forum/'), callback='parse_thread', follow=True)
]
def parse_thread(self, response):
print "Parsing Thread", response.url
thread = ThreadItem()
thread['url'] = response.url
thread['domain'] = self.allowed_domains[0]
thread['title'] = self.get_thread_title(response)
thread['forumname'] = self.get_thread_forum_name(response)
thread['posts'] = self.get_thread_posts(response)
yield thread
# paginate if possible
next_page = response.css('fieldset.display-options > a::attr("href")')
if next_page:
url = response.urljoin(next_page[0].extract())
yield scrapy.Request(url, self.parse_thread)
def get_thread_posts(self, response):
# using PYQTRenderor to reload page. I think this is where the problem
# occurs, when i initiate the PYQTPageRenderor object.
soup = bs(unicode(utils.PYQTPageRenderor(response.url).get_html()))
# sleep so that PYQT can render page
# time.sleep(5)
# comments
posts = []
for item in soup.select("div.post.bg2") + soup.select("div.post.bg1"):
try:
post = PostItem()
post['profile'] = item.select("p.author > strong > a")[0].get_text()
details = item.select('dl.postprofile > dd')
post['date'] = details[2].get_text()
post['content'] = item.select('div.content')[0].get_text()
# appending the comment
posts.append(post)
except:
e = sys.exc_info()[0]
self.logger.critical("ERROR GET_THREAD_POSTS %s", e)
traceback.print_exc(file=sys.stdout)
return posts
The PYPQ implementation
import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
class Render(QWebPage):
def __init__(self, url):
self.app = QApplication(sys.argv)
QWebPage.__init__(self)
self.loadFinished.connect(self._loadFinished)
self.mainFrame().load(QUrl(url))
self.app.exec_()
def _loadFinished(self, result):
self.frame = self.mainFrame()
self.app.quit()
class PYQTPageRenderor(object):
def __init__(self, url):
self.url = url
def get_html(self):
r = Render(self.url)
return unicode(r.frame.toHtml())
The proper implementation, if you want to do it yourself, would be to create a downlader middleware that uses PyQt to process a request. It will be instantiated once by Scrapy.
Should not be that complicated, just
Create QTDownloader class in the middleware.py file of your project
The constructor should create the QApplication object.
The process_request method should do the url loading, and HTML fetching. Note that you return a Response object with the HTML string.
You might do appropriate clean-up in a _cleanup method of your class.
Finally, activate your middleware by adding it to the DOWNLOADER_MIDDLEWARES variable of the settings.py file of your project.
If you don't want to write your own solution, you could use an existing middleware that uses Selenium to do the downloading, like scrapy-webdriver. If you don't want to have a visible browser, you can instruct it to use PhantomJS.
EDIT1:
So the proper way to do it, as pointed out by Rejected is to use a download handler. The idea is similar, but the downloading should happen in a download_request method, and it should be enabled by adding it to DOWNLOAD_HANDLERS. Take a look to the WebdriverDownloadHandler for an example.
Hi I have Python Scrapy installed on my mac and I was trying to follow the very first example on their web.
They were trying to run the command:
scrapy crawl mininova.org -o scraped_data.json -t json
I don't quite understand what does this mean? looks like scrapy turns out to be a separate program. And I don't think they have a command called crawl. In the example, they have a paragraph of code, which is the definition of the class MininovaSpider and the TorrentItem. I don't know where these two classes should go to, go to the same file and what is the name of this python file?
TL;DR: see Self-contained minimum example script to run scrapy.
First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, spiders package etc is a recommended way to keep and handle your web-scraping logic. It provides a modularity, separation of concerns that keeps things organized, clear and testable.
If you are following the official Scrapy tutorial to create a project, you are running web-scraping via a special scrapy command-line tool:
scrapy crawl myspider
But, Scrapy also provides an API to run crawling from a script.
There are several key concepts that should be mentioned:
Settings class - basically a key-value "container" which is initialized with default built-in values
Crawler class - the main class that acts like a glue for all the different components involved in web-scraping with Scrapy
Twisted reactor - since Scrapy is built-in on top of twisted asynchronous networking library - to start a crawler, we need to put it inside the Twisted Reactor, which is in simple words, an event loop:
The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and
dispatches events or messages in a program. It works by calling some
internal or external “event provider”, which generally blocks until an
event has arrived, and then calls the relevant event handler
(“dispatches the event”). The reactor provides basic interfaces to a
number of services, including network communications, threading, and
event dispatching.
Here is a basic and simplified process of running Scrapy from script:
create a Settings instance (or use get_project_settings() to use existing settings):
settings = Settings() # or settings = get_project_settings()
instantiate Crawler with settings instance passed in:
crawler = Crawler(settings)
instantiate a spider (this is what it is all about eventually, right?):
spider = MySpider()
configure signals. This is an important step if you want to have a post-processing logic, collect stats or, at least, to ever finish crawling since the twisted reactor needs to be stopped manually. Scrapy docs suggest to stop the reactor in the spider_closed signal handler:
Note that you will also have to shutdown the Twisted reactor yourself
after the spider is finished. This can be achieved by connecting a
handler to the signals.spider_closed signal.
def callback(spider, reason):
stats = spider.crawler.stats.get_stats()
# stats here is a dictionary of crawling stats that you usually see on the console
# here we need to stop the reactor
reactor.stop()
crawler.signals.connect(callback, signal=signals.spider_closed)
configure and start crawler instance with a spider passed in:
crawler.configure()
crawler.crawl(spider)
crawler.start()
optionally start logging:
log.start()
start the reactor - this would block the script execution:
reactor.run()
Here is an example self-contained script that is using DmozSpider spider and involves item loaders with input and output processors and item pipelines:
import json
from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor
# define an item class
class DmozItem(Item):
title = Field()
link = Field()
desc = Field()
# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
default_input_processor = MapCompose(unicode.strip)
default_output_processor = TakeFirst()
desc_out = Join()
# define a pipeline
class JsonWriterPipeline(object):
def __init__(self):
self.file = open('items.jl', 'wb')
def process_item(self, item, spider):
line = json.dumps(dict(item)) + "\n"
self.file.write(line)
return item
# define a spider
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
for sel in response.xpath('//ul/li'):
loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
loader.add_xpath('title', 'a/text()')
loader.add_xpath('link', 'a/#href')
loader.add_xpath('desc', 'text()')
yield loader.load_item()
# callback fired when the spider is closed
def callback(spider, reason):
stats = spider.crawler.stats.get_stats() # collect/log stats?
# stop the reactor
reactor.stop()
# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
'__main__.JsonWriterPipeline': 100
})
# instantiate a crawler passing in settings
crawler = Crawler(settings)
# instantiate a spider
spider = DmozSpider()
# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)
# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()
# start logging
log.start()
# start the reactor (blocks execution)
reactor.run()
Run it in a usual way:
python runner.py
and observe items exported to items.jl with the help of the pipeline:
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
Gist is available here (feel free to improve):
Self-contained minimum example script to run scrapy
Notes:
If you define settings by instantiating a Settings() object - you'll get all the defaults Scrapy settings. But, if you want to, for example, configure an existing pipeline, or configure a DEPTH_LIMIT or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):
pipelines = {
'mypackage.pipelines.FilterPipeline': 100,
'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
or, use an existing settings.py with all the custom settings preconfigured:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
Other useful links on the subject:
How to run Scrapy from within a Python script
Confused about running Scrapy from within a Python script
scrapy run spider from script
You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.
The tutorial implies that Scrapy is, in fact, a separate program.
Running the command scrapy startproject tutorial will create a folder called tutorial several files already set up for you.
For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial .
tutorial/
scrapy.cfg
tutorial/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
...
The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:
scrapy crawl <website-name> -o <output-file> -t <output-type>
Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:
scrapy runspider my_spider.py
i have multiple spiders in one project , problem is right now i am defining LOG_FILE in SETTINGS like
LOG_FILE = "scrapy_%s.log" % datetime.now()
what i want is scrapy_SPIDERNAME_DATETIME
but i am unable to provide spidername in log_file name ..
i found
scrapy.log.start(logfile=None, loglevel=None, logstdout=None)
and called it in each spider init method but its not working ..
any help would be appreciated
The spider's __init__() is not early enough to call log.start() by itself since the log observer is already started at this point; therefore, you need to reinitialize the logging state to trick Scrapy into (re)starting it.
In your spider class file:
from datetime import datetime
from scrapy import log
from scrapy.spider import BaseSpider
class ExampleSpider(BaseSpider):
name = "example"
allowed_domains = ["example.com"]
start_urls = ["http://www.example.com/"]
def __init__(self, name=None, **kwargs):
LOG_FILE = "scrapy_%s_%s.log" % (self.name, datetime.now())
# remove the current log
# log.log.removeObserver(log.log.theLogPublisher.observers[0])
# re-create the default Twisted observer which Scrapy checks
log.log.defaultObserver = log.log.DefaultObserver()
# start the default observer so it can be stopped
log.log.defaultObserver.start()
# trick Scrapy into thinking logging has not started
log.started = False
# start the new log file observer
log.start(LOG_FILE)
# continue with the normal spider init
super(ExampleSpider, self).__init__(name, **kwargs)
def parse(self, response):
...
And the output file might look like:
scrapy_example_2012-08-25 12:34:48.823896.log
There should be a BOT_NAME in your settings.py. This is the project/spider name. So in your case, this would be
LOG_FILE = "scrapy_%s_%s.log" % (BOT_NAME, datetime.now())
This is pretty much the same that Scrapy does internally
But why not use log.msg. The docs clearly state that this is for spider specific stuff. It might be easier to use this and just extract/grep/... the different spider log messages from a big log file.
A more compicated approach would be to get the location of the spider SPIDER_MODULES list and load all spiders inside these package.
You can use Scrapy's Storage URI parameters in your settings.py file for FEED URI.
%(name)s
%(time)s
For example: /tmp/crawled/%(name)s/%(time)s.log