I'm writing a script that uses CrawlerProcess to run a spider of class MySpider defined in mybot/spiders/myspider.py.
Here's the relevant part of my code:
# in main.py
from scrapy.crawler import CrawlerProcess
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher
items = []
settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl("joe", domain='dictionary.com')
def add_item(item):
    items.append(item)
dispatcher.connect(add_item, signals.item_passed) # adds result from spider to items
process.start()
# in /spiders/myspider.py
from scrapy.spiders import Spider
from mybot.items import MyItem
class MySpider(Spider):
    name = "joe"
    allowed_domains = ["dictionary.com"]
    start_urls = ["http://dictionary.reference.com/"]
    def parse(self, response):
        for sel in response.xpath('//tr[@class="alt"]'):
            new_item = MyItem()
            new_item['name'] = sel.xpath('td/a/text()')[0].extract()
            yield new_item
Now, I want to change the program so that I can pass some other start_url to the spider from main.py. It looks like I can pass the allowed_domains argument to the spider via
process.crawl("joe", domain='dictionary.com')
but I don't know how to generalize that.
I think I have to redefine MySpider's constructor to accept an optional argument, but it doesn't look like the spider is created in main.py
(in fact, the command new_spider = MySpider() returns the error global name 'MySpider' is not defined).
So my question is twofold:
How do I change the spider's constructor?
How do I pass the start_urls to the spider from main.py?
Or is there perhaps a different solution altogether?
I'm trying to make things cleaner by using items and pipelines. For now, my pipeline settings (in settings.py) are:
ITEM_PIPELINES = {
    'scrapybot.pipelines.CsvPipeline': 300,
}
The pipeline itself (as a test) is in pipelines.py, in the same folder as settings.py:
class CsvPipeline(object):
    def process_item(self, item, spider):
        print("it works")
        return item
I launch my two spiders from a single file named core, which is in the same folder as pipelines.py and settings.py:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from spiders.carrefour_bot import CarrefourSpider
from spiders.ebay_bot import EbaySpider
process = CrawlerProcess(get_project_settings())
process.crawl(CarrefourSpider)
process.crawl(EbaySpider)
process.start()
And the items. They are not located in this folder but in the spiders folder:
class ScrapybotItem(scrapy.Item):
    ean = scrapy.Field()
    desc = scrapy.Field()
    price = scrapy.Field()
    company = scrapy.Field()
I import them this way in both spiders:
from spiders.items import ScrapybotItem
My problem is that the pipelines are not loaded. The terminal shows:
INFO: Enabled item pipelines:
[]
It worked at one point, when I tested the pipelines with CsvItemExporter, but for some reason my pipelines are no longer loaded, even in the basic form I posted. Any ideas?
I want to run multiple spiders, so I am trying to use CrawlerProcess. But I find that the open_spider method runs twice: once at the beginning, and once at the end, after process_item has finished.
As a result, when the spider opens I clear my collection and the data is saved to MongoDB, but then the collection is cleared again at the end.
How do I fix this, and why does open_spider run twice?
I type scrapy crawl movies to run the project.
Here is my movies.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import time
# scrapy api imports
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from Tainan.FirstSpider import FirstSpider
class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
    start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']
process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.start()
Here is my FirstSpider.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
    start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']
    def parse(self, response):
        movieHrefs = response.xpath('//*[@class="release_movie_name"]/a/@href').extract()
        for movieHref in movieHrefs:
            yield Request(movieHref, callback=self.parse_page)
    def parse_page(self, response):
        print('FirstSpider => parse_page')
        movieImage = response.xpath('//*[@class="foto"]/img/@src').extract()
        cnName = response.xpath('//*[@class="movie_intro_info_r"]/h1/text()').extract()
        enName = response.xpath('//*[@class="movie_intro_info_r"]/h3/text()').extract()
        movieDate = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[0].extract()
        movieTime = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[1].extract()
        imdbScore = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[3].extract()
        movieContent = response.xpath('//*[@class="gray_infobox_inner"]/span/text()').extract_first().strip()
        yield {'image': movieImage, 'cnName': cnName, 'enName': enName, 'movieDate': movieDate, 'movieTime': movieTime, 'imdbScore': imdbScore, 'movieContent': movieContent}
Here is my pipelines.py:
from pymongo import MongoClient
from scrapy.conf import settings
class MongoDBPipeline(object):
    global open_count
    open_count = 1
    global process_count
    process_count = 1
    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
    # My issue is here: it prints open_spider count = 2 at the end.
    def open_spider(self, spider):
        global open_count
        print('Pipelines => open_spider count =>')
        print(open_count)
        open_count += 1
        self.collection.remove({})
    # open_spider is called the first time, and process_item saves data to my MongoDB.
    # But once process_item has completed, open_spider runs again, which removes the data I have just saved.
    def process_item(self, item, spider):
        global process_count
        print('Pipelines => process_item count =>')
        print(process_count)
        process_count += 1
        self.collection.insert(dict(item))
        return item
I can't figure it out. If someone can help me out, that would be appreciated. Thanks in advance.
How do I fix this, and why does open_spider run twice?
The open_spider method runs once per spider, and you're running two spiders.
I type scrapy crawl movies to run the project.
The crawl command runs the spider named movies (MoviesSpider).
To do this, it has to import the movies module, and importing it executes the CrawlerProcess code at module level, which runs your FirstSpider as well.
Now, how to fix this depends on what you want to do.
Maybe you should only run a single spider, or have separate settings per spider, or maybe something entirely different.
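If you do want to keep the launcher code in the same file, one option is to put it behind an if __name__ == "__main__": guard, so that importing the module no longer starts a second crawl. Below is a minimal sketch of that idea. run_first_spider is a hypothetical placeholder for the three CrawlerProcess lines at the bottom of movies.py, not a Scrapy function.

```python
# Records which crawls were started, so the effect is easy to inspect.
started = []


def run_first_spider():
    # Hypothetical stand-in: in the real module this would create a
    # CrawlerProcess, call process.crawl(FirstSpider) and process.start().
    started.append("first")


# This body runs only when the file is executed directly
# (python movies.py), not when another module imports it.
if __name__ == "__main__":
    run_first_spider()
```

With the guard in place, importing the module only defines the spider class, and nothing is crawled as a side effect of the import.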
I'm just getting started with Scrapy and I'd like to do the following:
Have a list of n domains
For each i from 0 to n-1:
Use a (mostly) generic CrawlSpider to get all links (a href) of domain[i]
Save the results as JSON lines
To do this, the spider needs to receive the domain it has to crawl as an argument.
I already successfully created the CrawlSpider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess
class MyItem(Item):
    # MyItem fields
    pass
class SubsiteSpider(CrawlSpider):
    name = "subsites"
    start_urls = []
    allowed_domains = []
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)
    def __init__(self, starturl, allowed, *args, **kwargs):
        print(args)
        self.start_urls.append(starturl)
        self.allowed_domains.append(allowed)
        super().__init__(**kwargs)
    def parse_obj(self, response):
        item = MyItem()
        # fill item fields
        return item
process = CrawlerProcess({'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'})
process.crawl(SubsiteSpider)
process.start()
If I call it with scrapy crawl subsites -a starturl=http://example.com -a allowed=example.com -o output.jl
the result is exactly what I want, so this part already works.
What I fail to do is create multiple instances of SubsiteSpider, each with a different domain as its argument.
I tried (in SpiderRunner.py):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('subsites', ['https://example.com', 'example.com'])
process.start()
Variant:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
allowed = ["example.com"]
start = ["https://example.com"]
process.crawl('subsites', start, allowed)
process.start()
But I get an error, I presume because the arguments are not properly passed to __init__, for example TypeError: __init__() missing 1 required positional argument: 'allowed' or TypeError: __init__() missing 2 required positional arguments: 'starturl' and 'allowed'
(The loop is yet to be implemented.)
So, here are my questions:
1) What is the proper way to pass arguments to __init__ if I do not start the crawl from the scrapy command line, but from within Python code?
2) How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)
3) I am fine with running the spiders one after another. Would that still be considered good practice? Could you point me to a more extensive tutorial about "running the same spider again and again, with different arguments (= target domains), optionally in parallel", if there is one?
Thank you all very much in advance!
If there are any spelling mistakes (I'm not a native English speaker), or if the question or its details are not precise enough, please tell me how to correct them.
There are a few problems with your code:
start_urls and allowed_domains are class attributes which you modify in __init__(), making them shared across all instances of your class.
What you should do instead is make them instance attributes:
class SubsiteSpider(CrawlSpider):
    name = "subsites"
    rules = (Rule(LinkExtractor(), callback='parse_obj', follow=True),)
    def __init__(self, starturl, allowed, *args, **kwargs):
        self.start_urls = [starturl]
        self.allowed_domains = [allowed]
        super().__init__(*args, **kwargs)
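To see why the class-level lists are a problem, here is a small sketch in plain Python with no Scrapy involved. SharedLists and OwnLists are made-up names that mimic the two versions of the constructor.

```python
class SharedLists:
    # Class attribute: one list object shared by every instance.
    start_urls = []

    def __init__(self, starturl):
        self.start_urls.append(starturl)  # mutates the shared list


class OwnLists:
    def __init__(self, starturl):
        # Instance attribute: each object gets its own list.
        self.start_urls = [starturl]


a = SharedLists("https://example.com")
b = SharedLists("https://example.org")
shared = b.start_urls  # holds both URLs, because a and b share one list

c = OwnLists("https://example.com")
d = OwnLists("https://example.org")
own = d.start_urls     # holds only the URL passed to d
```

With the shared version, the second spider instance would also start from the first instance's URL, which is why assigning a fresh list in __init__() is the safer choice.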
The last three lines of your spider file (the ones that create and start the CrawlerProcess) should not be in the same file as your spider class, since you probably don't want to run that code each time your spider is imported.
Your call to CrawlerProcess.crawl() is slightly wrong. You can use it like this, passing the arguments in the same manner you'd pass them to the spider class's __init__().
process = CrawlerProcess(get_project_settings())
process.crawl('subsites', 'https://example.com', 'example.com')
process.start()
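The TypeError from the question, and the loop over domains, can be illustrated without Scrapy. This is only a sketch: FakeSpider and crawl are hypothetical stand-ins, and crawl simply forwards its extra arguments to the constructor, which is the behaviour described above for process.crawl().

```python
class FakeSpider:
    # Stand-in for SubsiteSpider; only the constructor signature matters here.
    def __init__(self, starturl, allowed):
        self.start_urls = [starturl]
        self.allowed_domains = [allowed]


def crawl(spider_cls, *args, **kwargs):
    # Hypothetical stand-in: forwards the extra arguments unchanged
    # to the spider's constructor.
    return spider_cls(*args, **kwargs)


# A list counts as ONE positional argument, so 'allowed' is missing.
try:
    crawl(FakeSpider, ["https://example.com", "example.com"])
    error = None
except TypeError as exc:
    error = str(exc)

# Separate arguments (here as keywords) match the constructor,
# and a loop creates one spider per domain.
domains = ["example.com", "example.org"]
spiders = [crawl(FakeSpider, starturl="https://" + d, allowed=d) for d in domains]
```

In a real runner script, the equivalent would presumably be one process.crawl('subsites', starturl=..., allowed=...) call per domain inside the loop, followed by a single process.start() after the loop.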
How can I also pass the -o output.jl argument? (Or maybe use the allowed argument as the filename?)
You can achieve the same effect using custom_settings, giving each instance a different FEED_URI setting.
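Since custom_settings is defined on the spider class rather than on an instance, one way to get a different output file per domain is to build a small subclass per domain. The sketch below shows only that idea, under assumptions: feed_settings_for and subclass_for are hypothetical helpers, Base is a stand-in for SubsiteSpider so the code runs without Scrapy, and the file naming scheme is made up.

```python
def feed_settings_for(domain):
    # FEED_URI is the setting named in the answer; FEED_FORMAT is its
    # companion setting. Newer Scrapy releases replace both with FEEDS.
    return {
        "FEED_URI": domain.replace(".", "_") + ".jl",
        "FEED_FORMAT": "jsonlines",
    }


def subclass_for(base, domain):
    # Hypothetical helper: builds a per-domain subclass whose
    # custom_settings carry that domain's output file.
    name = "Subsite_" + domain.replace(".", "_")
    return type(name, (base,), {"custom_settings": feed_settings_for(domain)})


class Base:
    # Stand-in for SubsiteSpider so the sketch runs without Scrapy.
    custom_settings = None


ExampleCom = subclass_for(Base, "example.com")
```

The generated class could then be passed to process.crawl() in place of the spider name, together with the starturl and allowed arguments.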
Is it possible to override Scrapy settings after the init function of a spider?
For example if I want to get settings from db and I pass my query parameters as arguments from the cmdline.
def __init__(self, spider_id, **kwargs):
    self.spider_id = spider_id
    self.set_params(spider_id)
    super(Base_Crawler, self).__init__(**kwargs)
def set_params(self, spider_id):
    # TODO:
    # make a query in the db
    # set variables from the query result
    # override settings
    pass
Technically you can "override" settings after the spider has been initialized, but it would have no effect, because most of them have already been applied by then.
What you can do instead is pass parameters to the spider as command-line options using -a, and override project settings using -s. For example:
Spider:
class TheSpider(scrapy.Spider):
    name = 'thespider'
    def __init__(self, *args, **kwargs):
        self.spider_id = kwargs.pop('spider_id', None)
        super(TheSpider, self).__init__(*args, **kwargs)
CLI:
scrapy crawl thespider -a spider_id=XXX -s SETTING_TO_OVERRIDE=YYY
If you need something more advanced, consider writing a custom runner that wraps your spider. Below is an example from the docs:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start() # the script will block here until the crawling is finished
Just replace get_project_settings with your own routine that returns a Settings instance.
In any case, avoid overloading the spider's code with non-scraping logic, to keep it clean and reusable.
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task
@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider
class WebSpider(CrawlSpider):
    name = "web"
    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        # Remove a leading "www." prefix; lstrip('www.') would instead strip
        # any leading 'w' and '.' characters (e.g. "web.com" -> "eb.com").
        hostname = urlparse(url).hostname
        if hostname.startswith('www.'):
            hostname = hostname[len('www.'):]
        self.allowed_domains = [hostname]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]
    def parse_start_url(self, response):
        return self.parse_page(response)
    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is that the crawler only crawls the start_urls and does not follow links (or call the callback function) when run through Celery in this scenario. However, calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So after inspecting Scrapy's code and enabling Celery logging, by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function, which in short copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules. Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
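The ordering problem can be reproduced in a few lines of plain Python. This is only an illustration: Base is a made-up stand-in whose constructor takes a snapshot of self.rules, loosely mirroring what is described above for CrawlSpider and _compile_rules.

```python
class Base:
    # Stand-in for CrawlSpider: the constructor copies self.rules
    # into self._rules at the moment it runs.
    rules = ()

    def __init__(self):
        self._rules = list(self.rules)


class SuperFirst(Base):
    def __init__(self):
        super().__init__()              # snapshot taken while rules is empty
        self.rules = ["follow-links"]


class SuperLast(Base):
    def __init__(self):
        self.rules = ["follow-links"]
        super().__init__()              # snapshot now sees the rule


broken = SuperFirst()._rules  # empty: the rule was set too late
fixed = SuperLast()._rules    # contains the rule
```

SuperFirst corresponds to the original WebSpider, and SuperLast to the version with the super call moved to the end of __init__.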
Update: There is a small mistake in the code from the previously mentioned SO answer. It causes the reactor to hang after the second call. The fix is simple: in WebCrawlerScript's __init__ method, move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.
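If you prefer not to repeat these two steps by hand, they could be wrapped in a small helper. This is a sketch, not part of Scrapy or Django: configure_scrapy is a hypothetical function, and the directory and module name you pass in are placeholders for your own layout.

```python
import os
import sys


def configure_scrapy(project_dir, settings_module):
    # project_dir and settings_module are placeholders for your own layout.
    project_dir = os.path.abspath(project_dir)
    if project_dir not in sys.path:
        sys.path.append(project_dir)  # avoid duplicate entries on reload
    # setdefault keeps a value that was already set, e.g. in the shell.
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", settings_module)
    return project_dir
```

It would be called once from the Django settings module, for example configure_scrapy('/absolute/path/to/scrapy/project', 'myapp.crawler.crawler.settings').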