Pipeline doesn't open in Scrapy - Python

I'm trying to make things cleaner by using items and pipelines. For now my pipeline settings (in settings.py) are:
ITEM_PIPELINES = {
    'scrapybot.pipelines.CsvPipeline': 300,
}
The pipeline's content (as a test), in pipelines.py, in the same folder as settings.py:
class CsvPipeline(object):
    def process_item(self, item, spider):
        print("it works")
        return item
I launch my two spiders from a single file named core.py, which is in the same folder as pipelines.py and settings.py:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from spiders.carrefour_bot import CarrefourSpider
from spiders.ebay_bot import EbaySpider
process = CrawlerProcess(get_project_settings())
process.crawl(CarrefourSpider)
process.crawl(EbaySpider)
process.start()
And the items. They are not located in this folder but in the spiders folder:
import scrapy

class ScrapybotItem(scrapy.Item):
    ean = scrapy.Field()
    desc = scrapy.Field()
    price = scrapy.Field()
    company = scrapy.Field()
I import the item this way in both spiders:
from spiders.items import ScrapybotItem
My problem is that the pipeline doesn't open. In the terminal I get:
INFO: Enabled item pipelines:
[]
It worked at one point, when I tested the pipeline with CsvItemExporter, but for some reason the pipeline no longer opens, even in the basic form posted above. Any idea?
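One thing worth checking, as a diagnostic sketch rather than a confirmed fix: get_project_settings() only picks up settings.py when the SCRAPY_SETTINGS_MODULE environment variable is set or when a scrapy.cfg can be found for the project, so an empty "Enabled item pipelines" list often means the project settings were never loaded. Printing the loaded value before starting the crawl makes this visible (the module path 'scrapybot.settings' below is an assumption based on the pipeline path in ITEM_PIPELINES; adjust it to your layout):

# core.py (diagnostic sketch): check that the project settings are actually loaded
import os

# assumption: the settings module is importable as 'scrapybot.settings'
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapybot.settings')

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# expected: {'scrapybot.pipelines.CsvPipeline': 300}; an empty dict means settings.py was not found
print(settings.getdict('ITEM_PIPELINES'))

process = CrawlerProcess(settings)
# process.crawl(...) and process.start() as in core.py above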

Related

open_spider method runs two times when using CrawlerProcess

I want to run multiple spiders, so I tried to use CrawlerProcess. But I find that the open_spider method runs two times, at the beginning and at the end, together with the process_item method.
The result is that when the spider opens I drop my collection and then save the data into MongoDB; once that has completed, my collection gets dropped again at the end.
How do I fix this issue, and why does the open_spider method run two times?
I type scrapy crawl movies to run the project:
Here is my movies.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
import time
# scrapy api imports
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from Tainan.FirstSpider import FirstSpider


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
    start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']


process = CrawlerProcess(get_project_settings())
process.crawl(FirstSpider)
process.start()
Here is my FirstSpider.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
    start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']

    def parse(self, response):
        movieHrefs = response.xpath('//*[@class="release_movie_name"]/a/@href').extract()
        for movieHref in movieHrefs:
            yield Request(movieHref, callback=self.parse_page)

    def parse_page(self, response):
        print('FirstSpider => parse_page')
        movieImage = response.xpath('//*[@class="foto"]/img/@src').extract()
        cnName = response.xpath('//*[@class="movie_intro_info_r"]/h1/text()').extract()
        enName = response.xpath('//*[@class="movie_intro_info_r"]/h3/text()').extract()
        movieDate = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[0].extract()
        movieTime = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[1].extract()
        imdbScore = response.xpath('//*[@class="movie_intro_info_r"]/span/text()')[3].extract()
        movieContent = response.xpath('//*[@class="gray_infobox_inner"]/span/text()').extract_first().strip()
        yield {'image': movieImage, 'cnName': cnName, 'enName': enName, 'movieDate': movieDate,
               'movieTime': movieTime, 'imdbScore': imdbScore, 'movieContent': movieContent}
Here is my pipelines.py:
from pymongo import MongoClient
from scrapy.conf import settings


class MongoDBPipeline(object):

    global open_count
    open_count = 1
    global process_count
    process_count = 1

    def __init__(self):
        connection = MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    # My issue is here: it ends up printing open_spider count = 2.
    def open_spider(self, spider):
        global open_count
        print('Pipelines => open_spider count =>')
        print(open_count)
        open_count += 1
        self.collection.remove({})

    # open_spider is called a first time and process_item saves the data to MongoDB,
    # but when process_item has finished, open_spider runs again... which removes the data I just saved.
    def process_item(self, item, spider):
        global process_count
        print('Pipelines => process_item count =>')
        print(process_count)
        process_count += 1
        self.collection.insert(dict(item))
        return item
I can't figure it out; if someone can help me out, that would be appreciated. Thanks in advance.
How do I fix the issue, and why does the open_spider method run two times?
The open_spider method runs once per spider, and you're running two spiders.
I type scrapy crawl movies to run the project
The crawl command will run the spider named movies (MoviesSpider).
To do this, it has to import the movies module, which will cause it to run your FirstSpider as well.
Now, how to fix this depends on what you want to do.
Maybe you should only run a single spider, or have separate settings per spider, or maybe something entirely different.
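For instance, if the goal is only to stop scrapy crawl movies from also launching FirstSpider, one option (a sketch, not the only possible fix) is to guard the CrawlerProcess code in movies.py so that it does not run at import time:

# movies.py (sketch): the CrawlerProcess block only runs when the file is executed directly,
# so `scrapy crawl movies` can import this module without also launching FirstSpider
import scrapy


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['tw.movies.yahoo.com', 'movies.yahoo.com.tw']
    start_urls = ['http://tw.movies.yahoo.com/movie_thisweek.html/']


if __name__ == '__main__':
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from Tainan.FirstSpider import FirstSpider

    process = CrawlerProcess(get_project_settings())
    process.crawl(FirstSpider)
    process.start()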

Scrapy: Pass arguments to spider that is defined in /spiders directory

I'm writing a script that uses CrawlerProcess to run a spider of class MySpider defined in mybot/spiders/myspider.py.
Here's the relevant part of my code:
# in main.py
from scrapy.crawler import CrawlerProcess
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy.xlib.pydispatch import dispatcher

items = []
settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl("joe", domain='dictionary.com')


def add_item(item):
    items.append(item)


dispatcher.connect(add_item, signals.item_passed)  # adds result from spider to items

process.start()
# in /spiders/myspider.py
from scrapy.spiders import Spider
from mybot.items import MyItem


class MySpider(Spider):
    name = "joe"
    allowed_domains = ["dictionary.com"]
    start_urls = ["http://dictionary.reference.com/"]

    def parse(self, response):
        for sel in response.xpath('//tr[@class="alt"]'):
            new_item = MyItem()
            new_item['name'] = sel.xpath('td/a/text()')[0].extract()
            yield new_item
Now, I want to change the program so that I can pass some other start_url to the spider from main.py. It looks like I can pass the allowed_domains argument to the spider via
process.crawl("joe", domain='dictionary.com')
but I don't know how to generalize that.
I think I have to redefine MySpider's constructor to accept an optional argument, but it doesn't look like the spider is created in main.py
(and in fact, the statement new_spider = MySpider() raises the error global name 'MySpider' is not defined).
So my question is twofold:
How do I change the spider's constructor?
How do I pass the start_urls to the spider from main.py?
Or is there perhaps a different solution altogether?
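One common pattern, shown here as a hedged sketch rather than code from the original post: keyword arguments given to process.crawl() are forwarded to the spider's constructor, so the spider can accept an optional start_url and build start_urls from it.

# in /spiders/myspider.py (sketch): accept an optional start_url keyword argument
from scrapy.spiders import Spider
from mybot.items import MyItem


class MySpider(Spider):
    name = "joe"
    allowed_domains = ["dictionary.com"]

    def __init__(self, start_url="http://dictionary.reference.com/", *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # build start_urls from the argument instead of hard-coding it
        self.start_urls = [start_url]

    def parse(self, response):
        for sel in response.xpath('//tr[@class="alt"]'):
            new_item = MyItem()
            new_item['name'] = sel.xpath('td/a/text()')[0].extract()
            yield new_item

From main.py the call then becomes process.crawl("joe", start_url='http://dictionary.reference.com/'); the keyword arguments reach the spider's __init__, which is also how the domain='dictionary.com' argument above ends up on the spider.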

Scrapy spider not following links when using Celery

I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:
I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:
crawl.delay(domain)
The task itself just calls a function that starts the crawling process:
from .crawler.crawl import run_spider
from celery import shared_task


@shared_task
def crawl(domain):
    return run_spider(domain)
run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
WebSpider inherits from CrawlSpider and I'm using it now just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function parse_page which simply extracts the response url and the page title, populates a new DjangoItem (HTMLPageItem) with them and saves it into the database (not so efficient, I know).
from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider


class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        item['title'] = sel.xpath('//title/text()').extract()[0]
        item.save()
        return item
The problem is that the crawler only crawls the start_urls and does not follow links (or call the callback function) when following this scenario and using Celery. However, calling run_spider through python manage.py shell works just fine!
Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.
So, after inspecting Scrapy's code and enabling Celery logging by inserting these two lines in web_spider.py:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
I was able to locate the problem:
In the initialization function of WebSpider:
super(WebSpider, self).__init__(**kw)
The __init__ function of the parent CrawlSpider calls the _compile_rules function, which in short copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider uses when it checks for rules. Calling the initialization function of CrawlSpider before defining the rules led to an empty self._rules, hence no links were followed.
Moving the super(WebSpider, self).__init__(**kw) line to the last line of WebSpider's __init__ fixed the problem.
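Concretely, the fixed constructor looks something like this (a sketch condensed from the code above; only the position of the super() call changes):

from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule, CrawlSpider


class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        self.allowed_domains = [urlparse(url).hostname.lstrip('www.')]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]
        # call the parent constructor last, so that _compile_rules() sees self.rules
        super(WebSpider, self).__init__(**kw)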
Update: There is a little mistake in the code from the previously mentioned SO answer. It causes the reactor to hang after a second call. The fix is simple: in WebCrawlerScript's __init__ method, simply move this line:
self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
out of the if statement, as suggested in the comments there.
Update 2: I finally got pipelines to work! It was not a Celery problem. I realized that the settings module wasn't being read. It was simply an import problem. To fix it:
Set the environment variable SCRAPY_SETTINGS_MODULE in your Django project's settings module myproject/settings.py:
import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'
In your Scrapy settings module crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file would work:
import sys
sys.path.append('/absolute/path/to/scrapy/project')
Change the paths to suit your case.

Scrapy Very Basic Example

Hi, I have Python Scrapy installed on my Mac and I was trying to follow the very first example on their website.
They were trying to run the command:
scrapy crawl mininova.org -o scraped_data.json -t json
I don't quite understand what this means. It looks like Scrapy is a separate program, and I don't think my system has a command called crawl. In the example, they have a block of code, which defines the classes MininovaSpider and TorrentItem. I don't know where these two classes should go. Do they go in the same file, and what is the name of this Python file?
TL;DR: see Self-contained minimum example script to run scrapy.
First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, a spiders package, etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and separation of concerns that keep things organized, clear and testable.
If you are following the official Scrapy tutorial to create a project, you are running web-scraping via a special scrapy command-line tool:
scrapy crawl myspider
But, Scrapy also provides an API to run crawling from a script.
There are several key concepts that should be mentioned:
Settings class - basically a key-value "container" which is initialized with default built-in values
Crawler class - the main class that acts like a glue for all the different components involved in web-scraping with Scrapy
Twisted reactor - since Scrapy is built on top of the Twisted asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple words, an event loop:
The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external “event provider”, which generally blocks until an event has arrived, and then calls the relevant event handler (“dispatches the event”). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.
Here is a basic and simplified process of running Scrapy from script:
create a Settings instance (or use get_project_settings() to use existing settings):
settings = Settings() # or settings = get_project_settings()
instantiate Crawler with settings instance passed in:
crawler = Crawler(settings)
instantiate a spider (this is what it is all about eventually, right?):
spider = MySpider()
configure signals. This is an important step if you want to have post-processing logic, collect stats or, at the very least, ever finish crawling, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:
Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console

    # here we need to stop the reactor
    reactor.stop()
crawler.signals.connect(callback, signal=signals.spider_closed)
configure and start crawler instance with a spider passed in:
crawler.configure()
crawler.crawl(spider)
crawler.start()
optionally start logging:
log.start()
start the reactor - this would block the script execution:
reactor.run()
Here is an example self-contained script that is using DmozSpider spider and involves item loaders with input and output processors and item pipelines:
import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
Run it in a usual way:
python runner.py
and observe items exported to items.jl with the help of the pipeline:
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
Gist is available here (feel free to improve):
Self-contained minimum example script to run scrapy
Notes:
If you define settings by instantiating a Settings() object, you'll get all the default Scrapy settings. But if you want to, for example, configure an existing pipeline, set a DEPTH_LIMIT or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):
pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
or, use an existing settings.py with all the custom settings preconfigured:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
Other useful links on the subject:
How to run Scrapy from within a Python script
Confused about running Scrapy from within a Python script
scrapy run spider from script
You may have better luck looking through the tutorial first, as opposed to the "Scrapy at a glance" webpage.
The tutorial implies that Scrapy is, in fact, a separate program.
Running the command scrapy startproject tutorial will create a folder called tutorial with several files already set up for you.
For example, in my case, the modules/packages items, pipelines, settings and spiders have been added to the root package tutorial.
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
The TorrentItem class would be placed inside items.py, and the MininovaSpider class would go inside the spiders folder.
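As a rough sketch (the class bodies are abbreviated placeholders, not the exact code from the Scrapy example), the layout would look like this:

# tutorial/items.py
import scrapy


class TorrentItem(scrapy.Item):
    # field names are illustrative; use whatever the example defines
    url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


# tutorial/spiders/mininova_spider.py
import scrapy


class MininovaSpider(scrapy.Spider):
    name = 'mininova.org'  # this is the name used on the command line
    start_urls = ['http://www.mininova.org/yesterday']

    def parse(self, response):
        # populate and yield TorrentItem instances here, as in the example's parse callback
        ...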
Once the project is set up, the command-line parameters for Scrapy appear to be fairly straightforward. They take the form:
scrapy crawl <website-name> -o <output-file> -t <output-type>
Alternatively, if you want to run scrapy without the overhead of creating a project directory, you can use the runspider command:
scrapy runspider my_spider.py
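For example, a single self-contained file like the sketch below (the spider name and URL are placeholders) can be run without any project scaffolding:

# my_spider.py (sketch), run with: scrapy runspider my_spider.py -o output.json
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # yield plain dicts; the -o option on the command line exports them
        yield {'title': response.xpath('//title/text()').extract_first()}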

Running scrapy from inside Python script - CSV exporter doesn't work

My scraper works fine when I run it from the command line, but when I try to run it from within a Python script (with the method outlined here, using Twisted) it does not output the two CSV files that it normally does. I have a pipeline that creates and populates these files, one of them using CsvItemExporter() and the other using writeCsvFile(). Here is the code:
class CsvExportPipeline(object):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        nodes = open('%s_nodes.csv' % spider.name, 'w+b')
        self.files[spider] = nodes
        self.exporter1 = CsvItemExporter(nodes, fields_to_export=['url', 'name', 'screenshot'])
        self.exporter1.start_exporting()
        self.edges = []
        self.edges.append(['Source', 'Target', 'Type', 'ID', 'Label', 'Weight'])
        self.num = 1

    def spider_closed(self, spider):
        self.exporter1.finish_exporting()
        file = self.files.pop(spider)
        file.close()
        writeCsvFile(getcwd() + r'\edges.csv', self.edges)

    def process_item(self, item, spider):
        self.exporter1.export_item(item)
        for url in item['links']:
            self.edges.append([item['url'], url, 'Directed', self.num, '', 1])
            self.num += 1
        return item
Here is my file structure:
SiteCrawler/          # the CSVs are normally created in this folder
    runspider.py      # this is the script that runs the scraper
    scrapy.cfg
    SiteCrawler/
        __init__.py
        items.py
        pipelines.py
        screenshooter.py
        settings.py
        spiders/
            __init__.py
            myfuncs.py
            sitecrawler_spider.py
The scraper appears to function normally in all other ways. The output at the end in the command line suggests that the expected number of pages were crawled and the spider appears to have finished normally. I am not getting any error messages.
---- EDIT : ----
Inserting print statements and syntax errors into the pipeline has no effect, so it appears that the pipeline is being ignored. Why might this be?
Here is the code for the script that runs the scraper (runspider.py):
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
import logging

from SiteCrawler.spiders.sitecrawler_spider import MySpider


def stop_reactor():
    reactor.stop()


dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
Replacing "from scrapy.settings import Settings" with "from scrapy.utils.project import get_project_settings as Settings" fixed the problem.
The solution was found here. No explanation of the solution was provided.
alecxe has provided an example of how to run Scrapy from inside a Python script.
EDIT:
Having read through alecxe's post in more detail, I can now see the difference between "from scrapy.settings import Settings" and "from scrapy.utils.project import get_project_settings as Settings". The latter allows you to use your project's settings file, as opposed to a default settings file. Read alecxe's post (linked above) for more detail.
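Applied to the runspider.py above, the fix is essentially a one-import change (a sketch; the rest of the script stays as posted):

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings  # instead of scrapy.settings.Settings
from scrapy.xlib.pydispatch import dispatcher
import logging

from SiteCrawler.spiders.sitecrawler_spider import MySpider


def stop_reactor():
    reactor.stop()


dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(get_project_settings())  # the project's settings.py, including its pipelines, is now used
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
reactor.run()  # blocks until the spider is closed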
In my project I call the Scrapy code inside another Python script using os.system:
import os

os.chdir('/home/admin/source/scrapy_test')
command = "scrapy crawl test_spider -s FEED_URI='file:///home/admin/scrapy/data.csv' -s LOG_FILE='/home/admin/scrapy/scrapy_test.log'"
return_code = os.system(command)
print('done')
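If you prefer not to go through the shell, a roughly equivalent call with subprocess (a sketch using the same hypothetical paths as above) would be:

import subprocess

# run the same crawl without invoking a shell; arguments are passed as a list
return_code = subprocess.call(
    [
        'scrapy', 'crawl', 'test_spider',
        '-s', 'FEED_URI=file:///home/admin/scrapy/data.csv',
        '-s', 'LOG_FILE=/home/admin/scrapy/scrapy_test.log',
    ],
    cwd='/home/admin/source/scrapy_test',
)
print('done')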
