Scrapy - running spider from a python script - python

I am trying to run Scrapy from a Python script according to the documentation at http://scrapy.readthedocs.io/en/0.16/topics/practices.html
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log

def CrawlTest():
    spider = PitchforkSpider(domain='"pitchfork.com"')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here
but when I run it, I get the following error:
AttributeError: 'Settings' object has no attribute 'update_settings'
Has something been deprecated? What is wrong here?
My version is Scrapy 1.1.2.

You are looking at Scrapy 0.16 docs but using Scrapy 1.1.2.
Here is the correct documentation page.
FYI, you should now be using CrawlerProcess or CrawlerRunner.
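For example, a minimal sketch with CrawlerProcess on Scrapy 1.1 (the import path for PitchforkSpider is an assumption; point it at wherever the spider class actually lives in your project):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# hypothetical import path; adjust to your own project layout
from myproject.spiders.pitchfork import PitchforkSpider

process = CrawlerProcess(get_project_settings())
process.crawl(PitchforkSpider, domain='pitchfork.com')  # keyword args are passed to the spider's constructor
process.start()  # blocks here until the crawl is finished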

Related

Scrapy does not activate my pipelines when starting the spider from a python script

I am having basically the same problem as here, but the answer does not solve my problem. I am using get_project_settings() in my main.py script.
process = CrawlerProcess(settings=get_project_settings())
process.crawl(MySpider)
process.start()
The main.py script is located at the root of the project:
MyProject/
    main.py
    ...
    src/
        scraper/
            myScrapyProject/
                myScrapyProject/
                    pipelines.py
                    settings.py
                    ...
                    spiders/
                        mySpider.py
Starting the spider with scrapy crawl MySpider yields correct results and uses the pipeline.
The settings.py has ITEM_PIPELINES configured:
ITEM_PIPELINES = {
    'myScrapyProject.pipelines.PersistencePipeline': 300,
}
Starting the spider from main.py runs the spider normally, but does not use the pipeline. Also, I don't see the pipeline configured in the logs when Scrapy starts:
[INFO] scrapy.middleware: Enabled item pipelines:
[]
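A common cause worth checking (not a confirmed diagnosis): because main.py sits several directories above scrapy.cfg and settings.py, get_project_settings() may fail to locate the project and return near-empty settings, which would explain the empty "Enabled item pipelines" list. A minimal sketch of one workaround, assuming the layout above, is to point Scrapy at the settings module explicitly before building the process:
import os
import sys
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Assumption: main.py lives at MyProject/ and the settings module is
# myScrapyProject.settings; adjust both paths if your layout differs.
project_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                           'src', 'scraper', 'myScrapyProject')
sys.path.insert(0, project_dir)
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'myScrapyProject.settings')

process = CrawlerProcess(settings=get_project_settings())
process.crawl(MySpider)  # MySpider imported exactly as in the snippet above
process.start()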

scrapy module path difference when running via script or terminal

I have a Scrapy spider that imports a dictionary from another module. My main contentspider.py, which contains the spider, also contains the import statement from spider_project.spider_project.updated_kw import translated_kw_dicts.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from spider_project.spider_project.updated_kw import translated_kw_dicts

class ContentSpider(CrawlSpider):
    name = 'content_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(translated_kw_dicts)
This import statement works fine when the spider runs as a script, but when I run it via conda I get an error:
from spider_project.spider_project.updated_kw import translated_kw_dicts
ModuleNotFoundError: No module named 'spider_project.spider_project'
From conda I am running the spider as it should be run - from the directory containing the .cfg file: c:/.../.../spider_project
If I change the import statement to from spider_project.updated_kw import translated_kw_dicts (notice I'm taking out the first spider_project directory), then via conda the spider runs fine, but in my script I get the error Cannot find reference 'updated_kw' in 'imported module spider_project'.
Can someone advise why this is happening?
Here is my project structure:
When you click on the long line up in the file explorer, it should become active. You can copy it and use the file-reading technique.
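For what it's worth, the discrepancy is usually a sys.path difference: an IDE/script run typically puts the folder above the outer spider_project on the path (so spider_project.spider_project.updated_kw resolves), while running from the directory containing scrapy.cfg makes the inner package importable as plain spider_project. A hedged sketch of one workaround, assuming updated_kw.py sits next to the spiders/ folder in the inner spider_project package, is to pin that directory onto sys.path inside contentspider.py before importing:
import os
import sys

# Assumption: this file lives in .../spider_project/spider_project/spiders/
# and updated_kw.py lives one level up, in .../spider_project/spider_project/
package_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if package_dir not in sys.path:
    sys.path.insert(0, package_dir)

from updated_kw import translated_kw_dicts  # now resolves from both the IDE and the terminal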

How to set default settings for running scrapy as a python script?

I want to run Scrapy as a Python script, but I cannot figure out how to set the settings correctly or how to provide them. I'm not sure whether it's a settings issue, but I assume so.
My config:
Python 2.7 x86 (as virtual environment)
Scrapy 1.2.1
Win 7 x64
I took the advice from https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script to get it running. I have some issues with the following advice:
If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.
So what is meant with "inside a Scrapy project"? Of course I have to import the libraries and have the dependencies installed, but I want to avoid starting the crawling process with scrapy crawl xyz.
Here's the code of myScrapy.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import os, argparse

# Initialization of directories
projectDir = os.path.dirname(os.path.realpath('__file__'))
generalOutputDir = os.path.join(projectDir, 'output')

parser = argparse.ArgumentParser()
parser.add_argument("url", help="The url which you want to scan", type=str)
args = parser.parse_args()
urlToScan = args.url

# Stripping of given URL to get only the host + TLD
if "https" in urlToScan:
    urlToScanNoProt = urlToScan.replace("https://", "")
    print "used protocol: https"
if "http" in urlToScan:
    urlToScanNoProt = urlToScan.replace("http://", "")
    print "used protocol: http"

class myItem(Item):
    url = Field()

class mySpider(CrawlSpider):
    name = "linkspider"
    allowed_domains = [urlToScanNoProt]
    start_urls = [urlToScan,]
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True), )

    def generateDirs(self):
        if not os.path.exists(generalOutputDir):
            os.makedirs(generalOutputDir)
        specificOutputDir = os.path.join(generalOutputDir, urlToScanNoProt)
        if not os.path.exists(specificOutputDir):
            os.makedirs(specificOutputDir)
        return specificOutputDir

    def parse_url(self, response):
        for link in LinkExtractor().extract_links(response):
            item = myItem()
            item['url'] = response.url
        specificOutputDir = self.generateDirs()
        filename = os.path.join(specificOutputDir, response.url.split("/")[-2] + ".html")
        with open(filename, "wb") as f:
            f.write(response.body)
        return CrawlSpider.parse(self, response)
        return item

process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start()  # the script will block here until the crawling is finished
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)? I think it's an issue with getting the settings as they are set within a "normal" Scrapy project (where you have to run scrapy crawl xyz), because the output says:
2016-11-18 10:38:42 [scrapy] INFO: Overridden settings: {}
I hope you understand my question(s) (English isn't my native language... ;))
Thanks in advance!
When running a crawl with a script (and not scrapy crawl), one of the options is indeed to use CrawlerProcess.
So what is meant wiht "inside a Scrapy project"?
What is meant is if you run your scripts at the root of a scrapy project created with scrapy startproject, i.e. where you have the scrapy.cfg file with the [settings] section among others.
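For reference, the scrapy.cfg generated by scrapy startproject typically looks roughly like this (the module name here is just an example):
[settings]
default = myproject.settings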
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)?
Read the documentation on scrapy.crawler.CrawlerProcess.crawl() for details:
Parameters:
crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
I don't know that part of the framework in detail, but I suspect the following: with a spider name only (I believe you meant process.crawl("linkspider"), i.e. the name as a string), and outside of a Scrapy project, Scrapy does not know where to look for spiders (it has no hint). Hence, to tell Scrapy which spider to run, you might as well give the class directly (and not an instance of a spider class).
get_project_settings() is a helper, but essentially, CrawlerProcess needs to be initialized with a Settings object (see https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess)
In fact, it also accepts a settings dict (which is internally converted into a Settings instance), as shown in the example you linked to:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
So depending on what settings you need to override compared to scrapy defaults, you need to do something like:
process = CrawlerProcess({
    'SOME_SETTING_KEY': somevalue,
    'SOME_OTHERSETTING_KEY': someothervalue,
    ...
})
process.crawl(mySpider)
...

Scrapy - Can't call scraper from a script in parent folder to scrapy project

I've got a bit of a strange one that I can't get my head around here:
I've set up a web scraper using Scrapy and it performs the scrape fine when I run the following file from the CLI ($ python journal_scraper.py):
journal_scraper.py:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def checkForUpdates():
    process = CrawlerProcess(get_project_settings())
    process.crawl('journal')
    process.crawl('article')
    process.start()

if __name__ == '__main__':
    checkForUpdates()
The process is able to find the two spiders journal and article without a problem.
Now, I'd like to call this scrape as one of many steps within an application that I'm developing, so from the parent folder of the Scrapy project I import journal_scraper.py into my main.py file and try to run the checkForUpdates() function:
main.py:
from scripts.journal_scraper import checkForUpdates
checkForUpdates()
and I get the following:
2016-01-10 20:30:56 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-01-10 20:30:56 [scrapy] INFO: Optional features available: ssl, http11
2016-01-10 20:30:56 [scrapy] INFO: Overridden settings: {}
Traceback (most recent call last):
File "main.py", line 13, in <module>
checkForUpdates()
File "/Users/oldo/Python/projects/AMS-Journal-Scraping/AMS_Journals/scripts/journal_scraper.py", line 8, in checkForUpdates
process.crawl('journal')
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/crawler.py", line 150, in crawl
crawler = self._create_crawler(crawler_or_spidercls)
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/crawler.py", line 165, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/spiderloader.py", line 40, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: journal'
I've also tried changing main.py to:
import subprocess
subprocess.call('python ./scripts/scraper.py', shell=True)
Which yields the same error.
I'm pretty sure that it has something to do with the fact that I am calling this function from the parent folder, because if I make a little test script in the same folder as journal_scraper.py that does the same thing as main.py, the scraper runs as expected.
Is there some sort of restriction on calling scrapers from a script external to the Scrapy project?
Please ask for further details if my situation is not clear.
Although it is very late, if you are still looking for the solution, try importing the class of your spider:
from parent1.parent1.spiders.spider_file_name import spider_class_name
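A minimal sketch of that suggestion (parent1, spider_file_name and spider_class_name are placeholders from the line above; substitute your real package, module and spider class names, e.g. the class behind the 'journal' spider):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# placeholder import; replace with your actual project path and spider class
from parent1.parent1.spiders.spider_file_name import spider_class_name

def checkForUpdates():
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_class_name)  # pass the class itself instead of the name string
    process.start()
Passing the class directly avoids the spider-loader lookup by name, which is what raises "Spider not found" when the project settings cannot be resolved from the parent folder.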

Trying to run a scrapy crawler from another location within script

All,
I'm trying to fully automate my scraping, which is formed by 3 steps:
1- Get the list of index pages for advertisements (Non-scrapy work, for various reasons)
2- Get the list of advertisement URLs from the index pages obtained in step one (Scrapy work)
My scrapy project is in the usual directory:
C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py
(the name of the spider inside the GetAdUrls_spider file is name = "getadurls")
My script to automate the step 1 and 2 is in this directory:
C:\Website_DATA\SCRIPTS\StepByStepLauncher.py
I have tried using the Scrapy documentation to import the crawler and run from inside the script using the following code:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
Unfortunately, I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider" when I try to run this script. I tried changing the working directory to several different locations and played around with names, but nothing seemed to work.
Would appreciate any help. Thanks!
If you do have __init__.py in C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders then try modifying your script this way
import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
