Trying to run a scrapy crawler from another location within script - python

All,
I'm trying to fully automate my scraping, which is formed by 3 steps:
1- Get the list of index pages for advertisements (Non-scrapy work, for various reasons)
2- Get the list of advertisement URLs from the index pages obtained in step one (Scrapy work)
My scrapy project is in the usual directory:
C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py
(the name of the spider inside the GetAdUrls_spider file is name = "getadurls")
My script to automate the step 1 and 2 is in this directory:
C:\Website_DATA\SCRIPTS\StepByStepLauncher.py
I have tried using the Scrapy documentation to import the crawler and run it from inside the script, using the following code:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
Unfortunately, I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider" when I try to run this script. I tried changing the working directory to a few different locations and played around with names, but nothing seemed to work.
Would appreciate any help.. Thanks!

If you do have an __init__.py in both C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders, then try modifying your script this way:
import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

Related

scrapy module path difference when running via script or terminal

I have a scrapy spider that imports a dictionary from another module. My main file contentspider.py contains the spider as well as the import statement from spider_project.spider_project.updated_kw import translated_kw_dicts.
from spider_project.spider_project.updated_kw import translated_kw_dicts

class ContentSpider(CrawlSpider):
    name = 'content_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(translated_kw_dicts)
This import statement works fine when the spider is running as a script, but when I run it via conda then I get an error:
spider_project.spider_project.updated_kw import translated_kw_dicts
ModuleNotFoundError: No module named 'spider_project.spider_project'
From conda I am running the spider as it should be - from the directory containing .cfg file: c:/.../.../spider_project
If I change the import statement to from spider_project.updated_kw import translated_kw_dicts (notice I'm taking out the first spider_project directory), then via conda the spider runs fine, but in my script I get the error Cannot find reference 'updated_kw' in 'imported module spider_project'.
Can someone advise why this is happening?
Here is my project structure:
When you click on the long path line at the top of the file explorer, it becomes active; you can copy it and use the file-reading technique.
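This pair of errors usually comes down to what is on sys.path, not to the current working directory: running from the .cfg directory puts a different root on the path than running the file as a script. A stdlib-only sketch of the mechanism (the directory and module names below are made up for illustration):

```python
import os
import sys
import tempfile

# Build a tiny package on disk: <root>/spider_project/__init__.py and updated_kw.py
root = tempfile.mkdtemp()
pkg = os.path.join(root, "spider_project")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "updated_kw.py"), "w") as f:
    f.write("translated_kw_dicts = {'demo': 'value'}\n")

# Whether "spider_project.updated_kw" is importable depends entirely on
# whether <root> is on sys.path -- not on the current working directory.
sys.path.insert(0, root)
from spider_project.updated_kw import translated_kw_dicts

print(translated_kw_dicts)
```

The same logic applies to the nested spider_project.spider_project layout: the import that works is the one relative to whichever directory happens to be on sys.path in that run.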

Run multiple spiders from script in scrapy in loop

I have more than 100 spiders and I want to run 5 spiders at a time using a script. For this I have created a table in the database to track the status of each spider, i.e. whether it has finished running, is running, or is waiting to run.
I know how to run multiple spiders inside a script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for demo; instead of this I
                     # find the spiders that are waiting to run from the database
    process.crawl(spider1)  # spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
    process.start()
But this is not allowed as the following error occurs:
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    process.start()
  File "/home/g/projects/venv/lib/python3.4/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1242, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 1222, in startRunning
    ReactorBase.startRunning(self)
  File "/home/g/projects/venv/lib/python3.4/site-packages/twisted/internet/base.py", line 730, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
I have searched for the above error and have not been able to resolve it. Managing spiders could be done via ScrapyD, but we do not want to use ScrapyD as many spiders are still in the development phase.
Any workaround for the above scenario is appreciated.
Thanks
For running multiple spiders simultaneously you can use this
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
The answers to this question can help you too.
For more information:
Running multiple spiders in the same process
You need ScrapyD for this purpose.
You can run as many spiders as you want at the same time, and you can constantly check whether a spider is running or not using the listjobs API.
You can set max_proc=5 in the config file, which will run a maximum of 5 spiders at a single time.
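For reference, that option lives in the scrapyd configuration file; a minimal sketch, assuming a default scrapyd install:

```ini
[scrapyd]
max_proc = 5
```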
Anyway, talking about your code: your code should work if you do this
process = CrawlerProcess(get_project_settings())
for i in range(10):  # this range is just for demo; instead of this I
                     # find the spiders that are waiting to run from the database
    process.crawl(spider1)  # spider name changes based on spider to run
    process.crawl(spider2)
    print('-------------this is the-----{}--iteration'.format(i))
process.start()
You need to place process.start() outside of loop.
I was able to implement similar functionality by removing the loop from the script and setting up a scheduler to run it every 3 minutes.
The looping functionality was achieved by keeping a record of how many spiders are currently running and checking whether more spiders need to be run. Thus, at any time, only 5 (this can be changed) spiders run concurrently.

How to set default settings for running scrapy as a python script?

I want to run scrapy as a python script, but I cannot figure out how to set the settings correctly or how to provide them. I'm not sure whether it's a settings issue, but I assume so.
My config:
Python 2.7 x86 (as virtual environment)
Scrapy 1.2.1
Win 7 x64
I took the advice from https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script to get it running. I have some issues with the following advice:
If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.
So what is meant with "inside a Scrapy project"? Of course I have to import the libraries and have the dependencies installed, but I want to avoid starting the crawling process with scrapy crawl xyz.
Here's the code of myScrapy.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import os, argparse

# Initialization of directories
projectDir = os.path.dirname(os.path.realpath('__file__'))
generalOutputDir = os.path.join(projectDir, 'output')

parser = argparse.ArgumentParser()
parser.add_argument("url", help="The url which you want to scan", type=str)
args = parser.parse_args()
urlToScan = args.url

# Stripping of given URL to get only the host + TLD
if "https" in urlToScan:
    urlToScanNoProt = urlToScan.replace("https://", "")
    print "used protocol: https"
if "http" in urlToScan:
    urlToScanNoProt = urlToScan.replace("http://", "")
    print "used protocol: http"

class myItem(Item):
    url = Field()

class mySpider(CrawlSpider):
    name = "linkspider"
    allowed_domains = [urlToScanNoProt]
    start_urls = [urlToScan, ]
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True), )

    def generateDirs(self):
        if not os.path.exists(generalOutputDir):
            os.makedirs(generalOutputDir)
        specificOutputDir = os.path.join(generalOutputDir, urlToScanNoProt)
        if not os.path.exists(specificOutputDir):
            os.makedirs(specificOutputDir)
        return specificOutputDir

    def parse_url(self, response):
        for link in LinkExtractor().extract_links(response):
            item = myItem()
            item['url'] = response.url
        specificOutputDir = self.generateDirs()
        filename = os.path.join(specificOutputDir, response.url.split("/")[-2] + ".html")
        with open(filename, "wb") as f:
            f.write(response.body)
        return CrawlSpider.parse(self, response)
        return item

process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start()  # the script will block here until the crawling is finished
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)? I think it's an issue with getting the settings as they are set within a "normal" scrapy project (where you have to run scrapy crawl xyz), because the output says
2016-11-18 10:38:42 [scrapy] INFO: Overridden settings: {}
I hope you understand my question(s) (English isn't my native language... ;))
Thanks in advance!
When running a crawl with a script (and not scrapy crawl), one of the options is indeed to use CrawlerProcess.
So what is meant with "inside a Scrapy project"?
What is meant is if you run your scripts at the root of a scrapy project created with scrapy startproject, i.e. where you have the scrapy.cfg file with the [settings] section among others.
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)?
Read the documentation on scrapy.crawler.CrawlerProcess.crawl() for details:
Parameters:
crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
(I don't know that part of the framework well, but I suspect you meant process.crawl("linkspider"), with the name as a string.) Outside of a scrapy project, scrapy does not know where to look for spiders by name (it has no hint). Hence, to tell scrapy which spider to run, you might as well give the class directly (and not an instance of a spider class).
get_project_settings() is a helper, but essentially, CrawlerProcess needs to be initialized with a Settings object (see https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess)
In fact, it also accepts a settings dict (which is internally converted into a Settings instance), as shown in the example you linked to:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
So depending on what settings you need to override compared to scrapy defaults, you need to do something like:
process = CrawlerProcess({
    'SOME_SETTING_KEY': somevalue,
    'SOME_OTHERSETTING_KEY': someothervalue,
    ...
})
process.crawl(mySpider)
...

Scrapy - running spider from a python script

I am trying to run scrapy from a python script according to documentation http://scrapy.readthedocs.io/en/0.16/topics/practices.html
def CrawlTest():
    spider = PitchforkSpider(domain='"pitchfork.com"')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()  # the script will block here
but when I run it, I get the following error:
AttributeError: 'Settings' object has no attribute 'update_settings'
has something been deprecated? what is wrong here?
my version is Scrapy 1.1.2
You are looking at Scrapy 0.16 docs but using Scrapy 1.1.2.
Here is the correct documentation page.
FYI, you should now be using CrawlerProcess or CrawlerRunner.

Running scrapy from python script

I've been trying to run scrapy from a python script file because I need to get the data and save it into my db. When I run it with the scrapy command
scrapy crawl argos
the script runs fine
but when I try to run it with a script, following this link
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
i get this error
$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
File "pricewatch/pricewatch.py", line 39, in <module>
main()
File "pricewatch/pricewatch.py", line 31, in main
update()
File "pricewatch/pricewatch.py", line 24, in update
setup_crawler("argos.co.uk")
File "pricewatch/pricewatch.py", line 13, in setup_crawler
settings = get_project_settings()
File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
settings_module = import_module(settings_module_path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
ImportError: No module named settings
I am unable to understand why it doesn't find get_project_settings() but runs fine with the scrapy command in the terminal.
here is the screenshot of my project
here is the pricewatch.py code:
import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings

def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()

def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            #db = DBInstance()
            pass
    except IndexError:
        print "You must select a command from Update, Search, History"

if __name__ == '__main__':
    main()
I have fixed it.
I just needed to put pricewatch.py in the project's top-level directory; running it from there solved it.
This answer is heavily copied from this answer, which I believe answers your question and additionally provides a decent example.
Consider a project with the following structure.
my_project/
    main.py                   # Where we are running scrapy from
    scraper/
        run_scraper.py        # Call from main goes here
        scrapy.cfg            # deploy configuration file
        scraper/              # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items definition file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in the my_project folder; I've added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os

class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also, note that the settings might require a look-over, since the path needs to be according to the root folder (my_project, not scraper).
So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
etc...
