I have code that uses the Scrapy framework. Here is the code:
import scrapy
from scrapy.crawler import CrawlerProcess


class DemoSpider(scrapy.Spider):
    name = "DemoSpider"

    def start_requests(self):
        urls = ['http://quotes.toscrape.com/page/1/', 'http://quotes.toscrape.com/page/2/']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split('/')[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved File %s' % filename)


process = CrawlerProcess()
process.crawl(DemoSpider)
process.start()
The code works well when I run it from the terminal (Windows 10 PowerShell) with python demo.py.
But I need to run the code from the Spyder IDE. When I try, I get an error like this:
ReactorBase.startRunning(self)
File "C:\ProgramData\Anaconda3\lib\site-packages\twisted\internet\base.py", line 765, in startRunning
raise error.ReactorNotRestartable()
ReactorNotRestartable
(Spyder maintainer here) Please go to the menu Run > Configuration per file and activate the option Execute in an external system terminal.
That will run your code in a regular Python interpreter, which will avoid the problems you're having to start the server that runs the scraper in our IPython console.
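If you would rather stay in the IPython console, a similar effect can probably be had by launching the script in a child process, so that every run starts in a fresh interpreter. This is only a minimal sketch using the standard library; the helper name run_in_fresh_interpreter is hypothetical, and demo.py is the script name from the question.

```python
import subprocess
import sys


def run_in_fresh_interpreter(script_path):
    """Run a Python script in a new interpreter process.

    Returns a tuple (exit_code, stdout, stderr).
    """
    # sys.executable is the interpreter running this code, so the child
    # process uses the same Python installation and installed packages.
    completed = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stdout, completed.stderr


# Example call (assumes demo.py is in the current working directory):
# code, out, err = run_in_fresh_interpreter("demo.py")
# print(code)
# print(err)  # Scrapy writes its log to stderr by default
```

Because the crawl runs in its own process, the console's long-lived interpreter never has to start a reactor a second time.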
Related
I have a scrapy spider that imports a dictionary from another module. My main contentspider.py that contains the spider also contains an import statement from spider_project.spider_project.updated_kw import translated_kw_dicts.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from spider_project.spider_project.updated_kw import translated_kw_dicts


class ContentSpider(CrawlSpider):
    name = 'content_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']
    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(translated_kw_dicts)
This import statement works fine when I run the spider as a script, but when I run it via conda I get an error:
from spider_project.spider_project.updated_kw import translated_kw_dicts
ModuleNotFoundError: No module named 'spider_project.spider_project'
From conda I am running the spider as it should be run, from the directory containing the .cfg file: c:/.../.../spider_project
If I change the import statement to from spider_project.updated_kw import translated_kw_dicts (notice that I'm taking out the first spider_project directory), then the spider runs fine via conda, but I get the error Cannot find reference 'updated_kw' in 'imported module spider_project' in my script.
Can someone advise why this is happening?
Here is my project structure:
When you click on the long path line at the top of the file explorer, it should become active. You can copy it and use the file-reading technique.
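One way to see why the two import spellings behave differently is to check which directory is on sys.path in each case. The sketch below is an illustration only, and it assumes the cause is a different search root in the two run modes, which the question does not confirm. It builds a throwaway tree with the hypothetical names outer_proj/outer_proj/updated_kw.py, standing in for spider_project/spider_project/updated_kw.py, and tests each dotted name against each root.

```python
import importlib
import os
import sys
import tempfile


def importable(module_name, search_root):
    """Return True if module_name imports with search_root put first on sys.path."""
    saved_path = list(sys.path)
    saved_modules = set(sys.modules)
    sys.path.insert(0, search_root)
    importlib.invalidate_caches()
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        # Undo the changes so every check starts from a clean state.
        sys.path[:] = saved_path
        for name in set(sys.modules) - saved_modules:
            del sys.modules[name]


with tempfile.TemporaryDirectory() as base:
    # Hypothetical layout: base/outer_proj/outer_proj/updated_kw.py
    inner = os.path.join(base, "outer_proj", "outer_proj")
    os.makedirs(inner)
    with open(os.path.join(inner, "__init__.py"), "w") as f:
        f.write("")
    with open(os.path.join(inner, "updated_kw.py"), "w") as f:
        f.write("translated_kw_dicts = {}\n")

    parent_dir = base                             # contains outer_proj/
    project_dir = os.path.join(base, "outer_proj")  # where scrapy.cfg would live

    results = {
        "long name, parent dir": importable("outer_proj.outer_proj.updated_kw", parent_dir),
        "long name, project dir": importable("outer_proj.outer_proj.updated_kw", project_dir),
        "short name, project dir": importable("outer_proj.updated_kw", project_dir),
    }

for label, ok in results.items():
    print(label, ok)
```

On Python 3 the long name should only resolve when the parent directory is the search root, and the short name resolves when the project directory is. That would match a setup where the IDE adds the parent directory to the path while scrapy crawl works from the directory holding the .cfg file.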
I'm new to Python and web scraping, so please excuse my ignorance. In this program, I want to run my spider on a schedule. I use Python 3.7 and macOS.
I wrote a cron job using crontab that calls a shell script to run the Scrapy spider. However, it executed only once, ending with the line "INFO: Closing spider (finished)", and did not repeat according to the schedule. I ran a simple Python script to test the schedule, and that worked, so the issue seems to occur only with the spider. Please help me understand how to fix this. Any help would be appreciated. Thank you.
import csv
import os
import random
from time import sleep

import scrapy


class spider1(scrapy.Spider):
    name = "amspider"

    # Runs once, when the class body is executed: empty data.csv if it has content.
    with open("data.csv", "a") as filee:
        if os.stat("data.csv").st_size != 0:
            filee.truncate(0)

    def start_requests(self):
        # Named urls rather than list, which would shadow the built-in.
        urls = ["https://www.example.com/item1",
                "https://www.example.com/item2",
                "https://www.example.com/item3",
                "https://www.example.com/item4",
                "https://www.example.com/item5"
                ]
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
            sleep(random.randint(0, 5))

    def parse(self, response):
        product_name = response.css('#pd-h1-cartridge::text')[0].extract()
        product_price = response.css(
            '.product-price .is-current, .product-price_total .is-current, .product-price_total ins, .product-price ins').css(
            '::text')[3].extract()
        print(product_name)
        print(product_price)
        # The with block closes the file, so no explicit close() is needed.
        with open('data.csv', 'a') as file:
            itemwriter = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            itemwriter.writerow([str(product_name).strip(), str(product_price).strip()])
amsp.sh
#!/bin/sh
cd /Users/amal/PycharmProjects/AmProj2/amazonspider
PATH=$PATH:/usr/local/bin/
export PATH
scrapy crawl amspider
crontab
I tried both ways, but the spider executed only once.
*/2 * * * * /Users/amal/Documents/amsp.sh
*/2 * * * * cd /Users/amal/PycharmProjects/AmProj2/amazonspider && scrapy crawl amspider
I want to run Scrapy as a Python script, but I cannot figure out how to set the settings correctly or how to provide them. I'm not sure whether it's a settings issue, but I assume so.
My config:
Python 2.7 x86 (as virtual environment)
Scrapy 1.2.1
Win 7 x64
I followed the advice from https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script to get it running. I have some issues with the following advice:
If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.
So what is meant by "inside a Scrapy project"? Of course I have to import the libraries and have the dependencies installed, but I want to avoid starting the crawling process with scrapy crawl xyz.
Here's the code of myScrapy.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import os, argparse

# Initialization of directories
projectDir = os.path.dirname(os.path.realpath(__file__))
generalOutputDir = os.path.join(projectDir, 'output')

parser = argparse.ArgumentParser()
parser.add_argument("url", help="The url which you want to scan", type=str)
args = parser.parse_args()
urlToScan = args.url

# Stripping of given URL to get only the host + TLD
# (elif, so that an https URL is not also handled by the http branch)
if urlToScan.startswith("https://"):
    urlToScanNoProt = urlToScan.replace("https://", "")
    print "used protocol: https"
elif urlToScan.startswith("http://"):
    urlToScanNoProt = urlToScan.replace("http://", "")
    print "used protocol: http"


class myItem(Item):
    url = Field()


class mySpider(CrawlSpider):
    name = "linkspider"
    allowed_domains = [urlToScanNoProt]
    start_urls = [urlToScan,]
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True), )

    def generateDirs(self):
        if not os.path.exists(generalOutputDir):
            os.makedirs(generalOutputDir)
        specificOutputDir = os.path.join(generalOutputDir, urlToScanNoProt)
        if not os.path.exists(specificOutputDir):
            os.makedirs(specificOutputDir)
        return specificOutputDir

    def parse_url(self, response):
        for link in LinkExtractor().extract_links(response):
            item = myItem()
            item['url'] = response.url
        specificOutputDir = self.generateDirs()
        filename = os.path.join(specificOutputDir, response.url.split("/")[-2] + ".html")
        with open(filename, "wb") as f:
            f.write(response.body)
        return CrawlSpider.parse(self, response)
        return item  # never reached: the return above always exits first


process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start()  # the script will block here until the crawling is finished
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)? I think it's an issue with getting the settings as they are set within a "normal" Scrapy project (where you have to run scrapy crawl xyz), because the output says
2016-11-18 10:38:42 [scrapy] INFO: Overridden settings: {}
I hope you understand my question(s) (English isn't my native language... ;))
Thanks in advance!
When running a crawl with a script (and not scrapy crawl), one of the options is indeed to use CrawlerProcess.
So what is meant by "inside a Scrapy project"?
It means running your scripts from the root of a Scrapy project created with scrapy startproject, i.e. the directory that contains the scrapy.cfg file with the [settings] section, among others.
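To check whether a given working directory counts as "inside a project" in this sense, you can look for scrapy.cfg in that directory and its parents. As far as I know, Scrapy does a similar upward search itself, but the helper below is a hypothetical standard-library sketch and not Scrapy's own code, so the details may differ.

```python
import os


def find_scrapy_cfg(start_dir):
    """Return the path of the nearest scrapy.cfg at or above start_dir, or None."""
    current = os.path.abspath(start_dir)
    while True:
        candidate = os.path.join(current, "scrapy.cfg")
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            return None
        current = parent


# Prints the path of the closest scrapy.cfg, or None when run outside a project.
print(find_scrapy_cfg(os.getcwd()))
```

If this prints None for the directory you start your script from, get_project_settings() most likely has no project settings to pick up from there.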
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)?
Read the documentation on scrapy.crawler.CrawlerProcess.crawl() for details:
Parameters:
crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
I don't know that part of the framework well, but I suspect that with only a spider name (I believe you meant process.crawl("linkspider"), with quotes) and outside of a Scrapy project, Scrapy does not know where to look for spiders; it has no hint. Hence, to tell Scrapy which spider to run, you might as well pass the class directly (and not an instance of the spider class).
get_project_settings() is a helper, but essentially, CrawlerProcess needs to be initialized with a Settings object (see https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess)
In fact, it also accepts a settings dict (which is internally converted into a Settings instance), as shown in the example you linked to:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
So depending on what settings you need to override compared to scrapy defaults, you need to do something like:
process = CrawlerProcess({
    'SOME_SETTING_KEY': somevalue,
    'SOME_OTHERSETTING_KEY': someothervalue,
    ...
})
process.crawl(mySpider)
...
I've been trying to run Scrapy from a Python script file because I need to get the data and save it into my database. When I run it with the scrapy command
scrapy crawl argos
the script runs fine,
but when I try to run it from a script, following this link
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
I get this error:
$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
File "pricewatch/pricewatch.py", line 39, in <module>
main()
File "pricewatch/pricewatch.py", line 31, in main
update()
File "pricewatch/pricewatch.py", line 24, in update
setup_crawler("argos.co.uk")
File "pricewatch/pricewatch.py", line 13, in setup_crawler
settings = get_project_settings()
File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
settings_module = import_module(settings_module_path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
ImportError: No module named settings
I am unable to understand why get_project_settings() cannot find the settings module here, while everything runs fine with the scrapy command in the terminal.
Here is a screenshot of my project:
Here is the pricewatch.py code:
import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings


def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()


def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()


def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            #db = DBInstance()
            pass  # placeholder so the otherwise empty branch is valid Python
    except IndexError:
        print "You must select a command from Update, Search, History"


if __name__ == '__main__':
    main()
I have fixed it.
I just needed to move pricewatch.py to the project's top-level directory; running it from there solved the problem.
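A likely reason the move helps is that Python puts the directory of the script being run at the front of sys.path, so the script's location decides which modules and packages can be found. The sketch below only demonstrates that behaviour with a throwaway directory. The file name show_path.py is hypothetical, the pricewatch folder name is borrowed from the question, and whether this fully explains the original ImportError is an assumption.

```python
import os
import subprocess
import sys
import tempfile


def first_sys_path_entry(script_dir):
    """Write a tiny script into script_dir, run it, and return the sys.path[0] it reports."""
    script = os.path.join(script_dir, "show_path.py")  # hypothetical file name
    with open(script, "w") as f:
        f.write("import sys\nprint(sys.path[0])\n")
    completed = subprocess.run(
        [sys.executable, script], capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()


with tempfile.TemporaryDirectory() as project_root:
    package_dir = os.path.join(project_root, "pricewatch")
    os.makedirs(package_dir)
    # Same script, two locations: the top-level directory and the package directory.
    from_root = first_sys_path_entry(project_root)
    from_package = first_sys_path_entry(package_dir)

print(from_root)
print(from_package)
```

Run from the top-level directory, the package folder is visible as a package; run from inside it, only its contents are directly importable.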
This answer is heavily copied from this answer, which I believe answers your question and additionally provides a decent example.
Consider a project with the following structure.
my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py      # Call from main goes here
        scrapy.cfg          # deploy configuration file
        scraper/            # project's Python module, you'll import your code from here
            __init__.py
            items.py        # project items definition file
            pipelines.py    # project pipelines file
            settings.py     # project settings file
            spiders/        # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py    # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in the my_project folder. I've added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, ie. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also, note that the settings might require a look-over, since the module paths need to be written as seen from the root folder (my_project, not scraper).
So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
etc...
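One detail of the os.environ.setdefault call in run_scraper.py is worth knowing: it only sets SCRAPY_SETTINGS_MODULE when the variable is not already defined, so a value exported in your shell would win. A small sketch of that behaviour follows; the second module path is hypothetical, and the real environment is restored at the end.

```python
import os

VAR = "SCRAPY_SETTINGS_MODULE"

# Remember any real value so it can be restored afterwards.
saved = os.environ.pop(VAR, None)
try:
    # Not set yet: setdefault stores the value and returns it.
    first = os.environ.setdefault(VAR, "scraper.scraper.settings")
    # Already set: setdefault keeps the existing value and ignores the new one.
    second = os.environ.setdefault(VAR, "some_other.settings")  # hypothetical path
    current = os.environ[VAR]
finally:
    if saved is None:
        os.environ.pop(VAR, None)
    else:
        os.environ[VAR] = saved

print(first)
print(second)
print(current)
```

All three values are the first module path. If you need to force a particular settings module regardless of the environment, assigning os.environ[VAR] directly would do that instead.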
All,
I'm trying to fully automate my scraping, which consists of 3 steps:
1- Get the list of index pages for advertisements (Non-scrapy work, for various reasons)
2- Get the list of advertisement URLs from the index pages obtained in step one (Scrapy work)
My scrapy project is in the usual directory:
C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py
(The name of the spider inside the "GetAdUrls_spider" file is name = "getadurls".)
My script to automate steps 1 and 2 is in this directory:
C:\Website_DATA\SCRIPTS\StepByStepLauncher.py
I have tried using the Scrapy documentation to import the crawler and run from inside the script using the following code:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
Unfortunately, I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider" when I try to run this script. I tried changing the working directory to several different locations and played around with names, but nothing seemed to work.
I would appreciate any help. Thanks!
If you do have __init__.py in C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders, then try modifying your script this way:
import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here