Running Scrapy from a Python script

I've been trying to run Scrapy from a Python script because I need to get the data and save it into my database. When I run it with the scrapy command
scrapy crawl argos
the crawl runs fine, but when I try to run it from a script, following this link
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
I get this error:
$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
File "pricewatch/pricewatch.py", line 39, in <module>
main()
File "pricewatch/pricewatch.py", line 31, in main
update()
File "pricewatch/pricewatch.py", line 24, in update
setup_crawler("argos.co.uk")
File "pricewatch/pricewatch.py", line 13, in setup_crawler
settings = get_project_settings()
File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
settings_module = import_module(settings_module_path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
ImportError: No module named settings
I am unable to understand why it doesn't find get_project_settings() here but runs fine with the scrapy command in the terminal.
Here is a screenshot of my project:
Here is the pricewatch.py code:
import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings


def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()


def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()


def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            #db = DBInstance()
            pass
    except IndexError:
        print "You must select a command from Update, Search, History"


if __name__ == '__main__':
    main()

I have fixed it.
Moving pricewatch.py to the project's top-level directory and running it from there solved it.
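For context, get_project_settings() only works when it can locate the project's settings module, either by finding scrapy.cfg near the current working directory or through the SCRAPY_SETTINGS_MODULE environment variable. Below is a minimal sketch of an alternative that leaves the script where it is, assuming the settings module is called pricewatch.settings (check the [settings] entry in your scrapy.cfg):

import os
import sys

from scrapy.utils.project import get_project_settings

# Hypothetical path to the directory that contains scrapy.cfg and the
# inner pricewatch/ package -- adjust to the real location.
project_root = "/path/to/pricewatch"
sys.path.append(project_root)

# Tell Scrapy which settings module to load before asking for the settings.
os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "pricewatch.settings")

settings = get_project_settings()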

This answer is heavily copied from this answer, which I believe answers your question and additionally provides a decent example.
Consider a project with the following structure.
my_project/
    main.py                   # Where we are running scrapy from
    scraper/
        run_scraper.py        # Call from main goes here
        scrapy.cfg            # deploy configuration file
        scraper/              # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items definition file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper is executed in the my_project folder. Then I added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and quotes_spider.py to the spiders folder.
My main file:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also note that the settings might require a look-over, since the paths need to be given relative to the root folder (my_project, not scraper).
So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
etc...
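For completeness, here is a minimal quotes_spider.py sketch that would fit this layout. It assumes the standard quotes.toscrape.com example from the Scrapy tutorial; the real spider can of course scrape anything:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield a simple dict per quote found on the page.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }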

Related

Import settings.py Scrapy

I've got a testmultiple folder which contains an __init__.py file, pipelines, settings, and a core.py file that I use to launch several spiders located in a subfolder (spiders). I noticed that I had to import the settings in order to use the pipelines with CrawlerProcess. Here is my code:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings
import settings as my_settings
from spiders.DemoSpider import DemoSpider
from spiders.DemoSpider2 import DemoSpider2
crawler_settings = Settings()
crawler_settings.setmodule(my_settings)
process = CrawlerProcess(settings=crawler_settings)
process.crawl(DemoSpider)
process.crawl(DemoSpider2)
process.start() # the script will block here until the crawling is finished
But it fails on the 4th line (the import of settings). With this attempt I get:
ModuleNotFoundError: No module named 'testmultiple'
When I try:
from testmultiple.settings import settings as my_settings
I get the same error, and with this line as well:
from testmultiple import settings as my_settings
How do I import settings.py?
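No answer is recorded here, but a common workaround is to let Scrapy resolve the settings module itself instead of importing settings manually. A sketch under two assumptions: the script sits inside the testmultiple folder, and scrapy.cfg points at testmultiple.settings:

import os
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Make the parent of the testmultiple package importable so that
# 'testmultiple.settings' and 'testmultiple.spiders.*' resolve.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'testmultiple.settings')

from testmultiple.spiders.DemoSpider import DemoSpider
from testmultiple.spiders.DemoSpider2 import DemoSpider2

process = CrawlerProcess(get_project_settings())
process.crawl(DemoSpider)
process.crawl(DemoSpider2)
process.start()  # the script will block here until the crawling is finished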

How to avoid ImportError in twisted tac file?

I have a twisted tac file (twisted_service.py) with the following code:
from twisted.application import service
# application.py file in the same dir
from .application import setup_reactor


class WebsocketService(service.Service):

    def startService(self):
        service.Service.startService(self)
        setup_reactor()


application = service.Application("ws")
ws_service = WebsocketService()
ws_service.setServiceParent(application)
Here is the application.py file, which sets up the reactor:
# -*- coding: utf-8 -*-
from twisted.web.server import Site
from twisted.web.static import Data
from twisted.internet import reactor, defer
from autobahn.twisted.resource import WebSocketResource
from autobahn.twisted.websocket import WebSocketServerFactory
from txsni.snimap import SNIMap
from txsni.maputils import Cache
from txsni.snimap import HostDirectoryMap
from twisted.python.filepath import FilePath
from tools.database.async import pg_conn
from tools.database import makedsn
from tools.config import main_db
from tools.modules.external import flask_setup
import tools.config as config
import websockethandlers as wsh
from pytrapd import TrapsListener

PROTOCOLMAP = {
    'portcounters': wsh.PortCounters,
    'eqcounters': wsh.EquipmentCounters,
    'settings': wsh.Settings,
    'refresh': wsh.Refresher,
    'montraps': wsh.TrapsMonitoring,
    'fdbs': wsh.FdbParser,
    'portstate': wsh.PortState,
    'cable': wsh.CableDiagnostic,
    'eqcable': wsh.EquipmentCableDiagnostic,
    'igmp': wsh.Igmp,
    'ipmac': wsh.IpMac,
    'lldp': wsh.LLDPParser,
    'alias': wsh.AliasSetup,
    'ping': wsh.Ping
}


@defer.inlineCallbacks
def setup_reactor():
    flask_setup()
    yield pg_conn.connect(makedsn(main_db))
    root = Data("", "text/plain")
    for key in PROTOCOLMAP:
        factory = WebSocketServerFactory("wss://localhost:%s" % config.ws_port)
        factory.protocol = PROTOCOLMAP[key]
        resource = WebSocketResource(factory)
        root.putChild(key, resource)
    site = Site(root)
    context_factory = SNIMap(
        Cache(HostDirectoryMap(FilePath(config.certificates_directory)))
    )
    reactor.listenSSL(config.ws_port, site, context_factory)
    traps_listener = TrapsListener()
    traps_listener.listen_traps(config.trap_ip)
    traps_listener.listen_messages(config.fifo_file)


if __name__ == '__main__':
    setup_reactor()
    import sys
    from twisted.python import log
    log.startLogging(sys.stdout)
    reactor.run()
I use the twistd -noy twisted_service.py command to run the twisted service. It worked with Twisted 16.3.2; after upgrading to any later version I get this error:
Unhandled Error
Traceback (most recent call last):
File "/home/kalombo/.virtualenvs/dev/local/lib/python2.7/site-packages/twisted/application/app.py", line 662, in run
runApp(config)
File "/home/kalombo/.virtualenvs/dev/local/lib/python2.7/site-packages/twisted/scripts/twistd.py", line 25, in runApp
_SomeApplicationRunner(config).run()
File "/home/kalombo/.virtualenvs/dev/local/lib/python2.7/site-packages/twisted/application/app.py", line 380, in run
self.application = self.createOrGetApplication()
File "/home/kalombo/.virtualenvs/dev/local/lib/python2.7/site-packages/twisted/application/app.py", line 445, in createOrGetApplication
application = getApplication(self.config, passphrase)
--- <exception caught here> ---
File "/home/kalombo/.virtualenvs/dev/local/lib/python2.7/site-packages/twisted/application/app.py", line 456, in getApplication
application = service.loadApplication(filename, style, passphrase)
File "/home/kalombo/.virtualenvs/dev/local/lib/python2.7/site-packages/twisted/application/service.py", line 412, in loadApplication
application = sob.loadValueFromFile(filename, 'application')
File "/home/kalombo/.virtualenvs/dev/local/lib/python2.7/site-packages/twisted/persisted/sob.py", line 177, in loadValueFromFile
eval(codeObj, d, d)
File "twisted_service.py", line 3, in <module>
from .application import setup_reactor
exceptions.ImportError: No module named application
How should I run the twisted service or import the module properly?
The answer I found is nearly the same as yours, something like the following:
import os
import sys
sys.path = [os.path.join(os.getcwd(), '.'), ] + sys.path
It just adds the current working directory to sys.path.
But I have not found a better method, and I don't think this one is very good.
I found the answer here: http://twistedmatrix.com/pipermail/twisted-python/2016-September/030783.html
It is a new behaviour in Twisted 16.4.0. In previous versions the twistd script automatically added the working directory to sys.path; from 16.4.0 on I have to add it manually. It can be done with something like this in the twisted_service.py file:
import os
import sys
TWISTED_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(TWISTED_DIR)
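For clarity, here is a sketch of how the top of twisted_service.py could look with that fix applied. It assumes the relative import is also changed to an absolute one, since the .tac file is not loaded as part of a package:

import os
import sys

# Twisted 16.4.0+ no longer puts the .tac file's directory on sys.path,
# so add it manually before importing modules that live next to this file.
TWISTED_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(TWISTED_DIR)

from twisted.application import service
from application import setup_reactor  # absolute import of application.py


class WebsocketService(service.Service):

    def startService(self):
        service.Service.startService(self)
        setup_reactor()


application = service.Application("ws")
ws_service = WebsocketService()
ws_service.setServiceParent(application)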

How to import model in Django

I want to execute a Python file that checks whether there are new rows in a CSV file. If there are new rows, I want to add them to a database.
The project tree is as follows:
Thus, the file I want to execute is check_relations.py inside the relations folder.
check_relations.py is as follows:
from master.models import TraxioRelations

with open('AUTOI_Relaties.csv', 'rb') as tr:
    traxio_relations = tr.readlines()

for line in traxio_relations:
    number = line.split(';')[0]
    number_exists = TraxioRelations.objects.filter(number=number)
    print(number_exists)
The model TraxioRelations is inside models.py in the master folder.
When I run python check_relations.py I get an error:
Traceback (most recent call last):
File "check_relations.py", line 3, in <module>
from master.models import TraxioRelations
ImportError: No module named master.models
What am I doing wrong? How can I import the model inside the check_relations.py file?
I think the most practical way is to create a management command.
For example, in your master directory:
master/
    management/
        __init__.py
        commands/
            __init__.py
            check_relations.py
In check_relations.py:
from django.core.management.base import BaseCommand
from master.models import TraxioRelations


class Command(BaseCommand):
    help = 'check_relations by file data'

    def handle(self, *args, **options):
        with open('AUTOI_Relaties.csv', 'rb') as tr:
            traxio_relations = tr.readlines()
        for line in traxio_relations:
            number = line.split(';')[0]
            number_exists = TraxioRelations.objects.filter(number=number)
            print(number_exists)
Don't forget to change the path to the AUTOI_Relaties.csv file, or move it to the new directory.
Then you can run in the shell:
./manage.py check_relations
When using Django, you are not supposed to run an individual module on its own. See e.g. https://docs.djangoproject.com/en/1.11/intro/tutorial01/#the-development-server.
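If a standalone script is still preferred over a management command, the usual pattern is to configure Django before importing any models. A sketch, assuming the settings module is called myproject.settings (a hypothetical name, substitute the real one) and the script is run from the project root:

import os

import django

# Point Django at the project settings and initialise the app registry
# before any model import.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # hypothetical settings module
django.setup()

from master.models import TraxioRelations

with open('AUTOI_Relaties.csv', 'rb') as tr:
    for line in tr.readlines():
        number = line.split(';')[0]
        print(TraxioRelations.objects.filter(number=number))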

Scrapy - Can't call scraper from a script in parent folder to scrapy project

I've got a bit of a strange one that I can't get my head around here:
I've set up a web scraper using Scrapy and it performs the scrape fine when I run the following file from the CLI ($ python journal_scraper.py):
journal_scraper.py:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def checkForUpdates():
    process = CrawlerProcess(get_project_settings())
    process.crawl('journal')
    process.crawl('article')
    process.start()


if __name__ == '__main__':
    checkForUpdates()
The process is able to find the two spiders journal and article without a problem.
Now, I'd like to call this scrape as one of many steps within an application that I'm developing, so from the parent folder of the Scrapy project I import journal_scraper.py into my main.py file and try to run the checkForUpdates() function:
main.py:
from scripts.journal_scraper import checkForUpdates
checkForUpdates()
and I get the following:
2016-01-10 20:30:56 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-01-10 20:30:56 [scrapy] INFO: Optional features available: ssl, http11
2016-01-10 20:30:56 [scrapy] INFO: Overridden settings: {}
Traceback (most recent call last):
File "main.py", line 13, in <module>
checkForUpdates()
File "/Users/oldo/Python/projects/AMS-Journal-Scraping/AMS_Journals/scripts/journal_scraper.py", line 8, in checkForUpdates
process.crawl('journal')
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/crawler.py", line 150, in crawl
crawler = self._create_crawler(crawler_or_spidercls)
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/crawler.py", line 165, in _create_crawler
spidercls = self.spider_loader.load(spidercls)
File "/Users/oldo/Python/virtual-environments/AMS-Journal/lib/python2.7/site-packages/scrapy/spiderloader.py", line 40, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: journal'
I've also tried changing main.py to:
import subprocess
subprocess.call('python ./scripts/scraper.py', shell=True)
Which yields the same error.
I'm pretty sure it has something to do with the fact that I am calling this function from the parent folder, because if I make a little test script in the same folder as journal_scraper.py that does the same thing as main.py, the scraper runs as expected.
Is there some sort of restriction on calling scrapers from a script external to the Scrapy project?
Please ask for further details if my situation is not clear.
Although it is very late, if you are still looking for a solution, try importing the class of your spider:
from parent1.parent1.spiders.spider_file_name import spider_class_name
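Applied to the question's layout, journal_scraper.py could be sketched to pass the spider classes instead of their string names; the module and class names below are hypothetical, so adjust them to the real project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical spider module paths and class names -- replace with the real ones.
from journal_project.spiders.journal_spider import JournalSpider
from journal_project.spiders.article_spider import ArticleSpider


def checkForUpdates():
    # Passing the classes directly avoids the by-name lookup that raised
    # KeyError: 'Spider not found: journal' when the project settings were
    # not picked up from the parent folder.
    process = CrawlerProcess(get_project_settings())
    process.crawl(JournalSpider)
    process.crawl(ArticleSpider)
    process.start()


if __name__ == '__main__':
    checkForUpdates()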

Trying to run a scrapy crawler from another location within script

All,
I'm trying to fully automate my scraping, which consists of 3 steps:
1- Get the list of index pages for advertisements (Non-scrapy work, for various reasons)
2- Get the list of advertisement URLs from the index pages obtained in step one (Scrapy work)
My scrapy project is in the usual directory:
C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py
(name of the spider inside the "GetAdUrls_spider" file is (name = "getadurls"))
My script to automate steps 1 and 2 is in this directory:
C:\Website_DATA\SCRIPTS\StepByStepLauncher.py
I have tried using the Scrapy documentation to import the crawler and run from inside the script using the following code:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider" when I try to run this script, unfortunately. I tried changing the working directory to several different locations and played around with names, but nothing seemed to work.
Would appreciate any help.. Thanks!
If you do have __init__.py in C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders, then try modifying your script this way:
import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
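On newer Scrapy releases the Crawler/configure/log API used above may no longer work as written; a sketch of the same idea with CrawlerProcess, still adding the project directory to sys.path first and assuming the settings module is GetAdUrlsFromIndex.settings:

import os
import sys

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Make the Scrapy project importable and tell Scrapy where its settings live.
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'GetAdUrlsFromIndex.settings')  # assumed module path

from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls

process = CrawlerProcess(get_project_settings())
process.crawl(getadurls, domain='website.com')
process.start()  # the script will block here until the crawling is finished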
