I have a Scrapy spider that imports a dictionary from another module. My main contentspider.py contains the spider along with this import statement:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from spider_project.spider_project.updated_kw import translated_kw_dicts


class ContentSpider(CrawlSpider):
    name = 'content_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(translated_kw_dicts)
This import statement works fine when the spider runs as a script, but when I run it via conda I get an error:
from spider_project.spider_project.updated_kw import translated_kw_dicts
ModuleNotFoundError: No module named 'spider_project.spider_project'
From conda I am running the spider the way it should be run: from the directory containing the scrapy.cfg file, c:/.../.../spider_project.
If I change the import to from spider_project.updated_kw import translated_kw_dicts (dropping the first spider_project directory), then the spider runs fine via conda, but in my script I get the error Cannot find reference 'updated_kw' in 'imported module spider_project'.
Can someone advise why this is happening?
Here is my project structure:

spider_project/              <- I run scrapy crawl from here (contains scrapy.cfg)
    scrapy.cfg
    spider_project/
        __init__.py
        updated_kw.py
        spiders/
            __init__.py
            contentspider.py
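For what it's worth, a common workaround (a sketch, not from the original post) is to use the single-level import that scrapy crawl expects, with a fallback for tools that put the outer folder's parent on sys.path:

try:
    # When scrapy crawl runs from the scrapy.cfg directory, the inner
    # package is importable as plain `spider_project`.
    from spider_project.updated_kw import translated_kw_dicts
except ImportError:
    # Fallback for environments whose source root is the parent of the
    # outer spider_project folder (e.g. some IDE run configurations).
    from spider_project.spider_project.updated_kw import translated_kw_dicts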
I want to write a new service for Jupyter Notebook, but I'm having trouble figuring out how to get it to run. I've created a service similar to the default services found here: https://github.com/jupyter/notebook/tree/master/notebook/services.
I'm attempting to run it in a Docker container built from jupyter/base-notebook. I've added c.NotebookApp.extra_services = ['TestHandler'] to the Notebook config and I've copied my service to /opt/conda/lib/python3.6/site-packages/notebook/services/test.py.
When I start the Notebook server I get an error saying ModuleNotFoundError: No module named 'TestHandler' so obviously my service is not being loaded correctly. Unfortunately I can't find any documentation on how to load a service in Jupyter Notebook.
This is my test.py service:
import json

from tornado import web
from ...base.handlers import APIHandler


class TestHandler(APIHandler):

    @web.authenticated
    def get(self):
        res = {"foo": "bar"}
        self.finish(json.dumps(res))


default_handlers = [
    (r"/api/test", TestHandler),
]
That config value expects a list of importable modules which have a default_handlers attribute. You can therefore either use a full path:
# assuming your `test.py` lives in a top level package `mypackage`
c.NotebookApp.extra_services = ['mypackage.test']
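If you go the full-path route, the layout would look something like this (mypackage is a placeholder for any package that is actually importable, i.e. installed or on the Python path):

mypackage/
    __init__.py
    test.py        # the service module from the question, exposing default_handlers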
Alternatively, you can construct a module directly:
import sys
from types import ModuleType
name = 'some_long_ass_name_that_doesnt_conflict'
sys.modules[name] = ModuleType(name) # make the import machinery find it
sys.modules[name].default_handlers = [...]
c.NotebookApp.extra_services = [name]
I want to run Scrapy as a Python script, but I cannot figure out how to set the settings correctly or how to provide them. I'm not sure whether it's a settings issue, but I assume it is.
My config:
Python 2.7 x86 (as virtual environment)
Scrapy 1.2.1
Win 7 x64
I took the advice from https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script to get it running, but I have some issues with the following part:
If you are inside a Scrapy project there are some additional helpers you can use to import those components within the project. You can automatically import your spiders passing their name to CrawlerProcess, and use get_project_settings to get a Settings instance with your project settings.
So what is meant with "inside a Scrapy project"? Of course I have to import the libraries and have the dependencies installed, but I want to avoid starting the crawling process with scrapy crawl xyz.
Here's the code of myScrapy.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field
import os, argparse

# Initialization of directories
projectDir = os.path.dirname(os.path.realpath('__file__'))
generalOutputDir = os.path.join(projectDir, 'output')

parser = argparse.ArgumentParser()
parser.add_argument("url", help="The url which you want to scan", type=str)
args = parser.parse_args()
urlToScan = args.url

# Stripping of given URL to get only the host + TLD
if "https" in urlToScan:
    urlToScanNoProt = urlToScan.replace("https://", "")
    print "used protocol: https"
if "http" in urlToScan:
    urlToScanNoProt = urlToScan.replace("http://", "")
    print "used protocol: http"

class myItem(Item):
    url = Field()

class mySpider(CrawlSpider):
    name = "linkspider"
    allowed_domains = [urlToScanNoProt]
    start_urls = [urlToScan, ]
    rules = (Rule(LinkExtractor(), callback='parse_url', follow=True), )

    def generateDirs(self):
        if not os.path.exists(generalOutputDir):
            os.makedirs(generalOutputDir)
        specificOutputDir = os.path.join(generalOutputDir, urlToScanNoProt)
        if not os.path.exists(specificOutputDir):
            os.makedirs(specificOutputDir)
        return specificOutputDir

    def parse_url(self, response):
        for link in LinkExtractor().extract_links(response):
            item = myItem()
            item['url'] = response.url
        specificOutputDir = self.generateDirs()
        filename = os.path.join(specificOutputDir, response.url.split("/")[-2] + ".html")
        with open(filename, "wb") as f:
            f.write(response.body)
        return CrawlSpider.parse(self, response)
        return item  # note: unreachable, the return above already exits

process = CrawlerProcess(get_project_settings())
process.crawl(mySpider)
process.start()  # the script will block here until the crawling is finished
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)? I think it's an issue with getting the settings as they would be set within a "normal" Scrapy project (where you have to run scrapy crawl xyz), because the output says
2016-11-18 10:38:42 [scrapy] INFO: Overridden settings: {}
I hope you understand my question(s) (English isn't my native language... ;))
Thanks in advance!
When running a crawl with a script (and not scrapy crawl), one of the options is indeed to use CrawlerProcess.
So what is meant with "inside a Scrapy project"?
What is meant is if you run your scripts at the root of a scrapy project created with scrapy startproject, i.e. where you have the scrapy.cfg file with the [settings] section among others.
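For reference, the relevant part of a generated scrapy.cfg looks like this (myproject is a placeholder for your project name):

[settings]
default = myproject.settings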
Why do I have to call process.crawl(mySpider) and not process.crawl(linkspider)?
Read the documentation on scrapy.crawler.CrawlerProcess.crawl() for details:
Parameters:
crawler_or_spidercls (Crawler instance, Spider subclass or string) – already created crawler, or a spider class or spider’s name inside the project to create it
I don't know that part of the framework in detail, but I suspect that with a spider name only (I believe you meant and not process.crawl("linkspider")), and outside of a Scrapy project, Scrapy does not know where to look for spiders (it has no hint). Hence, to tell Scrapy which spider to run, you might as well give the class directly (and not an instance of a spider class).
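To illustrate both cases (a sketch reusing the names from the question):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Inside a project (run from the directory containing scrapy.cfg), a
# spider name works, because the project settings tell Scrapy where
# the spider modules live:
process = CrawlerProcess(get_project_settings())
process.crawl("linkspider")

# Outside a project there is no spider registry, so pass the class:
# process = CrawlerProcess({'USER_AGENT': 'my-bot'})
# process.crawl(mySpider)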
get_project_settings() is a helper, but essentially, CrawlerProcess needs to be initialized with a Settings object (see https://docs.scrapy.org/en/latest/topics/api.html#scrapy.crawler.CrawlerProcess)
In fact, it also accepts a settings dict (which is internally converted into a Settings instance), as shown in the example you linked to:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
So depending on what settings you need to override compared to scrapy defaults, you need to do something like:
process = CrawlerProcess({
    'SOME_SETTING_KEY': somevalue,
    'SOME_OTHERSETTING_KEY': someothervalue,
    ...
})
process.crawl(mySpider)
...
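If get_project_settings() keeps returning an empty settings object (the Overridden settings: {} line), a quick way to check whether Scrapy can see your project is (a sketch):

import os
from scrapy.utils.project import get_project_settings

print(os.getcwd())  # should be the directory containing scrapy.cfg (or below it)
settings = get_project_settings()
print(settings.get('BOT_NAME'))  # the default 'scrapybot' means no project was found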
File setup:
...\Project_Folder
...\Project_Folder\Project.py
...\Project_folder\Script\TestScript.py
I'm attempting to have Project.py import modules from the folder Script based on user input.
Python Version: 3.4.2
Ideally, the script would look something like
q = str(input("Input: "))
from Script import q
However, Python does not treat q as a variable in an import statement; it tries to import the literal name q.
I've tried using importlib, however I cannot figure out how to import from the Script folder mentioned above.
import importlib
q = str(input("Input: "))
module = importlib.import_module(q, package=None)
I'm not certain where I would implement the file path.
Repeat of my answer originally posted at "How to import a module given the full path?", as this is a Python 3.4-specific question:
This area of Python 3.4 seems to be extremely tortuous to understand, mainly because the documentation doesn't give good examples! This was my attempt using non-deprecated modules. It will import a module given the path to the .py file. I'm using it to load "plugins" at runtime.
import importlib.util
import os

def import_module_from_file(full_path_to_module):
    """
    Import a module given the full path/filename of the .py file.
    Python 3.4.
    """
    module = None
    try:
        # Get module name and path from full path
        module_dir, module_file = os.path.split(full_path_to_module)
        module_name, module_ext = os.path.splitext(module_file)
        # Get module "spec" from filename
        spec = importlib.util.spec_from_file_location(module_name, full_path_to_module)
        module = spec.loader.load_module()
    except Exception as ec:
        # Simple error printing
        # Insert "sophisticated" stuff here
        print(ec)
    finally:
        return module
# load module dynamically
path = "<enter your path here>"
module = import_module_from_file(path)
# Now use the module
# e.g. module.myFunction()
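On Python 3.5 and newer (not 3.4, which this question targets), the non-deprecated idiom replaces spec.loader.load_module() with module_from_spec() plus exec_module(); a sketch with a placeholder path:

import importlib.util

spec = importlib.util.spec_from_file_location("plugin", "/path/to/plugin.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
# module is now usable, e.g. module.myFunction()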
I did this by defining the entire import line as a string, formatting the string with q and then using the exec command:
imp = 'from Script import %s' % q
exec(imp)
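A safer alternative to exec, assuming Script is a package (it has an __init__.py) importable from where Project.py runs, is importlib.import_module with a dotted name (a sketch):

import importlib

q = str(input("Input: "))
module = importlib.import_module("Script." + q)  # imports Script/<q>.py
# access its names through the module object, e.g. module.some_function()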
I've downloaded the spider.py 0.5 module from here. Inside the spider.py file there are lots of functions, one of them is:-
def webspider(self, b=None, w=200, d=5, t=None):
    '''Returns two lists of child URLs and paths

    b -- base web URL (default: None)
    w -- amount of resources to crawl (default: 200)
    d -- depth in hierarchy to crawl (default: 5)
    t -- number of threads (default: None)'''
    if b: self.weburls(b, w, d, t)
    return self.webpaths(), self.urls
I've created a new file in the same directory called run.py with the following code:-
import spider
webspider(b='http://example.com', w=200, d=5, t=5)
When I execute run.py I'm getting the following message:
NameError: name 'webspider' is not defined
Any ideas on how I can correctly use this module? I would like all links found to be saved into a file called urls.txt.
You should call it like this:
import spider
spider.webspider(b='http://example.com', w=200, d=5, t=5)
Or you can import only webspider:
from spider import webspider
webspider(b='http://example.com', w=200, d=5, t=5)
You can also rename it on import:
from spider import webspider as myspider
myspider(b='http://example.com', w=200, d=5, t=5)
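One caveat: webspider is defined with a self parameter, so it looks like a method of a class inside spider.py rather than a module-level function; if that is the case, you need an instance first. A sketch that also covers writing urls.txt (the class name Spider is an assumption, check the module for the real one):

import spider

crawler = spider.Spider()  # hypothetical class name; inspect spider.py
paths, urls = crawler.webspider(b='http://example.com', w=200, d=5, t=5)

# save every discovered link to urls.txt, one per line
with open('urls.txt', 'w') as f:
    f.write('\n'.join(urls))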
All,
I'm trying to fully automate my scraping, which consists of 3 steps:
1- Get the list of index pages for advertisements (Non-scrapy work, for various reasons)
2- Get the list of advertisement URLs from the index pages obtained in step one (Scrapy work)
My scrapy project is in the usual directory:
C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py
(the name of the spider inside the GetAdUrls_spider file is name = "getadurls")
My script to automate the step 1 and 2 is in this directory:
C:\Website_DATA\SCRIPTS\StepByStepLauncher.py
I have tried using the Scrapy documentation to import the crawler and run from inside the script using the following code:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
Unfortunately, I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider" when I try to run this script. I tried changing the working directory to several different locations and played around with names, but nothing seemed to work.
Would appreciate any help.. Thanks!
If you do have __init__.py in C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and in C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders, then try modifying your script this way. Appending the project folder to sys.path makes the GetAdUrlsFromIndex package importable no matter what the working directory is:
import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls
spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here