I want to be able to run the Scrapy web crawling framework from within Django. Scrapy itself only provides a command line tool, scrapy, to execute its commands; the tool was not written with the intention of being called from an external program.
The user Mikhail Korobov came up with a nice solution, namely to call Scrapy from a Django custom management command. For convenience, I repeat his solution here:
# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py

from __future__ import absolute_import
from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        return super(Command, self).run_from_argv(argv)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
Instead of calling e.g. scrapy crawl domain.com I can now do python manage.py scrapy crawl domain.com from within a Django project. However, the options of a Scrapy command are not parsed at all. If I do python manage.py scrapy crawl domain.com -o scraped_data.json -t json, I only get the following response:
Usage: manage.py scrapy [options]
manage.py: error: no such option: -o
So my question is: how can I extend the custom management command so that it accepts Scrapy's command line options?
Unfortunately, Django's documentation of this part is not very extensive. I've also read the documentation of Python's optparse module, but afterwards things were no clearer to me. Can anyone help me here? Thanks a lot in advance!
Okay, I have found a solution to my problem. It's a bit ugly but it works. Since the Django project's manage.py command does not accept Scrapy's command line options, I split the options string into two arguments which are accepted by manage.py. After successful parsing, I rejoin the two arguments and pass them to Scrapy.
That is, instead of writing
python manage.py scrapy crawl domain.com -o scraped_data.json -t json
I put spaces in between the options like this
python manage.py scrapy crawl domain.com - o scraped_data.json - t json
My handle function looks like this:
def handle(self, *args, **options):
    arguments = self._argv[1:]
    for arg in arguments:
        if arg in ('-', '--'):
            i = arguments.index(arg)
            new_arg = ''.join((arguments[i], arguments[i + 1]))
            del arguments[i:i + 2]
            arguments.insert(i, new_arg)
    from scrapy.cmdline import execute
    execute(arguments)
Meanwhile, Mikhail Korobov has provided the optimal solution. See here:
# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py

from __future__ import absolute_import
from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
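With this version the arguments are never run through Django's option parser at all (run_from_argv stores them and calls self.execute() directly), so a call like

python manage.py scrapy crawl domain.com -o scraped_data.json -t json

should pass the Scrapy options straight through.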
I think you're really looking for Guideline 10 of the POSIX argument syntax conventions:
The argument -- should be accepted as a delimiter indicating the end of options.
Any following arguments should be treated as operands, even if they begin with
the '-' character. The -- argument should not be used as an option or as an operand.
Python's optparse module behaves this way, even under Windows.
I put the Scrapy project settings module in the argument list, so I can create separate Scrapy projects in independent apps:
# <app>/management/commands/scrapy.py

from __future__ import absolute_import
import os

from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):
        os.environ['SCRAPY_SETTINGS_MODULE'] = args[0]
        from scrapy.cmdline import execute
        # scrapy ignores args[0], requires a mutable seq
        execute(list(args))
Invoked as follows:
python manage.py scrapy myapp.scrapyproj.settings crawl domain.com -- -o scraped_data.json -t json
Tested with Scrapy 0.12 and Django 1.3.1.
I'm working on a project where I would like to run the same script with two different software APIs.
What I have:
- One module for each software, with the same class and method names in each.
- One construction script where I need to call these classes and methods.
I would like to avoid duplicating the construction code and instead run the same bit of code just by changing the imported module.
Example:
first_software_module.py

import first_software_api

class Box:
    def __init__(self, x, y, z):
        first_software_api.makeBox()

second_software_module.py

import second_software_api

class Box:
    def __init__(self, x, y, z):
        second_software_api.makeBox()

construction.py

first_box = Box(1, 2, 3)
And I would like to run construction.py with the first module, then with the second module.
I tried imports and execfile, but none of these solutions seems to work.
What I would like to do:
import first_software_module
run construction.py
import second_software_module
run construction.py
You could try passing a command line argument to construction.py.
construction.py

import sys

if len(sys.argv) != 2:
    sys.stderr.write('Usage: python3 construction.py <module>\n')
    exit(1)

# import Box from whichever module was requested, so the construction
# code below stays the same
if sys.argv[1] == 'first_software_module':
    from first_software_module import Box
elif sys.argv[1] == 'second_software_module':
    from second_software_module import Box

box = Box(1, 2, 3)
You could then call construction.py with each import type from a shell script, say main.sh.
main.sh
#! /bin/bash
python3 construction.py first_software_module
python3 construction.py second_software_module
Make the shell script executable using chmod +x main.sh. Run it as ./main.sh.
Alternatively, if you do not want to use a shell script, and want to do it in pure Python, you could do the following:
main.py
import subprocess
subprocess.run(['python3', 'construction.py', 'first_software_module'])
subprocess.run(['python3', 'construction.py', 'second_software_module'])
and run main.py as you normally would using python3 main.py.
You can pass a command-line argument that will tell your script which module to import. There are many ways to do this, but I'm going to demonstrate with the argparse module:
import argparse

parser = argparse.ArgumentParser(description='Run the construction')
parser.add_argument('--module', nargs=1, type=str, required=True,
                    choices=['module1', 'module2'],
                    help='The module to use for the construction')
args = parser.parse_args()
Now, args.module will contain the contents of the argument you passed. Use this string with an if-elif ladder (or the match-case syntax in 3.10+) to import the correct module and alias it as (let's say) driver:
if args.module[0] == "module1":
    import first_software_api as driver
    print("Using first_software_api")
elif args.module[0] == "module2":
    import second_software_api as driver
    print("Using second_software_api")
Then, use driver in your Box class:
class Box:
    def __init__(self, x, y, z):
        driver.makeBox()
Say we had this in a file called construct.py. Running python3 construct.py --help gives:
usage: construct.py [-h] --module {module1,module2}

Run the construction

optional arguments:
  -h, --help            show this help message and exit
  --module {module1,module2}
                        The module to use for the construction
Running python3 construct.py --module module1 gives:
Using first_software_api
I have a Scrapy spider that imports a dictionary from another module. My main contentspider.py, which contains the spider, also contains the import statement from spider_project.spider_project.updated_kw import translated_kw_dicts:
from spider_project.spider_project.updated_kw import translated_kw_dicts

class ContentSpider(CrawlSpider):
    name = 'content_spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(translated_kw_dicts)
This import statement works fine when the spider is run as a script, but when I run it via conda I get an error:
from spider_project.spider_project.updated_kw import translated_kw_dicts
ModuleNotFoundError: No module named 'spider_project.spider_project'
From conda I am running the spider as it should be run, from the directory containing the .cfg file: c:/.../.../spider_project
If I change the import statement to from spider_project.updated_kw import translated_kw_dicts (notice I'm taking out the first spider_project directory), the spider runs fine via conda, but in my script I get the error Cannot find reference 'updated_kw' in 'imported module spider_project'.
Can someone advise why this is happening?
Here is my project structure:
When you click on the long line at the top of the file explorer, it should become active. You can copy it and use the file-reading technique.
I'm trying to print to console before and after processing that takes a while in a Django management command, like this:
import requests
import xmltodict

from django.core.management.base import BaseCommand


def get_all_routes():
    url = 'http://busopen.jeju.go.kr/OpenAPI/service/bis/Bus'
    r = requests.get(url)
    data = xmltodict.parse(r.content)
    return data['response']['body']['items']['item']


class Command(BaseCommand):
    help = 'Updates the database via Bus Info API'

    def handle(self, *args, **options):
        self.stdout.write('Saving routes ... ', ending='')
        for route in get_all_routes():
            route_obj = Route(
                route_type=route['routeTp'],
                route_id=route['routeId'],
                route_number=route['routeNum'],
            )
            route_obj.save()
        self.stdout.write('done.')
In the above code, Saving routes ... is expected to print before the loop begins, and done. right next to it when the loop completes so that it looks like Saving routes ... done. in the end.
However, the former doesn't print until the loop completes, when both strings finally print at the same time, which is not what I expected.
I found this question, where the answer suggests flushing the output i.e. self.stdout.flush(), so I added that to my code:
def handle(self, *args, **options):
    self.stdout.write('Saving routes ... ', ending='')
    self.stdout.flush()
    for route in get_all_routes():
        route_obj = Route(
            route_type=route['routeTp'],
            route_id=route['routeId'],
            route_number=route['routeNum'],
        )
        route_obj.save()
    self.stdout.write('done.')
Still, the result remains unchanged.
What could I have done wrong?
The thing to keep in mind is that you're using self.stdout (as suggested in the Django docs), which is BaseCommand's override of Python's standard sys.stdout. There are two main differences between the two that are relevant to your problem:
The default "ending" in BaseCommand's version of self.stdout.write() is a new-line, forcing you to use the ending='' parameter, unlike sys.stdout.write() that has an empty ending as the default. This in itself is not causing your problem.
The BaseCommand version of flush() does not really do anything (who would have thought?). This is a known bug: https://code.djangoproject.com/ticket/29533
So you really have 2 options:
Don't use BaseCommand's self.stdout; use sys.stdout instead, in which case the flush does work (see the sketch below)
Force the stdout to be totally unbuffered while running the management command by passing the "-u" parameter to python. So instead of running python manage.py <subcommand>, run python -u manage.py <subcommand>
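A minimal sketch of the first option, keeping the structure of the handle() method from the question (the route-saving loop is elided):

import sys

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'Updates the database via Bus Info API'

    def handle(self, *args, **options):
        sys.stdout.write('Saving routes ... ')
        sys.stdout.flush()  # unlike self.stdout.flush(), this really flushes
        # ... save the routes here, as in the question ...
        sys.stdout.write('done.\n')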
Hope this helps.
Have you tried setting the PYTHONUNBUFFERED environment variable?
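For example (assuming a Unix-like shell), running

PYTHONUNBUFFERED=1 python manage.py <subcommand>

should have the same effect as passing -u to python.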
Is it allowed to group custom Django commands to separate folders inside the same Django app?
I have a lot of them and wanted to group them logically by purpose. I created folders, but Django can't find the commands inside them.
Maybe I'm trying to run them wrong. Tried:
python manage.py process_A_related_data
the same plus imported all commands in __init__.py
python manage.py folderA process_A_related_data
python manage.py folderA.process_A_related_data
python manage.py folderA/process_A_related_data
Got following error:
Unknown command: 'folderA/process_A_related_data'
Type 'manage.py help' for usage.
I think you can create a basic custom command which will run the other commands from the relevant folders. Here is an approach you can take:
First make a folder structure like this:
management/
    commands/
        folder_a/
            process_A_related_data.py
        folder_b/
            process_A_related_data.py
        process_data.py
Then inside process_data.py, update the command like this:
from django.core import management
from django.core.management.base import BaseCommand
import importlib


class Command(BaseCommand):
    help = 'Folder Process Commands'

    def add_arguments(self, parser):
        parser.add_argument('-u', '--use', type=str, nargs='?',
                            default='folder_a.process_A_related_data')

    def handle(self, *args, **options):
        try:
            folder_file_module = options['use'] if options['use'].startswith('.') else '.' + options['use']
            command = importlib.import_module(folder_file_module, package='your_app.management.commands')
            management.call_command(command.Command())
        except ModuleNotFoundError as e:
            self.stderr.write(f"No relevant folder found: {e.name}")
Here I am using the call_command method to call the other management commands.
Then run commands like this:
python manage.py process_data --use folder_a.process_A_related_data
Finally, if you want to run commands like python manage.py folder_a.process_A_related_data, then you probably need to change manage.py, like this:
import re
...

try:
    from django.core.management import execute_from_command_line
except ImportError as exc:
    raise ImportError(
        "Couldn't import Django. Are you sure it's installed and "
        "available on your PYTHONPATH environment variable? Did you "
        "forget to activate a virtual environment?"
    ) from exc

if re.search('folder_[a-z].*', sys.argv[-1]):
    new_arguments = sys.argv[:-1] + ['process_data', '--use', sys.argv[-1]]
    execute_from_command_line(new_arguments)
else:
    execute_from_command_line(sys.argv)
You should be able to partition the code by using mixins (I have not tried this in this context, though).
A standard management command looks like:
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'FIXME A helpful comment goes here'

    def add_arguments(self, parser):
        parser.add_argument('name', ...)
        # more argument definitions

    def handle(self, *args, **options):
        # do stuff
Which can probably be replaced by a "stub" in app/management/commands:
from wherever.commands import FooCommandMixin
from django.core.management.base import BaseCommand

class Command(FooCommandMixin, BaseCommand):
    # autogenerated -- do not put any code in here!
    pass
and in wherever/commands
class FooCommandMixin(object):
    help = 'FIXME A helpful comment goes here'

    def add_arguments(self, parser):
        parser.add_argument('name', ...)
        # more argument definitions

    def handle(self, *args, **options):
        # do the work
It would not be hard to write a script that goes through a list of file names or paths (using glob.glob), uses re.findall to identify the appropriate class declarations, and (re)generates a matching stub for each in the app's management/commands folder.
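A rough, untested sketch of such a generator (the source and output paths here are my assumptions, not part of the answer):

import glob
import os
import re

# matches e.g. "class FooCommandMixin(object):" and captures "Foo"
MIXIN_RE = re.compile(r'class\s+(\w+)CommandMixin\b')

STUB = (
    "from wherever.commands import {mixin}CommandMixin\n"
    "from django.core.management.base import BaseCommand\n"
    "\n"
    "\n"
    "class Command({mixin}CommandMixin, BaseCommand):\n"
    "    # autogenerated -- do not put any code in here!\n"
    "    pass\n"
)

for path in glob.glob('wherever/commands/*.py'):
    with open(path) as f:
        source = f.read()
    for mixin in MIXIN_RE.findall(source):
        # FooCommandMixin becomes the management command "foo"
        stub_path = os.path.join('app', 'management', 'commands', mixin.lower() + '.py')
        with open(stub_path, 'w') as out:
            out.write(STUB.format(mixin=mixin))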
Also (or instead), Python's argparse allows for the definition of sub-commands, so you should be able to define a command that works like
./manage.py foo bar --aa --bb something --cc and
./manage.py foo baz --bazzy a b c
where the syntax after foo is determined by the next word (bar or baz or ...). Again I have no experience of using subcommands in this context.
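An untested sketch of what that could look like; the file location and option meanings are made up for illustration, and it assumes a Django/Python version where argparse sub-parsers work inside a management command (roughly Django 2.1+ and Python 3.7+ for required=True):

# app/management/commands/foo.py
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = 'foo command with bar/baz sub-commands'

    def add_arguments(self, parser):
        subparsers = parser.add_subparsers(dest='subcommand', required=True)

        bar = subparsers.add_parser('bar')
        bar.add_argument('--aa', action='store_true')
        bar.add_argument('--bb')
        bar.add_argument('--cc', action='store_true')

        baz = subparsers.add_parser('baz')
        baz.add_argument('--bazzy', action='store_true')
        baz.add_argument('operands', nargs='*')

    def handle(self, *args, **options):
        # dispatch on the sub-command chosen on the command line
        if options['subcommand'] == 'bar':
            self.stdout.write('bar called with --bb=%s' % options['bb'])
        else:
            self.stdout.write('baz called with operands %s' % options['operands'])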
I found no mention of support for this feature in the release notes. It looks like this is still not supported as of Django 3.0. I would suggest that you use meaningful file names that help you tell the commands apart. You could always come up with a naming convention!
A workaround could be: create a specific Django "satellite" app for each group of management commands.
In recent versions of Django, the requirements for a Python module to be an app are minimal: you won't need to provide any fake models.py or other specific files as in the old days.
While far from perfect from a stylistic point of view, you still gain a few advantages:
no need to hack the framework at all
python manage.py will list the commands grouped by app
you can control the grouping by providing suitable names to the apps
you can use these satellite apps as container for specific unit tests
I always try to avoid fighting against the framework, even when this means compromising and sometimes accepting its occasional design limitations.
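As a rough illustration (the app and command names here are made up), the project layout could look like:

project/
    app_a_commands/          # minimal "satellite" app, listed in INSTALLED_APPS
        __init__.py
        management/
            __init__.py
            commands/
                __init__.py
                process_A_related_data.py
    app_b_commands/
        __init__.py
        management/
            __init__.py
            commands/
                __init__.py
                process_B_related_data.py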
Here is the current state of my twistd plugin, which is located in project_root/twisted/plugins/my_plugin.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from zope.interface import implements

from twisted.plugin import IPlugin
from twisted.python.usage import Options
from twisted.application import internet, service

from mylib.io import MyFactory


class Options(Options):
    """Flags and options for the plugin."""
    optParameters = [
        ('sock', 's', '/tmp/io.sock', 'Path to IO socket'),
    ]


class MyServiceMaker(object):
    implements(service.IServiceMaker, IPlugin)

    tapname = "myplugin"
    description = "description for my plugin"
    options = Options

    def makeService(self, options):
        return internet.UNIXServer(options['sock'], MyFactory())
There is no __init__.py file in project_root/twisted/plugins/
The output of twistd doesn't show my plugin when run from the project's root directory
I installed my library via python setup.py develop --user and it is importable from anywhere
Any ideas?
As suspected, it was something very simple: I needed to create an instance of MyServiceMaker, so simply adding service_maker = MyServiceMaker() at the bottom of the script fixed the issue.
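That is, the bottom of project_root/twisted/plugins/my_plugin.py now ends with something like:

# module-level instance; twisted.plugin discovers the IPlugin provider through it
service_maker = MyServiceMaker()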