Scrapy: using a separate test database for integration tests - python

I have a scrapy project that writes the data it scrapes to a database. It was based on this great tutorial: http://newcoder.io/scrape/part-3/
I have hit an issue now that I am trying to write some integration tests for the project. I am following the guidelines here: Scrapy Unit Testing
It's not clear to me how best to pass in the appropriate database settings. I'd like the tests to use their own database that I can ensure is in a known state before the tests start running.
So a plain import settings won't do the trick: when the project is run in test mode it needs to use a different settings file.
I am familiar with Ruby on Rails projects where you specify a RAILS_ENV environment variable, and based on this environment variable, the framework will use settings from different files. Is there a similar concept that can apply when testing scrapy projects? Or is there a more pythonic alternative approach?

In the end I edited the settings.py file to support using an environment variable to determine which additional files to get the settings from, like this:
from importlib import import_module
import logging
import os
SCRAPY_ENV = os.environ.get('SCRAPY_ENV', None)
if SCRAPY_ENV is None:
    raise ValueError("Must set SCRAPY_ENV environment var")

# Load if the file exists; incorporate any names starting with an
# uppercase letter into globals()
def load_extra_settings(fname):
    if not os.path.isfile("config/%s.py" % fname):
        logger = logging.getLogger(__name__)
        logger.warning("Couldn't find %s, skipping" % fname)
        return
    mdl = import_module("config.%s" % fname)
    names = [x for x in mdl.__dict__ if x[0].isupper()]
    globals().update({k: getattr(mdl, k) for k in names})

load_extra_settings("secrets")
load_extra_settings("secrets_%s" % SCRAPY_ENV)
load_extra_settings("settings_%s" % SCRAPY_ENV)
I made an example github repo showing how this worked: https://github.com/alanbuxton/scrapy_local_settings
Keen to find out if there is a neater way of doing it.
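For anyone doing something similar: the tests can then select the test database just by exporting the variable before the settings module is imported. A minimal sketch (assuming a 'test' environment whose config/settings_test.py points at the test database):

# Hedged sketch: set SCRAPY_ENV before Scrapy imports settings.py,
# so the test settings (and test database) get picked up.
import os
os.environ.setdefault('SCRAPY_ENV', 'test')

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# settings now includes anything loaded from config/settings_test.py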

Related

How to use different .env files with python-decouple

I am working on a Django project that I need to run with Docker. In this project I have multiple .env files: .env.dev, .env.prod, .env.staging. Is there a right way to manage all these files with the package python-decouple? I've searched for a workaround to deal with this challenge and haven't found any kind of answer, not even in the official documentation.
Can I use something like:
# doesn't work that way, it's just a dummy example
python manage.py runserver --env-file=.env.prod
or maybe some way to set or override which file it should use?
Instead of importing decouple.config and doing the usual config('SOME_ENV_VAR'), create a new decouple.Config object using RepositoryEnv('/path/to/.env.prod').
from decouple import Config, RepositoryEnv
DOTENV_FILE = '/home/user/my-project/.env.prod'
env_config = Config(RepositoryEnv(DOTENV_FILE))
# Use the Config.get() method as you normally would, since
# decouple.config uses that internally,
# i.e. config('SECRET_KEY') == env_config.get('SECRET_KEY')
SECRET_KEY = env_config.get('SECRET_KEY')
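To pick the file per environment rather than hard-coding the path, a small sketch along these lines should work (the DJANGO_ENV variable name here is just an assumption):

import os
from decouple import Config, RepositoryEnv

# Hypothetical selector variable; any name works as long as it is set
# in each environment (dev / staging / prod).
env_name = os.environ.get('DJANGO_ENV', 'dev')
env_config = Config(RepositoryEnv('.env.%s' % env_name))

SECRET_KEY = env_config.get('SECRET_KEY')
DEBUG = env_config.get('DEBUG', default=False, cast=bool)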

What is the best method for setting up a config file in Python

I realise this question has been asked before (What's the best practice using a settings file in Python?) but seeing as this was asked 7 years ago, I feel it is valid to discuss again given how technologies have evolved.
I have a python project that requires different configurations to be used based on the value of an environment variable. Since making use of the environment variable to choose a config file is simple enough, my question is as follows:
What format is seen as the best practice in the software industry for setting up a configuration file in python, when multiple configurations are needed based on the environment?
I realise that Python comes with a ConfigParser module, but I was wondering if it might be better to use a format such as YAML or JSON because of their rise in popularity due to their ease of use across languages. Which format is seen as easier to maintain when you have multiple configurations?
If you really want to use an environment-based YAML configuration, you could do so like this:
config.py
import os

import yaml

config = None
# The 'env' environment variable selects which config file to load
filename = os.getenv('env', 'default').lower()
script_dir = os.path.dirname(__file__)
abs_file_path = os.path.join(script_dir, filename)
with open(abs_file_path, 'r') as stream:
    try:
        # safe_load avoids constructing arbitrary Python objects
        config = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)
I think looking at the standard configuration for a Python Django settings module is a good example of this, since the Django web framework is extremely popular for commercial projects and is therefore representative of the software industry.
It doesn't get too fancy with JSON or YAML config files; it simply uses a Python module called settings.py that can be imported into any other module that needs access to the settings. Environment-variable-based settings are also defined there. Here is a link to an example settings.py file for Django on Github:
https://github.com/deis/example-python-django/blob/master/helloworld/settings.py
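As a rough sketch of that pattern (the environment variable names here are illustrative assumptions, not Django requirements):

# settings.py -- plain Python, so environment-specific values can be read
# straight from os.environ with sensible fallbacks.
import os

DEBUG = os.environ.get('DJANGO_DEBUG', 'false').lower() == 'true'
DATABASE_URL = os.environ.get('DATABASE_URL', 'sqlite:///db.sqlite3')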
This is really late to the party, but this is what I use and I'm pleased with it (if you're open to a pure-Python solution). I like it because my configurations can be set automatically based on where the code is deployed, using environment variables. I haven't been using this for long, so if someone sees an issue, I'm all ears.
Structure:
|-- settings
    |-- __init__.py
    |-- config.py
config.py
class Common(object):
    XYZ_API_KEY = 'AJSKDF234328942FJKDJ32'
    XYZ_API_SECRET = 'KDJFKJ234df234fFW3424##ewrFEWF'

class Local(Common):
    DB_URI = 'local/db/uri'
    DEBUG = True

class Production(Common):
    DB_URI = 'remote/db/uri'
    DEBUG = False

class Staging(Production):
    DEBUG = True
__init__.py
import os

from settings.config import Local, Production, Staging

config_space = os.getenv('CONFIG_SPACE', None)
if config_space:
    if config_space == 'LOCAL':
        auto_config = Local
    elif config_space == 'STAGING':
        auto_config = Staging
    elif config_space == 'PRODUCTION':
        auto_config = Production
    else:
        raise EnvironmentError(f'CONFIG_SPACE is unexpected value: {config_space}')
else:
    raise EnvironmentError('CONFIG_SPACE environment variable is not set!')
If my environment variable is set in each place where my app exists, I can bring this into my modules as needed:
from settings import auto_config as cfg
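For example, with CONFIG_SPACE=STAGING set, the class inheritance resolves like this:

from settings import auto_config as cfg

print(cfg.DB_URI)       # 'remote/db/uri' -- inherited from Production
print(cfg.DEBUG)        # True            -- overridden in Staging
print(cfg.XYZ_API_KEY)  # inherited from Common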
That really depends on your requirements, rather than on the format's popularity. For instance, if you just need simple key-value pairs, an INI file would be more than enough. As soon as you need complex structures (e.g., arrays or dictionaries), I'd go for JSON or YAML. JSON simply stores data (it's more intended for automated data flow between systems), while YAML is better for human-generated (or maintained, or read) files: it has comments, and you can reference values elsewhere in the file. On top of that, if you want robustness, flexibility, and a means to check the correct structure of the file (but don't care much about manual editing of the data), I'd go for XML.
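To make those last two YAML points concrete, here is a small sketch using PyYAML (comments and anchor/alias references are things plain JSON cannot express):

import yaml

doc = """
defaults: &defaults      # '&defaults' anchors this mapping
  host: db.example.com
  port: 5432

production:
  <<: *defaults          # merge the anchored values in here
  database: prod_db
"""
config = yaml.safe_load(doc)
assert config['production']['host'] == 'db.example.com'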
I recommend giving trapdoor a try for turn-key configuration (disclaimer: I'm the author of trapdoor).
I also like to take advantage of the fact that you do not have to compile Python source, so I use plain Python files for configuration. But in the real world you may have multiple environments, each requiring a different configuration, and you may also want to read some (mostly sensitive) information from env vars or from files that are not in source control (to prevent committing them by mistake).
That's why I wrote this library: https://github.com/davidohana/kofiko, which lets you use plain Python files for configuration, is also able to override those config settings from .ini files or env vars, and supports customization for different environments.
Blog post about it: https://medium.com/swlh/code-first-configuration-approach-for-python-f975469433b9

Good way to collect programmatically generated test suites in nose or pytest

Say I've got a test suite like this:
class SafeTests(unittest.TestCase):
    # snip 20 test functions

class BombTests(unittest.TestCase):
    # snip 10 different test cases
I am currently doing the following:
suite = unittest.TestSuite()
loader = unittest.TestLoader()

safetests = loader.loadTestsFromTestCase(SafeTests)
suite.addTests(safetests)

if TARGET != 'prod':
    unsafetests = loader.loadTestsFromTestCase(BombTests)
    suite.addTests(unsafetests)

unittest.TextTestRunner().run(suite)
I have one major problem, and one interesting point:

I would like to be using nose or py.test (doesn't really matter which).
I have a large number of different applications that are exposing these test suites via entry points.
I would like to be able to aggregate these custom tests across all installed applications, so I can't just use a clever naming convention. I don't particularly care about these being exposed through entry points, but I do care about being able to run tests across applications in site-packages. (Without just importing... every module.)
I do not care about maintaining the current dependency on unittest.TestCase; trashing that dependency is practically a goal.
EDIT This is to confirm that @Oleksiy's point about passing args to nose.run does in fact work, with some caveats.

Things that do not work:

Passing all the files that one wants to execute (which, weird)
Passing all the modules that one wants to execute. (This either executes nothing, the wrong thing, or too many things. Interesting case of 0, 1 or many, perhaps?)
Passing in the modules before the directories: the directories have to come first, or else you will get duplicate tests.

This fragility is absurd; if you've got ideas for improving it I welcome comments. I set up a github repo with my experiments trying to get this to work.
All that aside, the following works, including picking up multiple projects installed into site-packages:
#!/usr/bin/env python
import importlib, os, sys

import nose

def runtests():
    modnames = []
    dirs = set()
    for modname in sys.argv[1:]:
        modnames.append(modname)
        mod = importlib.import_module(modname)
        fname = mod.__file__
        dirs.add(os.path.dirname(fname))
    # directories must come before module names, or nose collects duplicates
    modnames = list(dirs) + modnames
    nose.run(argv=modnames)

if __name__ == '__main__':
    runtests()
which, if saved into a runtests.py file, does the right thing when run as:
runtests.py project.tests otherproject.tests
For nose you can keep both sets of tests in place and select which ones to run using the attrib plugin. I would keep both test cases and assign attributes to them:
from nose.plugins.attrib import attr

@attr("safe")
class SafeTests(unittest.TestCase):
    # snip 20 test functions

class BombTests(unittest.TestCase):
    # snip 10 different test cases
For your production code I would just call nose with nosetests -a safe, or set NOSE_ATTR=safe in your production test environment, or call the run method on the nose object to run it natively in Python, passing -a command-line options based on your TARGET:
import sys

import nose

if __name__ == '__main__':
    module_name = sys.modules[__name__].__file__
    argv = [sys.argv[0], module_name]
    if TARGET == 'prod':
        argv.extend(['-a', 'safe'])  # run only tests tagged @attr("safe")
    result = nose.run(argv=argv)
Finally, if for some reason your tests are not discovered, you can explicitly mark them as tests with the @istest decorator (from nose.tools import istest).
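A tiny sketch of that:

from nose.tools import istest

@istest
def ensure_the_bomb_is_disarmed():  # would not match nose's testMatch regex
    assert True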
This turned out to be a mess: nose pretty much exclusively uses the TestLoader.load_tests_from_names function (it's the only function tested in unit_tests/test_loader), so since I wanted to actually load things from an arbitrary Python object I seemed to need to write my own "figure out which load function to use" logic. Then, in addition, to correctly get things to work like the nosetests script, I needed to import a large number of things. I'm not at all certain that this is the best way to do things, not even kind of. But this is a stripped-down example (no error checking, less verbosity) that is working for me:
import logging
import sys
import types
import unittest

from nose.config import Config, all_config_files
from nose.core import run
from nose.loader import TestLoader
from nose.suite import ContextSuite
from nose.plugins.manager import PluginManager

from myapp import find_test_objects

log = logging.getLogger(__name__)

def load_tests(config, obj):
    """Load tests from an object.

    Requires an already configured nose.config.Config object.

    Returns a nose.suite.ContextSuite so that nose can actually give
    formatted output.
    """
    loader = TestLoader()
    kinds = [
        (unittest.TestCase, loader.loadTestsFromTestCase),
        (types.ModuleType, loader.loadTestsFromModule),
        (object, loader.loadTestsFromTestClass),
    ]
    tests = None
    for kind, load in kinds:  # a list of pairs, so no .items()
        if isinstance(obj, kind) or issubclass(obj, kind):
            log.debug("found tests for %s as %s", obj, kind)
            tests = load(obj)
            break
    return ContextSuite(tests=tests, context=obj, config=config)

def main():
    "Actually configure the nose config object and run the tests"
    config = Config(files=all_config_files(), plugins=PluginManager())
    config.configure(argv=sys.argv)
    tests = []
    for group in find_test_objects():
        tests.append(load_tests(config, group))
    run(suite=tests)
If your question is, "How do I get pytest to 'see' a test?", you'll need to prepend 'test_' to each test file and each test case (i.e. function). Then, just pass the directories you want to search on the pytest command line and it will recursively search for files that match 'test_XXX.py', collect the 'test_XXX' functions from them and run them.
As for the docs, you can try starting here.
If you don't like the default pytest test collection method, you can customize it using the directions here.
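For instance, a minimal file that pytest discovers with no extra configuration:

# test_math.py -- found because both the filename and the function
# name start with 'test_'
def test_addition():
    assert 1 + 1 == 2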
If you are willing to change your code to generate a py.test "suite" (my definition) instead of a unittest suite (the technical term), you may do so easily. Create a file called conftest.py like the following stub:
import pytest

def pytest_collect_file(parent, path):
    if path.basename == "foo":
        return MyFile(path, parent)

class MyFile(pytest.File):
    def collect(self):
        myname = "foo"
        yield MyItem(myname, self)
        yield MyItem(myname, self)

class MyItem(pytest.Item):
    SUCCEEDED = False

    def __init__(self, name, parent):
        super(MyItem, self).__init__(name, parent)

    def runtest(self):
        if not MyItem.SUCCEEDED:
            MyItem.SUCCEEDED = True
            print("good job, buddy")
            return
        else:
            print("you sucker, buddy")
            raise Exception()

    def repr_failure(self, excinfo):
        return ""
Here you will be generating/adding your code into your MyFile and MyItem classes (as opposed to the unittest.TestSuite and unittest.TestCase). I kept the naming convention of the MyFile class that way because it is intended to represent something that you read from a file, but of course you can basically decouple it (as I've done here). See here for an official example of that. The only limit is that, the way I've written this, foo must exist as a file; but you can decouple that too, e.g. by keying off conftest.py or whatever other file name exists in your tree (and only once, otherwise everything will run for each file that matches; and if you don't do the if path.basename test, for every file that exists in your tree!!!)
You can run this from the command line with
py.test -whatever -options
or programmatically from any code you wish with
import pytest
pytest.main(["-whatever", "-options"])  # newer pytest versions expect a list of args
The nice thing with py.test is that you unlock many very powerful plugins, such as the HTML report.
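For example, assuming the separately installed pytest-html plugin, a report can be generated with its --html flag:
py.test --html=report.html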

In the Pyramid web framework, how do I source sensitive settings into development.ini / production.ini from an external file?

I'd like to keep development.ini and production.ini under version control, but for security reasons would not want the sqlalchemy.url connection string to be stored, as it would contain the username and password used for the database connection.
What's the canonical way, in Pyramid, of sourcing this setting from an additional external file?
Edit
In addition to solution using the environment variable, I came up with this solution after asking around on #pyramid:
from configparser import ConfigParser  # ConfigParser module on Python 2

def main(global_config, **settings):
    """ This function returns a Pyramid WSGI application.
    """
    # Read db password from a config file outside of version control
    secret_cfg = ConfigParser()
    secret_cfg.read(settings['secrets'])
    dbpass = secret_cfg.get("secrets", "dbpass")
    settings['sqlalchemy.url'] = settings['connstr'] % (dbpass,)
I looked into this a lot and played with a lot of different approaches. However, Pyramid is so flexible, and the .ini config parser is so minimal in what it does for you, that there doesn't seem to be a de facto answer.
In my scenario, I tried having a production.example.ini in version control at first that got copied on the production server with the details filled in, but this got hairy, as updates to the example didn't get translated to the copy, and so the copy had to be re-created any time a change was made. Also, I started using Heroku, so files not in version control never made it into the deployment.
Then there's the encrypted config approach, and I don't like that paradigm. Imagine a sysadmin being responsible for maintaining the production environment, but he or she is unable to change the location of a database or an environment-specific setting without running it back through version control. It's really nice to keep as much separation between environment and code as possible, so those changes can be made on the fly without version control revisions.
My ultimate solution was to have some values that looked like this:
[app:main]
sqlalchemy.url = ${SQLALCHEMY_URL}
Then, on the production server, I would set the environment variable SQLALCHEMY_URL to point to the database. This even allowed me to use the same configuration file for staging and production, which is nice.
In my Pyramid init, I just expanded the environment variable value using os.path.expandvars:
sqlalchemy_url = os.path.expandvars(settings.get('sqlalchemy.url'))
engine = create_engine(sqlalchemy_url)
And, if you want to get fancy with it and automatically replace all the environment variables in your settings dictionary, I made this little helper method for my projects:
def expandvars_dict(settings):
    """Expands all environment variables in a settings dictionary."""
    return {key: os.path.expandvars(value)
            for key, value in settings.items()}  # iteritems() on Python 2
Use it like this in your main app entry point:
settings = expandvars_dict(settings)
The whole point of the separate ini files in Pyramid is that you do not have to version control all of them and that they can contain different settings for different scenarios (development/production/testing). Your production.ini almost always should not be in the same VCS as your source code.
I found this way of loading secrets from an extra configuration file and from the environment:
from os import path

from paste.deploy import appconfig
from pyramid.config import Configurator

__all__ = ["main"]

def _load_secrets(global_config, settings):
    """ Helper to load secrets from a secrets config and
    from the env (in that order).
    """
    if "drawstack.secrets" in settings:
        secrets_config = appconfig(
            'config:' + settings["drawstack.secrets"],
            relative_to=path.dirname(global_config['__file__']))
        for k, v in secrets_config.items():
            if k == "here" or k == "__file__":
                continue
            settings[k] = v
    if "ENV_DB_URL" in global_config:
        settings["sqlalchemy.url"] = global_config["ENV_DB_URL"]

def main(global_config, **settings):
    """ This function returns a Pyramid WSGI application.
    """
    _load_secrets(global_config, settings)
    config = Configurator(settings=settings)
    config.include('pyramid_jinja2')
    config.include('.models')
    config.include('.routes')
    config.scan()
    return config.make_wsgi_app()
The code above will load any variables from the value of the config key drawstack.secrets, and after that it tries to load the database URL from ENV_DB_URL.
drawstack.secrets can be relative to the original config file OR absolute.
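For illustration, the wiring in the main config file could look something like this (a sketch; the egg name is made up):

[app:main]
use = egg:drawstack
drawstack.secrets = secrets.ini   ; relative to this file, or an absolute path

Here secrets.ini is resolved relative to the directory of the main config file, and should itself stay out of version control.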

Running scripts within Pyramid framework (ie without a server)

I have a fair bit of experience with PHP frameworks and Python for scripting so am now taking the step to Pyramid.
I'd like to know what is the 'correct' way to run a script in Pyramid. That is, how should I set it up so that it is part of the application and has access to config and thus database but does not run through paster (or whatever WSGI).
As an example, say I have a web application which while a user is offline grabs Facebook updates through a web service. I want to write a script to poll that service and store in the database ready for next login.
How should I do this in terms of:
Adding variables in the ini file
Starting the script correctly
I understand the basics of Python modules and packages; however I don't fully understand Configurator/Paster/package setup, wherein I suspect the answer lies.
Thanks
Update:
Thanks, this seems along the lines of what I am looking for. I note that you have to follow a certain structure (e.g. have summary and parser attributes set) and that the function called command() will always be run. My test code now looks something like this:
class AwesomeCommand(Command):
    max_args = 2
    min_args = 2
    usage = "NAME"
    # These are required
    summary = "Say hello!"
    group_name = "My Package Name"
    # Required:
    parser = Command.standard_parser(verbose=True)

    def command(self):
        # Load the config file/section
        config_file, section_name = self.args
        # What next?
I'm now stuck as to how to get the settings themselves. For example, in __init__.py you can do this:
engine = engine_from_config(settings, 'sqlalchemy.')
What do I need to do to transform the config file into the settings?
EDIT: The (simpler) way to do this in Pylons is here:
Run Pylons controller as separate app?
As of Pyramid 1.1, this is handled by the framework:
http://docs.pylonsproject.org/projects/pyramid/en/latest/narr/commandline.html#writing-a-script
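In short, that page describes pyramid.paster.bootstrap; a minimal sketch of a script using it:

# Minimal sketch of the Pyramid 1.1+ approach: bootstrap() loads the app
# described by an ini file and hands back its settings, registry, etc.
from pyramid.paster import bootstrap

env = bootstrap('development.ini')
settings = env['registry'].settings    # e.g. settings['sqlalchemy.url']
# ... do the offline work here ...
env['closer']()                        # tear down when finished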
paster starts an application given an ini file that describes that application. The "serve" command is a built-in command for starting a WSGI application and serving it. BUT, you can write other commands.
from paste.script.command import Command

class AwesomeCommand(Command):
    def command(self):
        print("the awesome thing it does")
and then register them as entry points in your setup.py.
setup(
    ...
    entry_points="""
    [paste.app_factory]
    .....
    [paste.global_paster_command]
    myawesome-command = mypackage.path.to.command:AwesomeCommand
    """)
Pyramid adds its own commands this way, like the pshell command.
After asking on the pylons-discuss list, I came up with an answer. Hope this helps somebody:
# Bring in the pyramid application -----------------------------------
import os

from paste.deploy import appconfig

config_file = '/path_to_config_file/configname.ini'
name = 'app_name'
config_name = 'config:%s' % config_file
here_dir = os.getcwd()
conf = appconfig(config_name, name, relative_to=here_dir)

from main_package import main
app = main(conf.global_conf, **conf.local_conf)
# ---------------------------------------------------------------------
You need to make a view for that action and then run it using:
paster request development.ini /url_to_your_view
