I need to test my Django application's behaviour under concurrent requests and verify that the database data is correct afterwards. In other words, I need to test the transaction mechanism, so TransactionTestCase seems like the right tool for that.
I spawned requests to the database using threading and got a 'DatabaseError: no such table: app_modelname' exception in the threads, because Django automatically switches SQLite to an in-memory database when running tests.
Of course, I can specify 'TEST_NAME' for the 'default' key in settings.DATABASES, and the test will pass as expected. But that setting applies to all tests too, so the whole test run takes much more time.
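For reference, the workaround being described is roughly this settings change (file names are illustrative; newer Django versions spell it TEST = {'NAME': ...} instead of TEST_NAME):

```python
# settings.py -- sketch of the on-disk test database workaround
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'db.sqlite3',
        # Giving the test database a file name stops Django from using
        # the in-memory SQLite database during tests.
        'TEST_NAME': 'test_db.sqlite3',
    }
}
```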
I thought about a custom database router, but that seems like a very hackish way to do it, since I would need to patch/mock a lot of attributes on the models to determine whether the code is running inside a TransactionTestCase or not.
There was also the idea of using override_settings, but unfortunately it does not work (see the issue for details).
How can I specify (or create) an on-disk (not in-memory) database for just a few test cases (TransactionTestCase) and let the others run against the in-memory database?
Any thoughts, ideas or code samples will be appreciated.
Thanks!
TL;DR:
How do I prevent DB access issues when calling a PostgreSQL database from multiple threads in Python using SQLAlchemy?
The details:
I am developing Python software that uses multithreading (a concurrent.futures thread pool), but I am by no means an expert in anything.
I also use SQLAlchemy to communicate with a PostgreSQL database (using pg8000).
Because I wanted to keep all my database stuff separate from all the rest, all the SQLAlchemy code sits in a Python module that I called db_manager.py. In here you will find the declarative base, the create_engine() call but also loads of methods to get stuff or store stuff in the database. Each method here ends with:
session.commit()
(unless I just query the database).
Each thread then would call the db_manager module to interact with the database, e.g.:
db_manager.getSomethingFromDB(...)
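One common way to avoid such clashes is to give each thread its own session. A minimal stand-in sketch using the standard library's threading.local (the FakeSession class and get_session name are mine; SQLAlchemy's scoped_session(sessionmaker(...)) provides this behaviour for real sessions):

```python
import threading

# Each thread lazily gets its own "session" object, mirroring what
# SQLAlchemy's scoped_session does with real sessions.
_local = threading.local()

class FakeSession:
    """Placeholder for a real SQLAlchemy session."""
    def __init__(self):
        self.owner = threading.current_thread().name

def get_session():
    # Create the session the first time this thread asks for one.
    if not hasattr(_local, "session"):
        _local.session = FakeSession()
    return _local.session
```

With SQLAlchemy itself, `Session = scoped_session(sessionmaker(bind=engine))` in db_manager.py would give every thread its own session without changing the module's public interface.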
I created a little drawing to illustrate the architecture.
The problem:
Now the problem I run into is that these database calls seem to clash sometimes.
What is the best way of dealing with multithreading, SQLAlchemy, and PostgreSQL?
Some ideas:
Currently, my db_manager accesses the PostgreSQL as a specific user (pg8000 appears to require this). Is that a problem? Should each thread be its own user? Or can this not be causing problems? If each thread needs to be its own database user, I would probably no longer be able to have all my database stuff in one single module.
I failed to define rollbacks for each commit. I just noticed this is causing problems: any error prevents any further database access.
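What I was missing is a pattern like the following, where every commit is paired with a rollback on failure (session here is any object with SQLAlchemy's commit()/rollback() interface):

```python
def safe_commit(session):
    """Commit, and roll back on failure so the session stays usable."""
    try:
        session.commit()
    except Exception:
        session.rollback()
        raise  # or log and continue, depending on your error policy
```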
We are trying to add test coverage to an old, big project which has more than 500 tables in the database, and creating the test database (more than an hour on my rMBP) and running the db migrations wastes too much time.
We are using PostgreSQL as the database, because some GIS-related services need it, so it's hard to replace it with SQLite.
What can I do to decrease the time spent on test preparation?
You can use django-nose and reuse the database like this:
REUSE_DB=1 ./manage.py test
Be careful that your tests do not leave any junk in the DB. Have a look at the documentation for more info.
At some point I ended up creating transaction management middleware that would intercept transaction calls so that all tests ran inside a transaction, which was then rolled back at the end.
Another alternative is to have a binary database dump that gets loaded at the beginning of each test, and then the database is dropped and recreated between tests. After creating a good database, use xtrabackup to create a dump of it. Then, in a per-test setup function, drop and create the database, then use xtrabackup to load the dump. Since it's a binary dump it'll load fairly quickly.
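A sketch of what that per-test restore helper might look like; the exact command lines are illustrative placeholders (xtrabackup invocations vary by version and setup), and the injectable run parameter is my own addition so the helper can be exercised without a real server:

```python
import subprocess

def restore_database(db_name, backup_dir, run=subprocess.run):
    """Drop and recreate the database, then reload it from a binary dump.

    The commands below are illustrative, not a verified xtrabackup
    recipe; adapt them to your server and tooling.
    """
    run(["mysqladmin", "drop", "--force", db_name], check=True)
    run(["mysqladmin", "create", db_name], check=True)
    run(["xtrabackup", "--copy-back", "--target-dir", backup_dir], check=True)
```

Calling this from a per-test setup function keeps each test starting from the same known-good snapshot.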
I'm using fixtures with SQLAlchemy to create some integration tests.
I'd like to put SQLAlchemy into a "never commit" mode to prevent changes ever being written to the database, so that my tests are completely isolated from each other. Is there a way to do this?
My initial thought is that perhaps I could replace Session.commit with a mock object; however, I'm not sure whether there are other things with the same effect that I would also need to mock if I go down that route.
The scoped session manager will, by default, return the same session object for each thread. Accordingly, one can replace .commit with .flush, and that change will persist across invocations of the session manager.
That will prevent commits.
To then rollback all changes, one should use session.transaction.rollback().
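A sketch of the idea with unittest.mock, using a stub class in place of a real SQLAlchemy session (the StubSession class is mine; with SQLAlchemy you would patch the scoped session's commit the same way):

```python
from unittest import mock

class StubSession:
    """Stand-in for a SQLAlchemy session: records what was called."""
    def __init__(self):
        self.calls = []
    def commit(self):
        self.calls.append("commit")
    def flush(self):
        self.calls.append("flush")

session = StubSession()

# Redirect commit to flush for the duration of a test, so changes are
# visible inside the transaction but never made permanent.
with mock.patch.object(StubSession, "commit", StubSession.flush):
    session.commit()   # actually flushes

print(session.calls)   # ["flush"]
```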
I want to log some information into MongoDB using Python. I found two libraries for Python, mongodblog and log4mongo. Any idea which one is better? Or is there another library that is better than these?
When you use MongoDB for logging, the concern is lock contention under high write throughput. Although MongoDB's insert is fire-and-forget style by default, calling insert() a lot causes heavy write lock contention. This can affect application performance and prevent readers from aggregating/filtering the stored logs.
One solution might be to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are supposed to be launched at every application node and take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems like MongoDB / PostgreSQL / etc. The writes are done in batches, so it's a lot more efficient than writing directly from apps. This link describes how to put logs into Fluentd from a Python program.
Fluentd: Data Import from Python Applications
Here's some tutorials about MongoDB + Fluentd.
Fluentd + MongoDB: The Easiest Way to Log Your Data Effectively on 10gen blog
Fluentd: Store Apache Logs into MongoDB
No need to use a logging library. Use pymongo and do the following:
create a different database from your application database (it can be on the same machine) to avoid problems with high write throughput hogging the lock that the rest of your application may need.
if there is a ton of logging to be done, consider using a capped collection
if you need to be analyzing the log as it occurs, write another script that uses a tailable cursor: http://docs.mongodb.org/manual/tutorial/create-tailable-cursor/
The upshot is that all of your logging needs can be taken care of with a few lines of code. Again, no need to complicate your code base by introducing extra dependencies when a bit of code will suffice.
As mentioned by other users here, it is quite simple to log directly using pymongo:
from pymongo import MongoClient
from pymongo import ASCENDING
import datetime

client = MongoClient()
db = client.my_logs
log_collection = db.log
log_collection.create_index([("timestamp", ASCENDING)])

def log(msg):
    """Log `msg` to the MongoDB log collection."""
    entry = {
        'timestamp': datetime.datetime.utcnow(),
        'msg': msg,
    }
    log_collection.insert_one(entry)

log('Log messages like this')
You might want to experiment by replacing the _id with the timestamp, just remember that _id has to be unique.
You can use Mongolog or Log4Mongo. Both of them have a log appender for the Python logging package. You can easily instantiate your log handler (the mongo log handler) and add it to your logger. The rest is handled out of the box. Both of them also support capped collections (which could be useful in the case of huge volumes of log records, especially junk logs).
Log4Mongo : https://pypi.org/project/log4mongo/
Github page : https://github.com/log4mongo/log4mongo-python
MongoLog : https://pypi.org/project/mongolog/#description
Github page : https://github.com/puentesarrin/mongodb-log
Here is the author of one of the libraries. I can't say much about the other library.
I agree that for really a lot of logs you should not necessarily use mongodb directly (the accepted answer does a fair job of explaining what should be used instead).
However, for medium size (medium in the sense of traffic and amount of logs) applications where a complex setup might be undesirable you can use BufferedMongoHandler, this logging class is designed to solve exactly that locking problem.
It does that by collecting the messages and writing them periodically instead of immediately. Take a look at the code; it is pretty straightforward.
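The core idea behind BufferedMongoHandler can be sketched with the standard library alone (the write_batch callback stands in for the MongoDB insert; the real handler also flushes on a timer, omitted here for brevity):

```python
import logging

class BufferedHandler(logging.Handler):
    """Collects log records and writes them out in batches."""
    def __init__(self, write_batch, capacity=100):
        super().__init__()
        self.write_batch = write_batch  # e.g. a MongoDB bulk insert
        self.capacity = capacity
        self.buffer = []

    def emit(self, record):
        self.buffer.append(self.format(record))
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        # One batched write instead of many single inserts.
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []
```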
IMO, if you already use mongodb, and you feel comfortable with it, it's an OK solution.
The project I'm working on is a business logic software wrapped up as a Python package. The idea is that various script or application will import it, initialize it, then use it.
It currently has a top level init() method that does the initialization and sets up various things, a good example is that it sets up SQLAlchemy with a db connection and stores the SA session for later access. It is being stored in a subpackage of my project (namely myproj.model.Session, so other code could get a working SA session after import'ing the model).
Long story short, this makes my package a stateful one. I'm writing unit tests for the project and this stateful behaviour poses some problems:
tests should be isolated, but the internal state of my package breaks this isolation
I cannot test the main init() method since its behavior depends on the state
future tests will need to be run against the (not yet written) controller part with a well known model state (eg. a pre-populated sqlite in-memory db)
Should I somehow refactor my package because the current structure is not the Best (possible) Practice(tm)? :)
Should I leave it at that and set up / tear down the whole thing every time? If I'm going to achieve complete isolation, that would mean fully erasing and re-populating the db for every single test; isn't that overkill?
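For context, the per-test setup/teardown I have in mind would look roughly like this (the schema is an invented example):

```python
import sqlite3
import unittest

class ModelTest(unittest.TestCase):
    def setUp(self):
        # A brand-new in-memory database for every test: full isolation,
        # and creation is typically cheap enough to do per-test.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        self.db.execute("INSERT INTO users (name) VALUES ('fixture-user')")

    def tearDown(self):
        self.db.close()

    def test_fixture_present(self):
        count = self.db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        self.assertEqual(count, 1)
```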
This question is really on the overall code & tests structure, but for what it's worth I'm using nose-1.0 for my tests. I know the Isolate plugin could probably help me but I'd like to get the code right before doing strange things in the test suite.
You have a few options:
Mock the database
There are a few trade offs to be aware of.
Your tests will become more complex, as you will have to do the setup, teardown, and mocking of the connection. You may also want to verify the SQL/commands sent. It also tends to create an odd sort of tight coupling, which may cause you to spend additional time maintaining/updating tests when the schema or SQL changes.
This is usually the purest form of test isolation, because it removes a potentially large dependency from testing. It also tends to make tests faster and reduces the overhead of automating the test suite in, say, a continuous integration environment.
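A sketch of the mocking option with unittest.mock (get_user_count and the DB-API style cursor interface are invented for illustration):

```python
from unittest import mock

def get_user_count(conn):
    """Code under test: counts rows via a DB-API style connection."""
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM users")
    return cur.fetchone()[0]

# The database is mocked out entirely: we script the cursor's answers
# and can afterwards verify exactly what SQL was sent.
conn = mock.Mock()
conn.cursor.return_value.fetchone.return_value = (42,)

assert get_user_count(conn) == 42
conn.cursor.return_value.execute.assert_called_once_with(
    "SELECT COUNT(*) FROM users")
```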
Recreate the DB with each Test
Trade offs to be aware of.
This can make your tests very slow, depending on how much time it actually takes to recreate your database. If the dev database server is a shared resource, there will have to be additional initial investment in making sure each dev has their own db on the server. The server may also become impacted depending on how often tests get run. There is additional overhead to running your test suite in a continuous integration environment, because it will need at least one db, possibly more (depending on how many branches are being built simultaneously).
The benefit has to do with actually running through the same code paths and similar resources that will be used in production. This usually helps to reveal bugs earlier which is always a very good thing.
ORM DB swap
If you're using an ORM like SQLAlchemy, there is a possibility that you can swap the underlying database for a potentially faster in-memory database. This allows you to mitigate some of the negatives of both the previous options.
It's not quite the same database as will be used in production, but the ORM should help mitigate the risk that the difference obscures a bug. Typically the time to set up an in-memory database is much shorter than for one which is file-backed. It also has the benefit of being isolated to the current test run, so you don't have to worry about shared resource management or final teardown/cleanup.
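A sketch of the swap using the standard library's sqlite3 so it runs anywhere (the function name and file name are illustrative; with SQLAlchemy the same switch is just a different create_engine() URL, e.g. 'sqlite://' for in-memory):

```python
import sqlite3

def make_connection(testing=False):
    """Return a DB connection: in-memory when testing, file-backed otherwise.

    In a real project the choice would come from configuration rather
    than a boolean flag.
    """
    return sqlite3.connect(":memory:" if testing else "app.db")

# Tests pass testing=True and get a fast, throwaway database; the ORM
# hides most of the remaining dialect differences.
conn = make_connection(testing=True)
conn.execute("CREATE TABLE t (x INTEGER)")
```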
Working on a project with a relatively expensive setup (IPython), I've seen an approach used where we call a get_ipython function, which sets up and returns an instance, while replacing itself with a function which returns a reference to the existing instance. Then every test can call the same function, but it only does the setup for the first one.
That saves doing a long setup procedure for every test, but occasionally it creates odd cases where a test fails or passes depending on what tests were run before. We have ways of dealing with that - a lot of the tests should do the same thing regardless of the state, and we can try to reset the object's state before certain tests. You might find a similar trade-off works for you.
Mock is a simple and powerful tool to achieve some isolation. There is a nice video from PyCon 2011 which shows how to use it. I recommend using it together with py.test, which reduces the amount of code required to define tests and is still very, very powerful.