We are trying to get test coverage on an old, big project that has more than 500 tables in its database, and creating the test database (more than 1 hour on my RMBP) and running the migrations wastes too much time.
We are using PostgreSQL as the database because some GIS-related services need it, so it's hard to replace it with SQLite.
What can I do to decrease the time spent on test preparation?
You can use django-nose and reuse the database like this:
REUSE_DB=1 ./manage.py test
Be careful that your tests do not leave any junk in the DB. Have a look at the documentation for more info.
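In case it helps, the wiring is just a couple of settings. A minimal sketch (the TEST_RUNNER value is django-nose's documented runner; everything else is assumed):

# settings.py
INSTALLED_APPS = [
    # ... your existing apps ...
    'django_nose',
]

# Use the django-nose test runner so REUSE_DB=1 is honoured.
TEST_RUNNER = 'django_nose.NoseTestSuiteRunner'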
At some point I ended up creating a transaction management middleware that would intercept transaction calls so that all tests were run in a transaction, and then the transaction was rolled back at the end.
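A rough sketch of that idea without the middleware, using django.db.transaction directly (the class and attribute names are illustrative, and it assumes a recent Django where SimpleTestCase accepts a databases attribute):

from django.db import transaction
from django.test import SimpleTestCase

class RollbackTestCase(SimpleTestCase):
    # Allow database access from this SimpleTestCase subclass.
    databases = {'default'}

    def setUp(self):
        # Open an atomic block; everything the test writes stays inside it.
        self._atomic = transaction.atomic()
        self._atomic.__enter__()

    def tearDown(self):
        # Mark the block for rollback instead of commit, so no junk is left in the DB.
        transaction.set_rollback(True)
        self._atomic.__exit__(None, None, None)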
Another alternative is to have a binary database dump that gets loaded at the beginning of each test, and then the database is dropped and recreated between tests. After creating a good database, use xtrabackup to create a dump of it. Then, in a per-test setup function, drop and create the database, then use xtrabackup to load the dump. Since it's a binary dump it'll load fairly quickly.
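Since your database is PostgreSQL and xtrabackup is a MySQL tool, the analogous sketch for Postgres would use dropdb/createdb and a binary pg_dump/pg_restore; the database name and dump path below are made up:

import subprocess
import unittest

TEST_DB = 'myproject_test'           # hypothetical test database name
DUMP_FILE = '/tmp/myproject.dump'    # created once with: pg_dump -Fc myproject_test -f /tmp/myproject.dump

class DumpRestoreTestCase(unittest.TestCase):
    def setUp(self):
        # Drop and recreate the test database, then load the binary dump.
        subprocess.run(['dropdb', '--if-exists', TEST_DB], check=True)
        subprocess.run(['createdb', TEST_DB], check=True)
        subprocess.run(['pg_restore', '-d', TEST_DB, DUMP_FILE], check=True)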
I'm using Prefect to automate my flows (Python scripts). Once they are running, some data gets persisted to a PostgreSQL database. The problem is that the size of pg_data rapidly gets out of hand (~20 GB), and I was wondering if there is a way to reduce the amount of data stored in pg_data when running an agent, or a way to automatically clean the directory.
Thanks in advance for your help,
best,
Christian
I assume you are running Prefect Server and you want to clean up the underlying database instance to save space? If so, there are a couple of ways you can clean up the Postgres database:
you can manually delete old records, especially logs from the flow run table using DELETE FROM in SQL,
you can do the same in an automated fashion, e.g. some users have an actual flow that runs on a schedule and purges old data from the database (see the sketch after this list),
alternatively, you can use the open-source pg_cron job scheduler for Postgres to schedule such DB administration tasks,
you can also do the same using GraphQL: you would need to query for flow run IDs of "old" flow runs using the flow_run query, and then execute delete_flow_run mutation,
lastly, to be more proactive, you can reduce the number of logs you generate by generally logging less (only logging what's needed) and setting the log level to a less verbose category, e.g. switching your agent from DEBUG to INFO should significantly reduce the amount of space consumed by logs in the database.
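As a rough sketch of that scheduled-purge flow, assuming psycopg2 and Prefect Server's flow_run and log tables (the column names, connection details, and retention window are assumptions to adjust for your setup):

import psycopg2
from prefect import Flow, task

@task
def purge_old_records(days: int = 30):
    # Placeholder connection details; point this at Prefect Server's Postgres instance.
    conn = psycopg2.connect(host='localhost', dbname='prefect_server',
                            user='prefect', password='secret')
    with conn, conn.cursor() as cur:
        # Delete logs first, then the flow runs they belong to.
        cur.execute("DELETE FROM log WHERE timestamp < now() - %s * interval '1 day'", (days,))
        cur.execute("DELETE FROM flow_run WHERE created < now() - %s * interval '1 day'", (days,))
    conn.close()

with Flow("purge-old-prefect-data") as flow:
    purge_old_records(days=30)

# Register the flow with a daily schedule so the cleanup runs automatically.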
I have an application which was running very quickly. Let's say it took 10 seconds to run. All it does is read a CSV, parse it, and store some of the info in SQLAlchemy objects which are written to the database. (We never attempt to read the database, only to write.)
After adding a many-to-many relationship to the entity we are building, and relating it to an address entity which we now also build, the time to process the file has increased by an order of magnitude. We are doing very little additional work: just instantiating an address and storing it in the relationship collection on our entity using append.
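For context, the pattern looks roughly like this (model and column names here are made up for illustration, using SQLAlchemy 1.4-style imports, not the real code):

from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Association table backing the many-to-many relationship.
entity_address = Table(
    'entity_address', Base.metadata,
    Column('entity_id', ForeignKey('entity.id'), primary_key=True),
    Column('address_id', ForeignKey('address.id'), primary_key=True),
)

class Address(Base):
    __tablename__ = 'address'
    id = Column(Integer, primary_key=True)
    line1 = Column(String)

class Entity(Base):
    __tablename__ = 'entity'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    addresses = relationship('Address', secondary=entity_address)

# Per CSV row we only ever do something like:
#   entity = Entity(name=row['name'])
#   entity.addresses.append(Address(line1=row['address']))
#   session.add(entity)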
Most of the time appears to be lost in _load_for_state, as can be seen in the attached profiling screenshot.
I'm pretty sure this is unnecessarily lost time, because it looks like it is trying to do some loading even though we never query the database (we always instantiate new objects and save them in this app).
Does anyone have an idea how to optimize SQLAlchemy here?
Update
I tried setting SQLALCHEMY_ECHO = True just to see if it is doing a bunch of database reads, or maybe some extra writes. Bizarrely, it only accesses the database at the same times it did before (following a db.session.commit()). I'm pretty sure all this extra time is not being spent on database access.
My colleague runs a script that pulls data from the DB periodically. He is using this query:
'SELECT url, data FROM table LIMIT {} OFFSET {}'.format(OFFSET, PAGE * OFFSET)
We use Amazon Aurora and he has his own replica (slave) server, but every time he runs it, utilization touches 98%+.
The table has millions of records.
Would it be better to go for a SQL dump instead of SQL queries for fetching the data?
The options that come to my mind are:
SQL dump of selected tables (not sure about the benchmark)
federated tables based on a certain reference (date, ID, etc.)
Thanks
I'm making some fairly big assumptions here, but from "without choking it" I'm guessing you mean that when your colleague runs the SELECT to grab the large amount of data, the database performance drops for all other operations - presumably your primary application - while the data is being prepared for export.
You mentioned SQL dump, so I'm also assuming that this colleague will be satisfied with data that is roughly correct, i.e. it doesn't have to be up-to-the-instant, transactionally correct data. Just good enough for something like analytics work.
If those assumptions are close, your colleague and your database might benefit from
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
This statement should be used carefully, and almost never in a line-of-business application, but it can help people querying the live database with big queries, as long as they fully understand the implications.
To use it, simply start a transaction and put this line before any queries you run.
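In Python, that could look roughly like this (a sketch assuming PyMySQL; the connection details are placeholders and PAGE/OFFSET are the variables from your colleague's script):

import pymysql

PAGE = 0         # which page to fetch
OFFSET = 10000   # page size, named OFFSET in the original script

conn = pymysql.connect(host='replica-host', user='reader',
                       password='secret', database='mydb')
try:
    with conn.cursor() as cur:
        # Applies to the next transaction, so run it right before the big SELECT.
        cur.execute("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
        cur.execute("SELECT url, data FROM table LIMIT %s OFFSET %s",
                    (OFFSET, PAGE * OFFSET))
        rows = cur.fetchall()
finally:
    conn.rollback()   # nothing to commit; just end the transaction cleanly
    conn.close()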
The 'choking'
What you are seeing when your colleague runs a large query is record locking. Your database engine is - quite correctly - set up to provide an accurate view of your data, at any point. So, when a large query comes along the database engine first waits for all write locks (transactions) to clear, runs the large query and holds all future write locks until the query has run.
This actually happens for all transactions, but you only really notice it for the big ones.
What READ UNCOMMITTED does
By setting the transaction isolation level to READ UNCOMMITTED, you are telling the database engine that this transaction doesn't care about write locks and to go ahead and read anyway.
This is known as a 'dirty read', in that the long-running query could well read a table with a write lock on it and will ignore the lock. The data actually read could be the data before the write transaction has completed, or a different transaction could start and modify records before this query gets to it.
The data returned from anything with READ UNCOMMITTED is not guaranteed to be correct in the ACID sense of a database engine, but for some use cases it is good enough.
What the effect is
Your large queries magically run faster and don't lock the database while they are running.
Use with caution and understand what it does before you use it though.
MySQL Manual on transaction isolation levels
I need to test Django application behavior under concurrent requests, and I need to check that the database data is correct afterwards. In other words, I also need to test the transaction mechanism. So let's use TransactionTestCase for that.
I spawned requests to the database using threading and got a 'DatabaseError: no such table: app_modelname' exception in the threads, because Django automatically switches SQLite to an in-memory database when running tests.
Of course, I can specify 'TEST_NAME' for the 'default' key in settings.DATABASES, and the test will pass as expected. But that setting applies to all tests, so the whole test run takes much more time.
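For reference, the setting I mean looks roughly like this (a sketch for older Django versions where the key is spelled TEST_NAME; names and paths are placeholders):

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': 'db.sqlite3',
        # Forces the SQLite test database onto disk instead of ':memory:',
        # which slows down every test, not just the TransactionTestCase ones.
        'TEST_NAME': 'test_db.sqlite3',
    }
}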
I thought about a custom database router, but that seems like a very hackish way to do it, since I would need to patch/mock a lot of attributes on the models to determine whether the code is executing inside a TransactionTestCase or not.
There was also the idea of using override_settings, but unfortunately it does not work (see the issue for details).
How can I specify (or create) an on-disk (not in-memory) database for just a few test cases (TransactionTestCase) and leave the others running against the in-memory database?
Any thoughts, ideas or code samples will be appreciated.
Thanks!
I understand how to save a Redis database using BGSAVE. However, once my database server restarts, how do I tell if a saved database is present, and how do I load it into my application? I can tolerate a few minutes of lost data, so I don't need to worry about an AOF, but I cannot tolerate the loss of, say, an hour's worth of data. So doing a BGSAVE once an hour would work for me. I just don't see how to reload the data back into the database.
If it makes a difference, I am working in Python.
You can stop Redis and replace dump.rdb in /var/lib/redis (or whatever file the dbfilename variable points to in your redis.conf), then start Redis again. Redis loads the RDB file automatically on startup, so there is nothing else to do on the application side.
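Since you mentioned Python, here is a minimal sketch of the hourly-snapshot side using redis-py (bgsave and lastsave are redis-py methods; the loop and connection details are just illustrative):

import time
import redis

r = redis.Redis(host='localhost', port=6379)

while True:
    before = r.lastsave()   # datetime of the last completed save
    r.bgsave()              # ask Redis to fork and write dump.rdb in the background
    # Wait until LASTSAVE advances, i.e. the snapshot has finished.
    while r.lastsave() == before:
        time.sleep(1)
    time.sleep(3600)        # roughly one snapshot per hour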