In the case of SQLite, it is not clear whether we can easily commit right after each dataframe insert (assuming that auto-commit is off by default, following the Python DB-API convention).
Using the simplest SQLAlchemy API flow:
db_engine = db.create_engine(...)
for ...:
    # slowly compute some_df, takes a lot of time
    some_df.to_sql(..., con=db_engine)
How can we make sure that every .to_sql is committed?
For motivation, imagine that each write reflects the result of a potentially very long computation. We do not want to lose a huge batch of such computations, nor any single one of them, in case a machine goes down or in case the Python SQLAlchemy engine object is garbage collected before all its writes have actually drained into the database.
I believe auto-commit is off by default, and for SQLite there is no way of changing that in the create_engine call. What might be the simplest, safest way of adding auto-commit behavior ― or explicitly committing after every dataframe write ― when using the simple .to_sql API?
Or must the code be refactored to use a different api flow to accomplish that?
You can set the engine to autocommit by:
db_engine = db_engine.execution_options(autocommit=True)
From https://docs.sqlalchemy.org/en/13/core/connections.html#understanding-autocommit:
The “autocommit” feature is only in effect when no Transaction has otherwise been declared. This means the feature is not generally used with the ORM, as the Session object by default always maintains an ongoing Transaction.
In your code you have not presented any explicit transactions, and so the engine used as the con is in autocommit mode (as implemented by SQLA).
Note that SQLAlchemy implements its own autocommit that is independent from the DB-API driver's possible autocommit / non-transactional features.
Hence the "the simplest, safest way for adding auto-commit behavior ― or explicitly committing after every dataframe write" is what you already had, unless to_sql() emits some funky statements that SQLA does not recognize as data changing operations, which it has not, at least of late.
It might be that the SQLA autocommit feature is on the way out in the next major release, but we'll have to wait and see.
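If you would rather not rely on SQLAlchemy's library-level autocommit at all (for example, should it eventually go away), an alternative is to hand .to_sql() a connection inside an explicit per-write transaction. A rough sketch only; the database URL, table name, work_items and run_long_computation below are placeholders, not part of the question's code:

import sqlalchemy as db

db_engine = db.create_engine("sqlite:///results.db")  # placeholder URL

for params in work_items:                       # placeholder iterable
    # Slowly compute the dataframe; this is the long-running step.
    some_df = run_long_computation(params)      # placeholder function
    # engine.begin() opens a transaction and commits it as soon as the
    # block exits without an exception, so each dataframe is durably
    # written before the next computation starts.
    with db_engine.begin() as connection:
        some_df.to_sql("results", con=connection, if_exists="append", index=False)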
TL;DR:
How do I prevent DB access issues when calling a PostgreSQL database from multiple threads in Python using SQLAlchemy?
The details:
I am developing Python software that uses multithreading (a concurrent.futures ThreadPoolExecutor) - but I am by no means an expert in anything.
I also use SQLAlchemy to communicate with a PostgreSQL database (using pg8000).
Because I wanted to keep all my database stuff separate from all the rest, all the SQLAlchemy code sits in a Python module that I called db_manager.py. In here you will find the declarative base, the create_engine() call but also loads of methods to get stuff or store stuff in the database. Each method here ends with:
session.commit()
(unless I just query the database).
Each thread then would call the db_manager module to interact with the database, e.g.:
db_manager.getSomethingFromDB(...)
I created a little drawing to illustrate the architecture.
The problem:
Now the problem I run into is that these database calls seem to clash sometimes.
What is the best way of dealing with multithreading, SQLAlchemy, and PostgreSQL?
Some ideas:
Currently, my db_manager accesses the PostgreSQL database as a specific user (pg8000 appears to require this). Is that a problem? Should each thread have its own user? Or could this not be the cause of the problem? If each thread needs its own database user, I would probably no longer be able to keep all my database stuff in one single module.
I failed to define rollbacks for each commit. I just noticed this is causing problems, since any error prevents any further database access.
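For context, this is roughly the per-call session pattern I have in mind for db_manager.py, with a rollback on failure (the connection URL and the model are placeholders); I am unsure whether this is enough to make the calls safe across threads:

from contextlib import contextmanager

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class SomeModel(Base):
    __tablename__ = "some_table"  # placeholder model/table
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("postgresql+pg8000://user:password@localhost/mydb")  # placeholder URL
Session = sessionmaker(bind=engine)

@contextmanager
def session_scope():
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

def getSomethingFromDB(some_id):
    # Every call (and therefore every thread) gets its own short-lived
    # session, so no Session object is ever shared between threads.
    with session_scope() as session:
        return session.query(SomeModel).filter_by(id=some_id).count()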
In my Django (1.9) project I need to construct a table from an expensive JOIN. I would therefore like to store the table in the DB and only redo the query if the tables involved in the JOIN change. As I need the table as a basis for later JOIN operations I definitely want to store it in my database and not in any cache.
The problem I'm facing is that I'm not sure how to determine whether the data in the tables has changed. Connecting to the post_save and post_delete signals of the respective models does not seem right, since the models might be updated in bulk via CSV upload and I don't want the expensive query to be fired each time a new row is imported, because the DB table will change again right away. My current approach is to check whether the data has changed at a certain time interval, which would be perfectly fine for me. For this purpose I use a new thread, which compares the checksums of the involved tables (see code below). As I'm not really familiar with multithreading, especially on web servers, I do not know whether this is acceptable. My questions therefore:
Is the threading approach acceptable for running this single task?
Would a Distributed Task Queue like Celery be more appropriate?
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
This is my current code:
import threading
import time

from django.apps import apps
from .models import SomeModel


def check_for_table_change():
    app_label = SomeModel._meta.app_label

    def join():
        """Join the tables and save the resulting table to the DB."""
        ...

    def get_involved_models(app_label):
        """Get all the models that are involved in the join."""
        ...

    involved_models = get_involved_models(app_label)
    involved_dbtables = tuple(model._meta.db_table for model in involved_models)
    sql = 'CHECKSUM TABLE %s' % ', '.join(involved_dbtables)
    old_checksums = None
    while True:
        # Get the result of the query as named tuples.
        # (from_db is a project helper that runs the raw SQL and returns rows.)
        checksums = from_db(sql, fetch_as='namedtuple')
        if old_checksums is not None:
            # Compare checksums.
            for pair in zip(checksums, old_checksums):
                if pair[0].Checksum != pair[1].Checksum:
                    print('db changed, table is rejoined')
                    join()
                    break
        old_checksums = checksums
        time.sleep(60)


check_tables_thread = threading.Thread(target=check_for_table_change)
check_tables_thread.start()
I'm grateful for any suggestions.
Materialized Views and PostgreSQL
If you were on PostgreSQL, you could have used what's known as a materialized view. You can create a view based on your join and it would exist almost like a real table. This is very different from a normal view, where the query needs to be executed each and every time the view is used. Now the bad news: MySQL does not have materialized views.
If you switched to PostgreSQL, you might even find that materialized views are not needed after all. That's because PostgreSQL can use more than one index per table in a query, so the join that seems slow on MySQL at the moment might be made to run faster with better use of indexes on PostgreSQL. Of course this is very dependent on what your structure is like.
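For illustration, on PostgreSQL the materialized view could be created and refreshed with raw SQL from Django. This is only a sketch; the view and table names below are made up:

from django.db import connection

def create_joined_view():
    # One-off: store the result of the expensive join as real rows on disk
    # (requires PostgreSQL; IF NOT EXISTS needs 9.5 or later).
    with connection.cursor() as cursor:
        cursor.execute("""
            CREATE MATERIALIZED VIEW IF NOT EXISTS joined_data AS
            SELECT a.id, a.name, b.value
            FROM app_table_a AS a
            JOIN app_table_b AS b ON b.a_id = a.id
        """)

def refresh_joined_view():
    # Re-run the underlying join and replace the stored rows.
    with connection.cursor() as cursor:
        cursor.execute("REFRESH MATERIALIZED VIEW joined_data")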
Signals vs Triggers
The problem I'm facing is that I'm not sure how to determine whether
the data in the tables have changed. Connecting to the post_save,
post_delete signals of the respective models seems not to be right
since the models might be updated in bulk via CSV upload and I don't
want the expensive query to be fired each time a new row is imported,
because the DB table will change right away.
As you have rightly determined, Django signals aren't the right way. This is the sort of task that is best done at the database level. Since you don't have materialized views, this is a job for triggers. However, there is a lot of hard work involved either way (whether you use triggers or signals).
Is the threading approach acceptable for running this single task?
Why not use Django as a CLI here? That effectively means a Django script is invoked by cron or executed by some other mechanism independently of your website.
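For example, a minimal management command (the app layout and command name here are invented) could live in yourapp/management/commands/rebuild_join_table.py and be run from cron with python manage.py rebuild_join_table:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Recompute the joined table if the source tables have changed."

    def handle(self, *args, **options):
        # Reuse the checksum comparison and join() logic from the question,
        # but as a one-shot run instead of an endless loop in a thread.
        ...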
Would a Distributed Task Queue like Celery be more appropriate?
Very much so. Each time the data changes, you can fire off a task that does the update of the table.
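Something along these lines (a sketch only; the module layout, broker URL and task name are assumptions):

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # assumed broker

@app.task
def rebuild_join_table():
    # Recompute the expensive join and store the result, reusing the same
    # join() logic from the question.
    ...

The CSV import view would then call rebuild_join_table.delay() once, after the whole upload has finished, rather than once per imported row.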
Is there any way to disconnect a signal for a certain time after it is received, so that a bulk upload does not trigger the signal over and over again?
Keyword here is 'TRIGGER' :-)
Alternatives.
Having said all that, doing a join and physically populating a table is going to be very, very slow if your table grows to even a few thousand rows. This is because you will need an elaborate query to determine which records have changed (unless you use a separate queue for that). You would then need to insert or update the records in the 'join table'; update/insert is generally slower than retrieval, so as the size of the data grows this would become progressively worse.
The real solution may be to optimize your queries and/or tables. May I suggest you post a new question with the slow query and also share your table structures?
One of my methods doesn't work when run in an atomic context. I want to ask Django whether it is currently running a transaction.
The method can create a thread or a process and saves the result to the database. This is a bit odd, but there is a huge performance benefit when a process can be used.
I find that processes in particular are a bit sketchy with Django. I know that Django will raise an exception if the method chooses to save the results in a process while it is run in an atomic context.
If I can check for an atomic context, then I can throw an exception straight away (instead of getting odd errors) or force the method to only create a thread.
I found the is_managed() method but according to this question it's been removed in Django 1.8.
According to this ticket there are a couple ways to detect this: not transaction.get_autocommit() (using a public API) or transaction.get_connection().in_atomic_block (using a private API).
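A small sketch of how this could be used (in_atomic_context and save_result_in_process are made-up names for illustration):

from django.db import transaction

def in_atomic_context(using=None):
    # Private API: True while code is running inside transaction.atomic().
    return transaction.get_connection(using).in_atomic_block

def save_result_in_process(result):
    if in_atomic_context():
        # Fail fast with a clear message instead of the odd errors that
        # surface later when a separate process touches the database.
        raise RuntimeError("cannot spawn a process inside transaction.atomic()")
    ...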
I have a question about SQLAlchemy and object refreshing.
I am in a situation where I have two sessions, and the same object has been queried in both. For particular reasons I cannot close one of the sessions.
I have modified the object and committed the changes in session A, but in session B the attributes are still the initial ones, without the modifications.
Should I implement a notification system to communicate changes, or is there a built-in way to do this in SQLAlchemy?
Sessions are designed to work like this. The attributes of the object in Session B will keep the values they had when the object was first queried in Session B. Additionally, SQLAlchemy will not attempt to automatically refresh objects in other sessions when they change, nor do I think it would be wise to try to create something like this.
You should be actively thinking of the lifespan of each session as a single transaction in the database. How and when sessions need to deal with the fact that their objects might be stale is not a technical problem that can be solved by an algorithm built into SQLAlchemy (or any extension for SQLAlchemy): it is a "business" problem whose solution you must determine and code yourself. The "correct" response might be to say that this isn't a problem: the logic that occurs with Session B could be valid if it used the data at the time that Session B started. Your "problem" might not actually be a problem. The docs actually have an entire section on when to use sessions, but it gives a pretty grim response if you are hoping for a one-size-fits-all solution...
A Session is typically constructed at the beginning of a logical
operation where database access is potentially anticipated.
The Session, whenever it is used to talk to the database, begins a
database transaction as soon as it starts communicating. Assuming the
autocommit flag is left at its recommended default of False, this
transaction remains in progress until the Session is rolled back,
committed, or closed. The Session will begin a new transaction if it
is used again, subsequent to the previous transaction ending; from
this it follows that the Session is capable of having a lifespan
across many transactions, though only one at a time. We refer to these
two concepts as transaction scope and session scope.
The implication here is that the SQLAlchemy ORM is encouraging the
developer to establish these two scopes in his or her application,
including not only when the scopes begin and end, but also the expanse
of those scopes, for example should a single Session instance be local
to the execution flow within a function or method, should it be a
global object used by the entire application, or somewhere in between
these two.
The burden placed on the developer to determine this scope is one area
where the SQLAlchemy ORM necessarily has a strong opinion about how
the database should be used. The unit of work pattern is specifically
one of accumulating changes over time and flushing them periodically,
keeping in-memory state in sync with what’s known to be present in a
local transaction. This pattern is only effective when meaningful
transaction scopes are in place.
That said, there are a few things you can do to change how the situation works:
First, you can reduce how long your session stays open. Session B is querying the object, then later you are doing something with that object (in the same session) that you want to have the attributes be up to date. One solution is to have this second operation done in a separate session.
Another is to use the expire/refresh methods, as the docs show...
# immediately re-load attributes on obj1, obj2
session.refresh(obj1)
session.refresh(obj2)
# expire objects obj1, obj2, attributes will be reloaded
# on the next access:
session.expire(obj1)
session.expire(obj2)
You can use session.refresh() to immediately get an up-to-date version of the object, even if the session already queried the object earlier.
Run this to force the session to fetch the latest values from your database of choice:
session.expire_all()
Excellent doc about the default behavior and lifespan of a session.
I just had this issue and the existing solutions didn't work for me for some reason. What did work was to call session.commit(). After calling that, the object had the updated values from the database.
TL;DR: Rather than working on Session synchronization, see if your task can reasonably be coded with SQLAlchemy Core syntax, directly on the Engine, without the use of (multiple) Sessions.
For someone coming from SQL and JDBC experience, one critical thing to learn about SQLAlchemy, which, unfortunately, I didn't clearly come across in months of reading through the various documents, is that SQLAlchemy consists of two fundamentally different parts: the Core and the ORM. As the ORM documentation is listed first on the website and most examples use the ORM-like syntax, one gets thrown into working with it and sets oneself up for errors and confusion - if thinking about the ORM in terms of SQL/JDBC. The ORM uses its own abstraction layer that takes complete control over how and when actual SQL statements are executed. The rule of thumb is that a Session is cheap to create and kill, and it should never be re-used across anything in the program's flow and logic that may involve re-querying, synchronization or multi-threading. On the other hand, the Core is direct no-frills SQL, very much like a JDBC driver. There is one place in the docs I found that "suggests" using Core over ORM:
it is encouraged that simple SQL operations take place here, directly on the Connection, such as incrementing counters or inserting extra rows within log
tables. When dealing with the Connection, it is expected that Core-level SQL
operations will be used; e.g. those described in SQL Expression Language Tutorial.
Although, it appears that using a Connection causes the same side effect as using a Session: a re-query of a specific record returns the same result as the first query, even if the record's content in the DB was changed. So, apparently Connections are as "unreliable" as Sessions for reading DB content in "real time", but a direct Engine execution seems to work fine, as it picks a Connection object from the pool (assuming that the retrieved Connection would never be in the same "reuse" state relative to the query as the specific open Connection). The Result object should be closed explicitly, as per the SA docs.
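As a rough illustration of the Core style described above (SQLAlchemy 1.x syntax; the table, columns and connection string are made up):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select

engine = create_engine("postgresql+pg8000://user:password@localhost/mydb")  # placeholder URL
metadata = MetaData()
accounts = Table(
    "accounts", metadata,            # placeholder table definition
    Column("id", Integer, primary_key=True),
    Column("name", String),
)

# Each engine.execute() checks a connection out of the pool, runs the
# statement, and returns the connection, so a repeated read picks up
# changes committed by other sessions in the meantime.
result = engine.execute(select([accounts]).where(accounts.c.id == 1))
row = result.fetchone()
result.close()  # close the result explicitly, as noted above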
What is your isolation level set to?
SHOW GLOBAL VARIABLES LIKE 'transaction_isolation';
By default, MySQL InnoDB transaction_isolation is set to REPEATABLE-READ.
+-----------------------+-----------------+
| Variable_name | Value |
+-----------------------+-----------------+
| transaction_isolation | REPEATABLE-READ |
+-----------------------+-----------------+
Consider setting it to READ-COMMITTED.
You can set this for your sqlalchemy engine only via:
create_engine("mysql://<connection_string>", isolation_level="READ COMMITTED")
I think another option is:
engine = create_engine("mysql://<connection_string>")
engine = engine.execution_options(isolation_level="READ COMMITTED")  # execution_options() returns a new Engine, so reassign it
Or set it globally in the DB via:
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
and
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#setting-transaction-isolation-levels-dbapi-autocommit
If you have added the incorrect model to the session, you can do:
db.session.rollback()