python sqlalchemy parallel operation


Hi, I have a multi-threaded program in which all threads operate on an Oracle
DB. Can SQLAlchemy support parallel operations on Oracle?
Thanks!

OCI (oracle client interface) has a parameter OCI_THREADED which has the effect of connections being mutexed, such that concurrent access via multiple threads is safe. This is likely the setting the document you saw was referring to.
cx_Oracle, which is essentially a Python->OCI bridge, provides access to this setting in its connection function using the keyword argument "threaded", described at http://cx-oracle.sourceforge.net/html/module.html#cx_Oracle.connect . The docs state that it is False by default because it results in a "10-15% performance penalty", though no source is given for this information (and performance stats should always be viewed suspiciously as a rule).
As for SQLAlchemy, the cx_Oracle dialect provided with SQLAlchemy sets this value to True by default, with the option to set it back to False when setting up the engine via create_engine(), so at that level there's no issue.
But beyond that, SQLAlchemy's recommended usage patterns (i.e. one Session per thread, keeping connections local to a pool where they are checked out by a function as needed) prevent concurrent access to a connection in any case. So you can likely turn off the "threaded" setting on create_engine() and enjoy the possibly-tangible performance increases provided regular usage patterns are followed.

As long as each concurrent thread has its own session you should be fine. Trying to use one shared session is where you'll get into trouble.
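A minimal sketch of the one-connection-per-thread idea, using the standard-library sqlite3 module as a stand-in for an Oracle connection or a SQLAlchemy Session. That substitution, the events table, the temporary database file, and the helper names get_connection, worker and run_demo are all assumptions made for illustration; none of them is part of SQLAlchemy.

```python
import os
import sqlite3
import tempfile
import threading

# Hypothetical database file used only for this illustration.
DB_PATH = os.path.join(tempfile.mkdtemp(), "per_thread_demo.db")

_local = threading.local()  # each thread sees its own attributes on this object


def get_connection():
    """Return the calling thread's private connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(DB_PATH, timeout=30)
        _local.conn = conn
    return conn


def worker(worker_id):
    conn = get_connection()  # never handed to another thread
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT INTO events (worker_id) VALUES (?)", (worker_id,))
    conn.close()
    _local.conn = None


def run_demo(n_threads=4):
    setup = sqlite3.connect(DB_PATH)
    with setup:
        setup.execute("CREATE TABLE IF NOT EXISTS events (worker_id INTEGER)")
        setup.execute("DELETE FROM events")
    setup.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    check = sqlite3.connect(DB_PATH)
    rows = sorted(row[0] for row in check.execute("SELECT worker_id FROM events"))
    check.close()
    return rows


if __name__ == "__main__":
    print(run_demo())
```

The point of the thread-local lookup is that no code path can accidentally pick up a connection that belongs to another thread, which is the same property the "one Session per thread" advice is after.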

Related

Two flask apps using one database

Hello, I don't think this is the right place for this question, but I don't know where else to ask it. I want to make a website and an API for that website using the same SQLAlchemy database. Would running them at the same time, independently, be safe, or would this cause corruption from two writes happening at the same time?

SQLA is a Python wrapper for SQL. It is not its own database. If you're running your website (perhaps Flask?) and managing your API from the same script, you can simply use the same reference to your instance of SQLA. Meaning, when you use SQLA to connect to a database and save to a variable, what is really happening is it saves the connection to a variable, and you continually reference that variable, as opposed to the more inefficient method of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two write happening at the same time.
To avoid such situations on a file, a process can take a "lock" on it: depending on the OS and the kind of lock, other processes are then refused access (or made to wait) until the lock is released. Note that simply opening a file in Python with with open(filename): does not take such a lock. SQLite, however, takes file locks itself while a connection is writing, so a second connection that tries to write at the same moment may get a "database is locked" error. This may be the real issue you might encounter when using two simultaneous connections to a SQLite db.
However, if you are using something like MySQL, where you connect to a SQL server process, and NOT a file, since there is no direct access to a file, there will be no file lock on the database, and you may run into that nasty race condition in the following made-up scenario:
Stack Overflow queries the reputation of an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, Someone upvotes an answer made by that account that sets it one point under the account ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by StackOverflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote processes before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends on the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly orders database reads and writes. However, there are still semantic issues that may arise, such as the example given above.
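One common way to narrow this kind of semantic race is to fold the check and the action into a single SQL statement, so that no other write can land between "read the reputation" and "ban the account". The sketch below uses sqlite3 with an in-memory database purely for illustration; the accounts table, the BAN_THRESHOLD rule and the function names are hypothetical and not taken from the answer above.

```python
import sqlite3

BAN_THRESHOLD = 0  # hypothetical rule: ban when reputation is below this value


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, reputation INTEGER, banned INTEGER DEFAULT 0)"
    )
    conn.execute("INSERT INTO accounts (id, reputation) VALUES (1, -1)")
    conn.commit()
    return conn


def ban_if_below_threshold(conn, account_id):
    """Check and act in one statement instead of SELECT followed by UPDATE."""
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET banned = 1 WHERE id = ? AND reputation < ?",
            (account_id, BAN_THRESHOLD),
        )
    return cur.rowcount == 1  # True only if a row was actually changed


def upvote(conn, account_id, points=10):
    with conn:
        conn.execute(
            "UPDATE accounts SET reputation = reputation + ? WHERE id = ?",
            (points, account_id),
        )


if __name__ == "__main__":
    conn = make_db()
    upvote(conn, 1)
    print(ban_if_below_threshold(conn, 1))
```

The outcome still depends on which request the database processes first, but the ban decision is never based on a reputation value that has since changed.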

Two applications can use the same database as the DB is a separate application that will be accessed by each flask app.
What you are asking can be done and is the methodology used by many large web applications, especially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests put to it and perform them in the correct order while ensuring data reliability.
One question to ask, though, is whether it is useful to write two separate applications. For most Flask-only projects the best approach would be to separate the project using blueprints, having a “main” blueprint and an “api” blueprint.

Are transactions in SQLAlchemy thread safe?

I am developing a web app using SQLAlchemy's expression language, not its ORM. I want to use multiple threads in my app, but I'm not sure about thread safety. I am using this section of the documentation to make a connection. I think this is thread safe because I reference a specific connection in each request. Is this thread safe?

The docs for connections and sessions say that neither is thread safe or intended to be shared between threads.
The Connection object is not thread-safe. While a Connection can be shared among threads using properly synchronized access, it is still possible that the underlying DBAPI connection may not support shared access between threads. Check the DBAPI documentation for details.
The Session is very much intended to be used in a non-concurrent fashion, which usually means in only one thread at a time.
The Session should be used in such a way that one instance exists for a single series of operations within a single transaction.
The bigger point is that you should not want to use the session with multiple concurrent threads.
There is no guarantee when using the same connection (and transaction context) in more than one thread that the behavior will be correct or consistent.
You should use one connection or session per thread. If you need guarantees about the data, you should set the isolation level for the engine or session. For web applications, SQLAlchemy suggests using one connection per request cycle.
This simple correspondence of web request and thread means that to associate a Session with a thread implies it is also associated with the web request running within that thread, and vice versa, provided that the Session is created only after the web request begins and torn down just before the web request ends.

I think you are confusing atomicity with isolation.
Atomicity is usually handled through transactions and concerns integrity.
Isolation is about concurrent read/write to a database table (thus thread safety). For example: if you want to increment an int field of a table's record, you will have to select the record's field, increment the value and update it. If multiple threads are doing this concurrently the result will depend on the order of the reads/writes.
http://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=isolation#engine-creation-api
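The increment case described above can be sketched with plain sqlite3 (chosen here only so the example runs without a database server; the counters table and function names are hypothetical). Instead of selecting the value, adding one in Python and writing it back, each thread lets the database do the read-modify-write in a single UPDATE statement, so concurrent increments are not lost:

```python
import os
import sqlite3
import tempfile
import threading


def atomic_increment(conn, name):
    # The read, the addition and the write all happen inside one statement.
    with conn:
        conn.execute(
            "UPDATE counters SET value = value + 1 WHERE name = ?", (name,)
        )


def run_increments(n_threads=4, per_thread=25):
    path = os.path.join(tempfile.mkdtemp(), "counter_demo.db")
    setup = sqlite3.connect(path)
    with setup:
        setup.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)")
        setup.execute("INSERT INTO counters VALUES ('hits', 0)")
    setup.close()

    def worker():
        # One connection per thread; timeout is how long to wait for a lock.
        conn = sqlite3.connect(path, timeout=30)
        for _ in range(per_thread):
            atomic_increment(conn, "hits")
        conn.close()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    check = sqlite3.connect(path)
    (value,) = check.execute(
        "SELECT value FROM counters WHERE name = 'hits'"
    ).fetchone()
    check.close()
    return value


if __name__ == "__main__":
    print(run_increments())
```

With a separate SELECT followed by an UPDATE that writes a value computed in Python, two threads could read the same starting value and one increment would be lost unless the isolation level or explicit locking prevents it.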

About refreshing objects in sqlalchemy session

I have a question about SQLAlchemy and refreshing objects.
I have two sessions, and the same object has been queried in both. For a particular reason I cannot close one of the sessions.
I have modified the object and committed the changes in session A, but in session B the attributes still have their initial values, without the modifications.
Should I implement a notification system to communicate changes, or is there a built-in way to do this in SQLAlchemy?

Sessions are designed to work like this. The attributes of the object in Session B WILL keep what it had when first queried in Session B. Additionally, SQLAlchemy will not attempt to automatically refresh objects in other sessions when they change, nor do I think it would be wise to try to create something like this.
You should be actively thinking of the lifespan of each session as a single transaction in the database. How and when sessions need to deal with the fact that their objects might be stale is not a technical problem that can be solved by an algorithm built into SQLAlchemy (or any extension for SQLAlchemy): it is a "business" problem whose solution you must determine and code yourself. The "correct" response might be to say that this isn't a problem: the logic that occurs with Session B could be valid if it used the data at the time that Session B started. Your "problem" might not actually be a problem. The docs actually have an entire section on when to use sessions, but it gives a pretty grim response if you are hoping for a one-size-fits-all solution...
A Session is typically constructed at the beginning of a logical
operation where database access is potentially anticipated.
The Session, whenever it is used to talk to the database, begins a
database transaction as soon as it starts communicating. Assuming the
autocommit flag is left at its recommended default of False, this
transaction remains in progress until the Session is rolled back,
committed, or closed. The Session will begin a new transaction if it
is used again, subsequent to the previous transaction ending; from
this it follows that the Session is capable of having a lifespan
across many transactions, though only one at a time. We refer to these
two concepts as transaction scope and session scope.
The implication here is that the SQLAlchemy ORM is encouraging the
developer to establish these two scopes in his or her application,
including not only when the scopes begin and end, but also the expanse
of those scopes, for example should a single Session instance be local
to the execution flow within a function or method, should it be a
global object used by the entire application, or somewhere in between
these two.
The burden placed on the developer to determine this scope is one area
where the SQLAlchemy ORM necessarily has a strong opinion about how
the database should be used. The unit of work pattern is specifically
one of accumulating changes over time and flushing them periodically,
keeping in-memory state in sync with what’s known to be present in a
local transaction. This pattern is only effective when meaningful
transaction scopes are in place.
That said, there are a few things you can do to change how the situation works:
First, you can reduce how long your session stays open. Session B is querying the object, then later you are doing something with that object (in the same session) that you want to have the attributes be up to date. One solution is to have this second operation done in a separate session.
Another is to use the expire/refresh methods, as the docs show...
# immediately re-load attributes on obj1, obj2
session.refresh(obj1)
session.refresh(obj2)
# expire objects obj1, obj2, attributes will be reloaded
# on the next access:
session.expire(obj1)
session.expire(obj2)
You can use session.refresh() to immediately get an up-to-date version of the object, even if the session already queried the object earlier.

Run this to force the session to reload the latest values from your database of choice on next access:
session.expire_all()
Excellent DOC about default behavior and lifespan of session

I just had this issue and the existing solutions didn't work for me for some reason. What did work was to call session.commit(). After calling that, the object had the updated values from the database.

TL;DR Rather than working on Session synchronization, see if your task can be reasonably easily coded with SQLAlchemy Core syntax, directly on the Engine, without the use of (multiple) Sessions
For someone coming from SQL and JDBC experience, one critical thing to learn about SQLAlchemy, which, unfortunately, I didn't clearly come across reading through the multiple documents for months, is that SQLAlchemy consists of two fundamentally different parts: the Core and the ORM. As the ORM documentation is listed first on the website and most examples use the ORM-like syntax, one gets thrown into working with it and sets oneself up for errors and confusion if thinking about the ORM in terms of SQL/JDBC. The ORM uses its own abstraction layer that takes complete control over how and when actual SQL statements are executed. The rule of thumb is that a Session is cheap to create and kill, and it should never be re-used for anything in the program's flow and logic that may cause re-querying, synchronization or multi-threading. On the other hand, the Core is the direct no-frills SQL, very much like a JDBC Driver. There is one place in the docs I found that "suggests" using Core over ORM:
it is encouraged that simple SQL operations take place here, directly on the Connection, such as incrementing counters or inserting extra rows within log
tables. When dealing with the Connection, it is expected that Core-level SQL
operations will be used; e.g. those described in SQL Expression Language Tutorial.
Although, it appears that using a Connection causes the same side effect as using a Session: re-query of a specific record returns the same result as the first query, even if the record's content in the DB was changed. So, apparently Connections are as "unreliable" as Sessions to read DB content in "real time", but a direct Engine execution seems to be working fine as it picks a Connection object from the pool (assuming that the retrieved Connection would never be in the same "reuse" state relatively to the query as the specific open Connection). The Result object should be closed explicitly, as per SA docs

What is your isolation level set to?
SHOW GLOBAL VARIABLES LIKE 'transaction_isolation';
By default mysql innodb transaction_isolation is set to REPEATABLE-READ.
+-----------------------+-----------------+
| Variable_name | Value |
+-----------------------+-----------------+
| transaction_isolation | REPEATABLE-READ |
+-----------------------+-----------------+
Consider setting it to READ-COMMITTED.
You can set this for your sqlalchemy engine only via:
create_engine("mysql://<connection_string>", isolation_level="READ COMMITTED")
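I think another option is:
engine = create_engine("mysql://<connection_string>")
# execution_options() returns a copy of the Engine with the option applied, so keep the result
engine = engine.execution_options(isolation_level="READ COMMITTED")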
Or set it globally in the DB via:
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
and
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#setting-transaction-isolation-levels-dbapi-autocommit

If you added the incorrect model to the session, you can do:
db.session.rollback()

Pymongo, connection pooling and asynchronous tasks via Celery

I'm using pymongo to access mongodb in an application that also uses Celery to perform many asynchronous tasks. I know pymongo's connection pooling does not support asynchronous workers (based on the docs).
To access collections I've got a Collection class wrapping certain logic that fits my application. I'm trying to make sense of some code that I inherited with this wrapper:
Each collection at the moment creates its own Connection instance. Based on what I'm reading this is wrong and I should really have a single Connection instance (in settings.py or such) and import it into my Collection instances. That bit is clear. Is there a guideline as far as the maximum connections recommended? The current code surely creates a LOT of connections/sockets as it's not really making use of the pooling facilities.
However, as some code is called from both asynchronous celery tasks as well as being run synchronously, I'm not sure how to handle this. My thought is to instantiate new Connection instances for the tasks and use the single one for the synchronous ones (ending_request of course after each activity is done). Is this the right direction?
Thanks!
Harel

From pymongo's docs : "PyMongo is thread-safe and even provides built-in connection pooling for threaded applications."
The word "asynchronous" in your situation translates into how much inconsistency your application can tolerate.
Statements like "x += 1" will never be consistent in your app. If you can afford this, there is no problem. If you have "critical" operations you must somehow implement some locks for synchronization.
As for the maximum connections, I don't know exact numbers, so test and proceed.
Also take a look at Redis and this example, if speed and memory efficiency are required. From some benchmarks I made, Redis python driver is at least 2x faster than pymongo, for reads/writes.

Python sqlite3 and concurrency

I have a Python program that uses the "threading" module. Once every second, my program starts a new thread that fetches some data from the web, and stores this data to my hard drive. I would like to use sqlite3 to store these results, but I can't get it to work. The issue seems to be about the following line:
conn = sqlite3.connect("mydatabase.db")
If I put this line of code inside each thread, I get an OperationalError telling me that the database file is locked. I guess this means that another thread has mydatabase.db open through a sqlite3 connection and has locked it.
If I put this line of code in the main program and pass the connection object (conn) to each thread, I get a ProgrammingError, saying that SQLite objects created in a thread can only be used in that same thread.
Previously I was storing all my results in CSV files, and did not have any of these file-locking issues. Hopefully this will be possible with sqlite. Any ideas?

Contrary to popular belief, newer versions of sqlite3 do support access from multiple threads.
This can be enabled via optional keyword argument check_same_thread:
sqlite3.connect(":memory:", check_same_thread=False)

You can use consumer-producer pattern. For example you can create queue that is shared between threads. First thread that fetches data from the web enqueues this data in the shared queue. Another thread that owns database connection dequeues data from the queue and passes it to the database.
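A rough sketch of that producer-consumer arrangement, under the assumption that the web fetch can be replaced by a stub: several fetcher threads only put results on a queue.Queue, and a single writer thread owns the sqlite3 connection and performs every insert. The results table, the _STOP sentinel and the function names are invented for this example.

```python
import queue
import sqlite3
import threading

_STOP = object()  # sentinel that tells the writer thread to shut down


def writer(db_path, work_queue, counts):
    """Single thread that owns the connection and performs every write."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS results (source TEXT, payload TEXT)")
    while True:
        item = work_queue.get()
        if item is _STOP:
            break
        conn.execute("INSERT INTO results VALUES (?, ?)", item)
        conn.commit()
    # Report the final row count back to the caller through a shared list.
    counts.append(conn.execute("SELECT COUNT(*) FROM results").fetchone()[0])
    conn.close()


def fetch(source, work_queue):
    # Stand-in for the real web fetch; it enqueues data and never touches the DB.
    work_queue.put((source, "data from " + source))


def run_pipeline(sources, db_path=":memory:"):
    work_queue = queue.Queue()
    counts = []
    writer_thread = threading.Thread(target=writer, args=(db_path, work_queue, counts))
    writer_thread.start()

    fetchers = [threading.Thread(target=fetch, args=(s, work_queue)) for s in sources]
    for t in fetchers:
        t.start()
    for t in fetchers:
        t.join()

    work_queue.put(_STOP)  # all producers are done, let the writer finish
    writer_thread.join()
    return counts[0]


if __name__ == "__main__":
    print(run_pipeline(["site-a", "site-b", "site-c"]))
```

Because the connection is created and used in one thread only, neither the "database is locked" error nor the same-thread ProgrammingError from the question can arise from the fetchers.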

The following found on mail.python.org.pipermail.1239789
I have found the solution. I don't know why python documentation has not a single word about this option. So we have to add a new keyword argument to connection function
and we will be able to create cursors out of it in different thread. So use:
sqlite.connect(":memory:", check_same_thread = False)
works out perfectly for me. Of course from now on I need to take care
of safe multithreading access to the db. Anyway thx all for trying to help.

Switch to multiprocessing. It is much better, scales well, can go beyond the use of multiple cores by using multiple CPUs, and the interface is the same as using python threading module.
Or, as Ali suggested, just use SQLAlchemy's thread pooling mechanism. It will handle everything for you automatically and has many extra features, just to quote some of them:
SQLAlchemy includes dialects for SQLite, Postgres, MySQL, Oracle, MS-SQL, Firebird, MaxDB, MS Access, Sybase and Informix; IBM has also released a DB2 driver. So you don't have to rewrite your application if you decide to move away from SQLite.
The Unit Of Work system, a central part of SQLAlchemy's Object Relational Mapper (ORM), organizes pending create/insert/update/delete operations into queues and flushes them all in one batch. To accomplish this it performs a topological "dependency sort" of all modified items in the queue so as to honor foreign key constraints, and groups redundant statements together where they can sometimes be batched even further. This produces the maximum efficiency and transaction safety, and minimizes chances of deadlocks.

You shouldn't be using threads at all for this. This is a trivial task for twisted and that would likely take you significantly further anyway.
Use only one thread, and have the completion of the request trigger an event to do the write.
twisted will take care of the scheduling, callbacks, etc... for you. It'll hand you the entire result as a string, or you can run it through a stream-processor (I have a twitter API and a friendfeed API that both fire off events to callers as results are still being downloaded).
Depending on what you're doing with your data, you could just dump the full result into sqlite as it's complete, cook it and dump it, or cook it while it's being read and dump it at the end.
I have a very simple application that does something close to what you're wanting on github. I call it pfetch (parallel fetch). It grabs various pages on a schedule, streams the results to a file, and optionally runs a script upon successful completion of each one. It also does some fancy stuff like conditional GETs, but still could be a good base for whatever you're doing.

Or if you are lazy, like me, you can use SQLAlchemy. It will handle the threading for you, (using thread local, and some connection pooling) and the way it does it is even configurable.
For added bonus, if/when you realise/decide that using Sqlite for any concurrent application is going to be a disaster, you won't have to change your code to use MySQL, or Postgres, or anything else. You can just switch over.

You need to call session.close() after every transaction against the database, so that each cursor is used only within the thread that created it. Using the same cursor from multiple threads is what causes this error.

Use threading.Lock()
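One way this suggestion might look in practice, combined with the check_same_thread=False option mentioned in other answers: a single shared sqlite3 connection whose every use is wrapped in a threading.Lock. The LockedConnection class and the log table are hypothetical names for this sketch, not an API of the sqlite3 module.

```python
import sqlite3
import threading


class LockedConnection:
    """Hypothetical wrapper: one shared connection, one lock around every use."""

    def __init__(self, path):
        # check_same_thread=False disables the module's same-thread check,
        # so serializing access becomes this class's responsibility.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()

    def execute(self, sql, params=()):
        with self._lock:  # only one thread talks to the connection at a time
            rows = self._conn.execute(sql, params).fetchall()
            self._conn.commit()
            return rows


def run_demo(n_threads=8):
    db = LockedConnection(":memory:")
    db.execute("CREATE TABLE log (thread_no INTEGER)")

    threads = [
        threading.Thread(target=db.execute, args=("INSERT INTO log VALUES (?)", (i,)))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    return db.execute("SELECT COUNT(*) FROM log")[0][0]


if __name__ == "__main__":
    print(run_demo())
```

The lock serializes all database work, so this trades concurrency for simplicity; it suits cases where the database calls are short compared with the rest of each thread's work.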

I could not find any benchmarks in any of the above answers so I wrote a test to benchmark everything.
I tried 3 approaches
Reading and writing sequentially from the SQLite database
Using a ThreadPoolExecutor to read/write
Using a ProcessPoolExecutor to read/write
The results and takeaways from the benchmark are as follows
Sequential reads/sequential writes work the best
If you must process in parallel, use the ProcessPoolExecutor to read in parallel
Do not perform any writes either using the ThreadPoolExecutor or using the ProcessPoolExecutor as you will run into database locked errors and you will have to retry inserting the chunk again
The code and complete solution for the benchmarks are in my SO answer. Hope that helps!

Scrapy seems like a potential answer to my question. Its home page describes my exact task. (Though I'm not sure how stable the code is yet.)

I would take a look at the y_serial Python module for data persistence: http://yserial.sourceforge.net
which handles deadlock issues surrounding a single SQLite database. If demand on concurrency gets heavy one can easily set up the class Farm of many databases to diffuse the load over stochastic time.
Hope this helps your project... it should be simple enough to implement in 10 minutes.

I like Evgeny's answer - Queues are generally the best way to implement inter-thread communication. For completeness, here are some other options:
Close the DB connection when the spawned threads have finished using it. This would fix your OperationalError, but opening and closing connections like this is generally a No-No, due to performance overhead.
Don't use child threads. If the once-per-second task is reasonably lightweight, you could get away with doing the fetch and store, then sleeping until the right moment. This is undesirable as fetch and store operations could take >1sec, and you lose the benefit of multiplexed resources you have with a multi-threaded approach.

You need to design the concurrency for your program. SQLite has clear limitations and you need to obey them, see the FAQ (also the following question).

Please consider checking the value of THREADSAFE for the pragma_compile_options of your SQLite installation. For instance, with
SELECT * FROM pragma_compile_options;
If THREADSAFE is equal to 1, then your SQLite installation is threadsafe, and all you gotta do to avoid the threading exception is to create the Python connection with check_same_thread equal to False. In your case, it means
conn = sqlite3.connect("mydatabase.db", check_same_thread=False)
That's explained in some detail in Python, SQLite, and thread safety
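The same check can be done from Python. The sketch below uses the PRAGMA compile_options form of the query and a hypothetical helper name, sqlite_threadsafe_option; it returns None when the SQLite build does not report the option at all.

```python
import sqlite3


def sqlite_threadsafe_option():
    """Return the THREADSAFE compile option as a string, or None if not reported."""
    conn = sqlite3.connect(":memory:")
    try:
        options = [row[0] for row in conn.execute("PRAGMA compile_options")]
    finally:
        conn.close()
    for option in options:
        if option.startswith("THREADSAFE="):
            return option.split("=", 1)[1]
    return None


if __name__ == "__main__":
    print("SQLite THREADSAFE compile option:", sqlite_threadsafe_option())
    # DB-API thread-safety level reported by the sqlite3 module itself.
    print("sqlite3.threadsafety:", sqlite3.threadsafety)
```

Even when the library is compiled thread-safe, passing check_same_thread=False only removes Python's guard; coordinating writes between threads is still up to the application.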

The most likely reason you get errors with locked databases is that you must issue
conn.commit()
after finishing a database operation. If you do not, your database will be write-locked and stay that way. The other threads that are waiting to write will time-out after a time (default is set to 5 seconds, see http://docs.python.org/2/library/sqlite3.html#sqlite3.connect for details on that).
An example of a correct and concurrent insertion would be this:
import sqlite3
import threading

class InsertionThread(threading.Thread):
    def __init__(self, number):
        super().__init__()
        self.number = number

    def run(self):
        # each thread opens its own connection; timeout is the lock wait in seconds
        conn = sqlite3.connect('yourdb.db', timeout=5)
        conn.execute('CREATE TABLE IF NOT EXISTS threadcount (threadnum, count);')
        conn.commit()
        for i in range(1000):
            conn.execute("INSERT INTO threadcount VALUES (?, ?);", (self.number, i))
            # commit after every operation so the write lock is released
            conn.commit()
        conn.close()

# create as many of these as you wish
# but be careful to set the timeout value appropriately: thread switching in
# python takes some time
for i in range(2):
    t = InsertionThread(i)
    t.start()
If you like SQLite, or have other tools that work with SQLite databases, or want to replace CSV files with SQLite db files, or must do something rare like inter-platform IPC, then SQLite is a great tool and very fitting for the purpose. Don't let yourself be pressured into using a different solution if it doesn't feel right!
