I use SQLAlchemy to connect to my database backend and make heavy use of multiprocessing in my Python application. I have run into a situation that requires passing an object reference, which is the result of a database query, from one process to another.
This is a problem, because when an attribute of the object is accessed, SQLAlchemy tries to reattach the object to the current session of the other process, which fails with an exception because the object is still attached to another session:
InvalidRequestError: Object '<Field at 0x9af3e4c>' is already attached to session '148848780' (this is '159831148')
What is the right way to handle this situation? Is it possible to detach the object from the first session, or to clone the object without the ORM-related state?
This is a bad idea (tm).
You shouldn't share a stateful object between processes like this (I know it's tempting), because all kinds of bad things can happen: lock primitives are not intended to work across multiple Python runtimes.
I suggest taking the attributes you need out of that object, jamming them into a dict, and sending it across processes using multiprocessing Pipes:
http://docs.python.org/library/multiprocessing.html#pipes-and-queues
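A minimal sketch of that idea (the attribute names on the queried object are made up for illustration; `field` stands for the object returned by the query in the parent process):
from multiprocessing import Process, Pipe

def worker(child_conn):
    # The worker receives a plain dict -- no Session or ORM state attached.
    field_data = child_conn.recv()
    print(field_data["name"])
    child_conn.close()

parent_conn, child_conn = Pipe()
p = Process(target=worker, args=(child_conn,))
p.start()

# Only plain attribute values are copied out of `field` and sent across.
parent_conn.send({"id": field.id, "name": field.name})
p.join()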
Related
I am currently creating a library that records backend calls, like the ones made to the boto3 and requests libraries, and then populates a global "data" object based on information such as the status code of the responses.
I originally had the data object as a true global, but then I realized this was a bad idea: when the application is run in parallel, the data object is modified simultaneously (which could corrupt it), whereas I want to keep this object separate for each invocation of my application.
So I looked into Flask context locals, similar to how Flask handles its global "request" object. I managed to implement the same approach using LocalProxy, and it now works fine with parallel requests to my application. The issue now is that whenever the application spawns a new sub-thread, that thread gets an entirely new context, so I can't retrieve the data object from its parent thread for that request. Basically, I need the sub-threads to copy and modify the same data object that is local to the main thread for that particular application request.
To clarify, I was able to do this when data was a true "global" object: multiple sub-threads could modify the same object. However, that did not handle simultaneous requests to the application, as I mentioned. I managed to fix that, but now the sub-threads are no longer able to modify the same data object *sad face*.
I looked at solutions like the one below, but they did not help me because the decorator approach only works for "local" functions. The functions I would need to decorate are "global" ones like requests.request that threads across different application requests will use, so I think I need another approach where I can temporarily copy the parent thread's context into the sub-threads (and, as I understand it, not overwrite or decorate the function itself, since it is shared by simultaneous requests to the application). I would appreciate any help or ideas on how to make this work for my use case.
Thanks.
Flask throwing 'working outside of request context' when starting sub thread
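For reference, the LocalProxy-style setup described above looks roughly like this (a minimal sketch using Werkzeug's Local; the names are illustrative):
from werkzeug.local import Local, LocalProxy

_local = Local()

def _get_data():
    # Each request context/thread sees its own "data" object, or None if unset.
    return getattr(_local, "data", None)

data = LocalProxy(_get_data)

# At the start of each request the application would do something like:
#     _local.data = {}
# Sub-threads spawned later get a fresh context, which is exactly the
# problem described above: they no longer see the parent's _local.data.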
I am trying to use memcached with Google App Engine. I import the library using
from google.appengine.api import memcache
and then call it using
posts = memcache.gets("posts")
Then I get the following error:
AttributeError: 'module' object has no attribute 'gets'
I have looked through the Google App Engine documentation regarding memcache, but I can't find any examples using memcache.gets(). The documentation seems to use memcache.get() the way I am calling gets above.
gets is a method of the memcache client object, not a module-level function of memcache. The module-level functions are quite simple, stateless, and synchronous; using the client object, you can do more advanced stuff, if you have to, as documented at https://cloud.google.com/appengine/docs/python/memcache/clientclass .
Specifically, per the docs at https://cloud.google.com/appengine/docs/python/memcache/clientclass#Client_gets , you use gets rather than get "if you want to avoid conditions in which two or more callers are trying to modify the same key value at the same time, leading to undesired overwrites": gets also fetches (and stashes in the client object) the cas_id, which lets you use the cas (compare-and-set) call without having to handle the cas_id yourself.
Since it doesn't seem you're attempting a compare-and-set operation, I would recommend using the simpler module-level function get, rather than instantiating a client object and using its instance method gets.
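In other words, for a plain read the module-level call is all you need:
from google.appengine.api import memcache

posts = memcache.get("posts")  # module-level get; no Client object needed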
If you actually do need to compare and set, a very good explanation can be found here:
The Client object is required because the gets() operation actually squirrels away some hidden information that is used by the subsequent cas() operation. Because the memcache functions are stateless (meaning they don't alter any global values), these operations are only available as methods on the Client object, not as functions in the memcache module. (Apart from these two, the methods on the Client object are exactly the same as the functions in the module, as you can tell by comparing the documentation.)
The solution would be to use the class:
client = memcache.Client()
posts = client.gets("posts")
...
client.cas("posts", "new_value")
Although, of course, you would need more than that for cas to be useful.
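A sketch of what that might look like, with a simple retry loop (the value handling here is illustrative and not part of the original answer):
client = memcache.Client()
while True:
    posts = client.gets("posts")        # also stashes the cas_id in the client
    if posts is None:
        break                           # key missing; nothing to update
    updated = posts + ["new post"]      # assumes posts is a list, purely for illustration
    if client.cas("posts", updated):    # succeeds only if "posts" was not changed in between
        break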
I want to write a custom OutputWriter for GAE's Mapreduce framework. This OutputWriter should open a direct TCP connection to an open MongoDB port and write the results of the reduce step directly to this database.
I'm using pymongo to interact with MongoDB. The existing Mapreduce library requires output writers to be JSON serializable. Once the output writer has established a connection with the MongoDB instance like so:
from pymongo import Connection
conn = Connection(host=MONGODB_HOST, port=MONGODB_PORT)
db = conn.test_db
db.authenticate(MONGODB_USERNAME, MONGODB_PASSWD)
I'd like to serialize either the connection (of type pymongo.connection.Connection) or db itself (a pymongo.database.Database). Naturally, those objects aren't JSON serializable, so I thought I could just make a JSON dict with a pickled database inside, but it seems that pymongo doesn't natively support pickling these objects, i.e. neither has a __getstate__ method.
I assume I could simply store the connection and authentication parameters, and reopen a connection when the OutputWriter is deserialized, but that seems overly hacky and time and resource intensive.
Can someone point me to a workaround, or perhaps a different kind of serialization I haven't thought of?
I assume I could simply store the connection and authentication parameters, and reopen a connection when the OutputWriter is deserialized, but that seems overly hacky and time and resource intensive.
What else would you expect to be able to do? A database connection is, in general, a wrapper around objects that live outside of Python (sockets, file handles, instances of opaque objects created by a C library, etc.), so there's no way to just store one and restore it in a later instance of the process, pass it to a different process, etc. So, any general-purpose serialization for a class like this would have to work by storing the connection parameters and re-connecting.
But there are many cases where you wouldn't want to do that. (Also, remember that making something pickleable also makes it copyable, and it's far from clear that you'd always want to copy a database connection by opening a new distinct but equivalent connection.) Which is why most database connection objects and similar things are not pickleable.
Meanwhile, if you're trying to pass these around within a process, while the connection is still alive… then you shouldn't be pickling them in the first place, just pass references to the connection around.
So anyway, I'd suggest you do exactly what you suggested but didn't want to do, but wrap it up by subclassing (or monkeypatching) the two classes so they can be pickled directly, instead of passing a bunch of separate parameters around and making everyone else have to know how to deal with it.
I don't think __getstate__ will work here. That would imply that you can make a database connection by default-constructing the instance and then setting attributes or calling methods after the fact, but most database connection classes require you to pass arguments into the constructor call to be used at __new__ or __init__ time. You could probably do this with just __getnewargs__ (which is actually even simpler than __getstate__), however. If not, you'll need the more complex __reduce__ mechanism.
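A minimal sketch of that approach via __reduce__, assuming the old pymongo Connection class from the question (modern pymongo uses MongoClient instead; the stored parameters are illustrative, and re-authentication would still have to be replayed separately):
from pymongo import Connection

class PicklableConnection(Connection):
    def __init__(self, host, port):
        # Remember the constructor arguments so we can reconnect on unpickle.
        self._conn_args = (host, port)
        super(PicklableConnection, self).__init__(host=host, port=port)

    def __reduce__(self):
        # Unpickling calls PicklableConnection(*self._conn_args), i.e. it
        # opens a new, equivalent connection in the new process.
        return (PicklableConnection, self._conn_args)
A Database handle can then be rebuilt from the unpickled connection plus the database name, and db.authenticate() replayed with stored credentials; it is still the "store the parameters and reconnect" approach, just hidden behind the pickle protocol so callers don't have to know about it.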
I have a question about SQLAlchemy and object refreshing.
I am in a situation where I have two sessions, and the same object has been queried in both of them. For particular reasons, I cannot close one of the sessions.
I have modified the object and committed the changes in session A, but in session B the attributes are still the initial ones, without the modifications.
Should I implement a notification system to communicate changes, or is there a built-in way to do this in SQLAlchemy?
Sessions are designed to work like this. The object in Session B will keep the attribute values it had when it was first queried in Session B. SQLAlchemy will not attempt to automatically refresh objects in other sessions when they change, nor do I think it would be wise to try to build something like that.
You should be actively thinking of the lifespan of each session as a single transaction in the database. How and when sessions need to deal with the fact that their objects might be stale is not a technical problem that can be solved by an algorithm built into SQLAlchemy (or any extension for SQLAlchemy): it is a "business" problem whose solution you must determine and code yourself. The "correct" response might be to say that this isn't a problem: the logic that occurs with Session B could be valid if it used the data at the time that Session B started. Your "problem" might not actually be a problem. The docs actually have an entire section on when to use sessions, but it gives a pretty grim response if you are hoping for a one-size-fits-all solution...
A Session is typically constructed at the beginning of a logical operation where database access is potentially anticipated.
The Session, whenever it is used to talk to the database, begins a database transaction as soon as it starts communicating. Assuming the autocommit flag is left at its recommended default of False, this transaction remains in progress until the Session is rolled back, committed, or closed. The Session will begin a new transaction if it is used again, subsequent to the previous transaction ending; from this it follows that the Session is capable of having a lifespan across many transactions, though only one at a time. We refer to these two concepts as transaction scope and session scope.
The implication here is that the SQLAlchemy ORM is encouraging the developer to establish these two scopes in his or her application, including not only when the scopes begin and end, but also the expanse of those scopes, for example should a single Session instance be local to the execution flow within a function or method, should it be a global object used by the entire application, or somewhere in between these two.
The burden placed on the developer to determine this scope is one area where the SQLAlchemy ORM necessarily has a strong opinion about how the database should be used. The unit of work pattern is specifically one of accumulating changes over time and flushing them periodically, keeping in-memory state in sync with what’s known to be present in a local transaction. This pattern is only effective when meaningful transaction scopes are in place.
That said, there are a few things you can do to change how the situation works:
First, you can reduce how long your session stays open. Session B queries the object, and then later you do something with that object (in the same session) for which you want the attributes to be up to date. One solution is to perform this second operation in a separate session.
Another is to use the expire/refresh methods, as the docs show...
# immediately re-load attributes on obj1, obj2
session.refresh(obj1)
session.refresh(obj2)
# expire objects obj1, obj2, attributes will be reloaded
# on the next access:
session.expire(obj1)
session.expire(obj2)
You can use session.refresh() to immediately get an up-to-date version of the object, even if the session already queried the object earlier.
Run this to force the session to fetch the latest values from your database of choice on the next access:
session.expire_all()
Excellent DOC about default behavior and lifespan of session
I just had this issue and the existing solutions didn't work for me for some reason. What did work was to call session.commit(). After calling that, the object had the updated values from the database.
TL;DR Rather than working on Session synchronization, see if your task can be reasonably easily coded with SQLAlchemy Core syntax, directly on the Engine, without the use of (multiple) Sessions
For someone coming from SQL and JDBC experience, one critical thing to learn about SQLAlchemy, which unfortunately I didn't clearly come across in months of reading through the various documents, is that SQLAlchemy consists of two fundamentally different parts: the Core and the ORM. Because the ORM documentation is listed first on the website and most examples use ORM-like syntax, one gets thrown into working with it and sets themselves up for errors and confusion when thinking about the ORM in terms of SQL/JDBC. The ORM uses its own abstraction layer that takes complete control over how and when actual SQL statements are executed. The rule of thumb is that a Session is cheap to create and kill, and it should never be re-used for anything in the program's flow and logic that may involve re-querying, synchronization or multi-threading. The Core, on the other hand, is direct no-frills SQL, very much like a JDBC driver. There is one place in the docs I found that "suggests" using Core over ORM:
it is encouraged that simple SQL operations take place here, directly on the Connection, such as incrementing counters or inserting extra rows within log tables. When dealing with the Connection, it is expected that Core-level SQL operations will be used; e.g. those described in SQL Expression Language Tutorial.
That said, it appears that using a Connection causes the same side effect as using a Session: re-querying a specific record returns the same result as the first query, even if the record's content in the DB has changed. So Connections are apparently as "unreliable" as Sessions for reading DB content in "real time", but direct Engine execution seems to work fine, as it picks a Connection object from the pool (assuming that the retrieved Connection is never in the same "reuse" state relative to the query as the specific open Connection). The Result object should be closed explicitly, as per the SA docs.
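For example, a Core-level read executed directly on the Engine might look like this (a sketch assuming SQLAlchemy 1.x, where Engine.execute() is still available; the connection string, table, and column names are made up):
from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:password@localhost/mydb")

# Each call checks a Connection out of the pool, runs the statement,
# and returns the connection to the pool when the result is closed.
result = engine.execute(text("SELECT status FROM job WHERE id = :id"), id=42)
row = result.fetchone()
result.close()  # close the Result explicitly, as the SA docs recommend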
What is your isolation level set to?
SHOW GLOBAL VARIABLES LIKE 'transaction_isolation';
By default, MySQL InnoDB's transaction_isolation is set to REPEATABLE-READ.
+-----------------------+-----------------+
| Variable_name | Value |
+-----------------------+-----------------+
| transaction_isolation | REPEATABLE-READ |
+-----------------------+-----------------+
Consider setting it to READ-COMMITTED.
You can set this for your SQLAlchemy engine only, via:
create_engine("mysql://<connection_string>", isolation_level="READ COMMITTED")
I think another option is:
engine = create_engine("mysql://<connection_string>")
engine = engine.execution_options(isolation_level="READ COMMITTED")  # returns a new Engine with the option applied
Or set it globally in the DB via:
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
and
https://docs.sqlalchemy.org/en/14/orm/session_transaction.html#setting-transaction-isolation-levels-dbapi-autocommit
If you have added an incorrect model to the session, you can do:
db.session.rollback()
There is a certain page on my website where I want to prevent the same user from visiting it twice in a row. To prevent this, I plan to create a Lock object (from Python's threading library). However, I would need to store that across sessions. Is there anything I should watch out for when trying to store a Lock object in a session (specifically a Beaker session)?
Storing a threading.Lock instance in a session (or anywhere else that needs serialization) is a terrible idea, and presumably you'll get an exception if you try to (since such an object cannot be serialized, e.g., it cannot be pickled). A traditional approach for cooperative serialization of processes relies on file locking (on "artificial" files, e.g. in a directory such as /tmp/locks/<username> if you want the locking to be per-user, as you indicate). I believe the Wikipedia entry does a good job of describing the general area; if you tell us what OS you're running under, we might suggest something more specific (unfortunately I do not believe there is a cross-platform solution for this).
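A minimal POSIX-only sketch of that idea (the directory layout and helper name are illustrative, not from the answer above):
import fcntl
import os

def try_acquire_page_lock(username, lock_dir="/tmp/locks"):
    # One lock file per user; holding the flock means "this user is on the page".
    if not os.path.isdir(lock_dir):
        os.makedirs(lock_dir)
    lock_file = open(os.path.join(lock_dir, username), "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking
    except IOError:
        lock_file.close()
        return None        # another request for this user already holds the lock
    return lock_file       # keep it open; closing the file releases the lock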
I just realized that this was a terrible question, since locking a lock and saving it to the session takes two steps, thus defeating the purpose of the lock's atomic actions.