How to get PostgresHook Airflow Dict Cursor - python

PostgresHook in Airflow has a function get_records that returns the result of a query.
The result is a tuple object. How can we receive the result in a dict form?
Also, since the code referenced here mentions that we can use a dict cursor, how do we do that?

The directions given in your link are technically accurate, but unfortunately not very clear as to what "connection" you should be working with.
Modify the relevant Connection in the Airflow web interface's Connections manager so that its Extra field contains the following JSON: {"cursor": "dictcursor"}.
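A minimal sketch of what that looks like end to end, assuming a hypothetical connection id my_postgres and a users table (the import path is the newer provider one; older Airflow versions use airflow.hooks.postgres_hook instead):
from airflow.providers.postgres.hooks.postgres import PostgresHook

# The Connection "my_postgres" has Extra set to {"cursor": "dictcursor"},
# so the hook opens its psycopg2 connection with a DictCursor factory.
hook = PostgresHook(postgres_conn_id="my_postgres")
rows = hook.get_records("SELECT id, name FROM users LIMIT 5")

for row in rows:
    # Each row now supports key access in addition to positional access.
    print(row["id"], row["name"])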

Related

SQLAlchemy: use related object when session is closed

I have many models with relational links to each other which I have to use. My code is very complicated so I cannot keep session alive after a query. Instead, I try to preload all the objects:
def db_get_structure():
    with Session(my_engine) as session:
        deps = {x.id: x for x in session.query(Department).all()}
        ...
    return (deps, ...)

def some_logic(id):
    struct = db_get_structure()
    return some_other_logic(struct.deps[id].owner)
However, I get the following error anyway regardless of the fact that all the objects are already loaded:
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <Department at 0x10476e780> is not bound to a Session; lazy load operation of attribute 'owner' cannot proceed
Is it possible to link preloaded objects with each other so that the relations still work after the session is closed?
I know about joined queries (.options(joinedload(...))), but this approach leads to more lines of code and a bigger DB request, and I think this should be solvable more simply, because all the objects are already loaded into Python objects.
It's even possible now to reach the related objects via struct.deps[struct.deps[id].owner_id], but I think the ORM should do this and provide the shorter notation struct.deps[id].owner using some "cached load".
Whenever you access an attribute on a DB entity that has not yet been loaded from the DB, SQLAlchemy will issue an implicit SQL statement to the DB to fetch that data. My guess is that this is what happens when you issue struct.deps[struct.deps[id].owner_id].
If the object in question has been removed from the session it is in a "detached" state and SQLAlchemy protects you from accidentally running into inconsistent data. In order to work with that object again it needs to be "re-attached".
I've done this already fairly often with session.merge:
attached_object = new_session.merge(detached_object)
But this will reconcile the object instance with the DB and potentially issue updates to the DB if necessary. The detached_object is taken as "truth".
I believe you can do the reverse (attaching it by reading from the DB instead of writing to it) by using session.refresh(detached_object), but I need to verify this. I'll update the post if I find something.
Both ways have to talk to the DB with at least a select to ensure the data is consistent.
In order to avoid loading, issue session.merge(..., load=False). But this has some very important caveats. Have a look at the docs of session.merge() for details.
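A minimal sketch of the no-load re-attach, assuming my_engine from the question and a detached Department instance dept previously returned by db_get_structure():
from sqlalchemy.orm import Session

with Session(my_engine) as session:
    # load=False re-attaches the instance without emitting a SELECT,
    # so you must be confident the detached state still matches the DB.
    attached_dept = session.merge(dept, load=False)
    owner = attached_dept.owner  # lazy loads work again inside this session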
I will need to read up on the link you added concerning your "complicated code". I would like to understand why you need to throw away your session the way you do. Maybe there is an easier way?

Firestore query takes a too long time to get the value of only one field

Hi, community.
I have a question/issue about firestore query from Firebase.
I have a collection of around 18,000 documents. I would like to get the value of the same single field for some of these documents. I use the Python firestore_v1 library from the google-cloud-python client. So, for example, with len(list_edges) == 250:
[db_firestore.document(f"edges/{edge['id']}").get({"distance"}).to_dict()["distance"] for edge in list_edges]
it takes 30+ seconds to evaluate, whereas the equivalent collection in MongoDB takes no more than 3 seconds, even though it loads the whole document and not just one field:
list(db_mongo["edges"].find({"city_id": {"$eq": city_id}, "id": {"$in": [edge["id"] for edge in list_edges]}}))
Having said that, I thought the solution could be to split the large collection by city_id, so I created a new collection and copied the corresponding documents into it; now the query looks like:
[db_firestore.document(f"edges/7/edges/{edge['id']}").get({"distance"}).to_dict()["distance"] for edge in list_edges]
where 7 is a city_id.
However, it takes the same time. So, maybe the issue is around the .get() method, but I could not find any optimized solution for my case.
Could you help me with this? Thanks!
EDITED
I got an answer from Firestore support. The problem is that I make 250 requests by calling .get() for each document separately. The idea is to get all the data I want in only one request, so I need to modify the query.
Let's assume I have the following DB:
an edges collection with multiple edge_id documents. For each new request, I use a newly generated list of edges I need to fetch.
In MongoDB, I can do it with the $in operator (having edge_id inside the document), but in Firestore, the 'in' operator only accepts up to 10 equality clauses.
So, I need to find out another way to do this.
Any ideas? Thanks!
Firebase recently added support for a limited in operation. See:
The blog post announcing the feature.
The documentation on in and array-contains-any queries.
From the latter:
cities_ref = db.collection(u'cities')
query = cities_ref.where(u'country', u'in', [u'USA', u'Japan'])
A few caveats though:
You can have at most 10 values in the in clause, and you can have only one in (or array-contains-any) clause per query.
I am not sure if you can use this operator to select by ID.
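A minimal sketch of batching around that limit, assuming each edge document stores its id in an id field (as the MongoDB query in the question suggests) and reusing list_edges from above:
from google.cloud import firestore

db_firestore = firestore.Client()
edge_ids = [edge["id"] for edge in list_edges]
distances = {}
for i in range(0, len(edge_ids), 10):
    chunk = edge_ids[i:i + 10]
    # One query per 10 ids instead of one .get() per document.
    for doc in db_firestore.collection("edges").where("id", "in", chunk).stream():
        data = doc.to_dict()
        distances[data["id"]] = data["distance"]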

Is it a bad practice to iterate over a Flask SQLAlchemy object using __dict__?

To build a dynamic user update route in Flask, I iterate over the user Flask SQLAlchemy object using the dunder __dict__:
parameters = ['name']  # Valid parameters after the Regex filter was applied
for parameter in parameters:
    user.__dict__[parameter] = request.form.get(parameter)
I have done this to avoid the usage of ifs. To ensure that only valid parameters are present in parameters, I have applied a Regex pattern that filters the valid parameters received in the request for the user route, and I have documented this aspect in the docstring.
I'm asking whether iterating over a Flask SQLAlchemy object using __dict__ is a bad practice, because if I print user.__dict__ I see all the attributes, even those that aren't in the Regex filter, e.g. password, date created, etc., and those should never be updated by this route.
I have found another approach that gets all columns in SQLAlchemy, but I think in the end it's similar to the approach I'm using...
Note: the implemented route can update specific attributes of the user or all of them, using the same route.
I'd recommend looking into marshmallow-sqlalchemy to manage this sort of thing. I've found that there are very few use-cases where this isn't a problem and __dict__ is the best solution.
Here's an example application using it: How do I produce nested JSON from database query with joins? Using Python / SQLAlchemy
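A minimal sketch of that idea, assuming the application's existing User model (with a name column) and Flask-SQLAlchemy's db.session:
from flask import request
from marshmallow_sqlalchemy import SQLAlchemyAutoSchema

class UserUpdateSchema(SQLAlchemyAutoSchema):
    class Meta:
        model = User                # the app's existing Flask-SQLAlchemy model
        load_instance = True        # deserialize straight into a model instance
        sqla_session = db.session
        fields = ("name",)          # only the attributes this route may update

schema = UserUpdateSchema()
# Unknown or forbidden keys raise a ValidationError instead of silently
# landing in user.__dict__.
user = schema.load(request.form.to_dict(), instance=user, partial=True)
db.session.commit()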

What does a GQL Query Return

I have been working on a project using Google App Engine. I have been setting up users and need to check whether a username is already taken.
I used the following code to try to test whether it is taken or not
usernames = db.GqlQuery('select username from User')
taken = username in usernames
This never caught duplicate usernames. I tried a few variants of this on the GQL query line. I tried using .get() which caused an error because it returned something that wasn't iterable. I also tried putting list() around the request, which returned the same error. I tried writing the value of usernames but never got any response. If it returns a query instance, then is there any way to turn it into a list or tuple?
For starters, you should revisit the docs: https://cloud.google.com/appengine/docs/python/datastore/gqlqueryclass?hl=en
db.GqlQuery('select username from User') calls a constructor, not a function, so it returns an instance of a GqlQuery object. See the docs referred to above.
Secondly, what you are doing will never work reliably due to eventual consistency. Please read https://cloud.google.com/appengine/docs/python/datastore/structuring_for_strong_consistency to understand why.
Lastly, since you are starting out with App Engine, move away from db and use ndb unless you have a significant existing code base.
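A minimal sketch of the strongly consistent approach from that second link, assuming a hypothetical ndb User model that uses the username as the entity's key id:
from google.appengine.ext import ndb

class User(ndb.Model):
    email = ndb.StringProperty()

def username_taken(username):
    # A get by key is strongly consistent, unlike the eventually
    # consistent query in the question.
    return ndb.Key(User, username).get() is not None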

Mongodb: Getting upsert result in pymongo

Mongo returns a WriteResult on upserts:
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 1 })
Is there any way I could access those fields from pymongo? I need this because an update always returns None in pymongo, and I want to know if the document I was querying was modified, or even if it exists, without doing an additional query. Can you please tell me how this could be done?
P.S. I know this has been asked before but it was a few years ago and everything I could found on google didn't include an example.
Since we're at it, is there a way to get fields from the document from the result of an upsert? (or at least the _id)
Solved: As Neil Lunn suggests, the Bulk API is the way to go if you want to get more data out of what happened with your updates. I'd just like to point out this quick walkthrough of the API.
The newer MongoDB shell implementations, from MongoDB 2.6 and upwards, actually define their shell helper methods for .update(), .insert(), etc. using the "Bulk operations API" where this is available.
So basically, where the shell is connecting to a MongoDB 2.6 instance or greater, the "Bulk" methods are used "under the hood", even if they are only acting on one document at a time or otherwise effectively issuing only "one" update request or similar.
The general driver interfaces have not yet caught up with this and you need to still invoke explicitly:
bulk = db.test.initialize_ordered_bulk_op()
bulk.find({}).upsert().update({ "$set": { "this": "that" } })
result = bulk.execute()
The "result" returned here matches the "Bulk Write Result" specification you see in the shell, which is different to how the "legacy" implementations which are currently used in the standard driver methods return.
