I wrote an API that was directly using psycopg2 to interface with a PostgreSQL database but decided to rewrite it to use the SQLAlchemy ORM. For the most part I'm very happy with the transition, but there are a few trickier things I did that have been tough to translate. I was able to build a query to do what I wanted, but I'd much rather handle it with a hybrid property/hybrid method or perhaps a custom Comparator (I tried both but couldn't get them to work). I'm fairly new to the SQLAlchemy ORM, so I'd be happy to explore all options, but I'd prefer something in the database model rather than in the API code.
For background, there are several loosely coupled API consumers that have a few mandatory identifying columns that belong to the API and then a LargeBinary column that they can basically do whatever they want with. In the code below, the API consumers need to be able to select messages based on their own identifiers that are not parsed by the API (since each consumer is likely different).
Old code:
select_sql = sql.SQL("""SELECT {field1}
                        FROM {table}
                        WHERE {field2}={f2}
                        AND {field3}={f3}
                        AND {field4}={f4}
                        AND left(encode({field5}, 'hex'), {numChar})={selectBytes} -- relevant clause
                        AND {field6}=false;
                     """
                     ).format(field1=sql.Identifier("key"),
                              field2=sql.Identifier("f2"),
                              field3=sql.Identifier("f3"),
                              field4=sql.Identifier("f4"),
                              field5=sql.Identifier("f5"),
                              field6=sql.Identifier("f6"),
                              numChar=sql.Literal(len(data['bytes'])),
                              table=sql.Identifier("incoming"),
                              f2=sql.Literal(data['2']),
                              f3=sql.Literal(data['3']),
                              f4=sql.Literal(data['4']),
                              selectBytes=sql.Literal(data['bytes']))
try:
    cur = incoming_conn.cursor()
    cur.execute(select_sql)
    keys = [x[0] for x in cur.fetchall()]
    cur.close()
    return keys, 200
except psycopg2.DatabaseError as error:
    logging.error(error)
    incoming_conn.reset()
    return "Error reading from DB", 500
New code:
try:
    session = Session()
    messages = (
        session.query(IncomingMessage)
        .filter_by(deleted=False)
        .filter_by(f2=data['2'])
        .filter_by(f3=data['3'])
        .filter_by(f4=data['4'])
        .filter(func.left(func.encode(IncomingMessage.payload, 'hex'),  # relevant clause
                          len(data['bytes'])) == data['bytes'])
    )
    keys = [x.key for x in messages]
    session.close()
    return keys, 200
except exc.OperationalError as error:
    logging.error(error)
    session.close()
    return "Database failure", 500
The problem that I kept running into was how to limit the number of stored bytes being compared. I don't think writing the Comparator itself is really the problem, but I suspect there would be a performance cost if I were loading several megabytes just to compare the first eight or so bytes.
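For reference, a hybrid method along these lines is one way I could see keeping the clause in the model rather than in the API code. This is only a sketch: it assumes the names used in the query above (IncomingMessage, payload, key, deleted) and that the prefix arrives as a lowercase hex string like data['bytes'].

from sqlalchemy import Boolean, Column, Integer, LargeBinary, func
from sqlalchemy.ext.hybrid import hybrid_method
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class IncomingMessage(Base):
    __tablename__ = 'incoming'

    key = Column(Integer, primary_key=True)
    payload = Column(LargeBinary)
    deleted = Column(Boolean, default=False)
    # ... f2, f3, f4 and the other identifying columns go here ...

    @hybrid_method
    def payload_prefix_matches(self, hex_prefix):
        # Python side (called on a loaded instance): compare against the
        # hex encoding of the bytes already in memory.
        return self.payload.hex().startswith(hex_prefix)

    @payload_prefix_matches.expression
    def payload_prefix_matches(cls, hex_prefix):
        # SQL side (used in a query): PostgreSQL compares only the first
        # len(hex_prefix) hex characters, so the full payload never has
        # to be loaded into Python for the comparison.
        return func.left(func.encode(cls.payload, 'hex'),
                         len(hex_prefix)) == hex_prefix

The filter in the new code would then read .filter(IncomingMessage.payload_prefix_matches(data['bytes'])), and the truncation still happens inside PostgreSQL.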
We have a Pyramid web application.
We use SQLAlchemy 1.4 with Zope transactions.
In our application, it is possible for an error to occur during flush as described here, which causes any subsequent usage of the SQLAlchemy session to throw a PendingRollbackError. The error which occurs during a flush is unintentional (a bug) and is raised to our exception handling view... which tries to use data from the SQLAlchemy session, which then throws a PendingRollbackError.
Is it possible to "recover" from a PendingRollbackError if you have not framed your transaction management correctly? The SQLAlchemy documentation says that to avoid this situation you essentially "just need to do things the right way". Unfortunately, this is a large codebase, and developers don't always follow correct transaction management. This issue is also complicated if savepoints/nested transactions are used.
def some_view():
    # constraint violation
    session.add_all([Foo(id=1), Foo(id=1)])
    session.commit()  # Error is raised during flush
    return {'data': 'some data'}
def exception_handling_view():  # Wired in via pyramid framework, error ^ enters here
    session.query(...)  # does a query to get some data -- this throws a PendingRollbackError
I am wondering if we can do something like the below, but don't understand pyramid + SQLAlchemy + Zope transactions well enough to know the implications (when considering the potential for nested transactions etc).
def exception_handling_view():  # Wired in via pyramid framework, error ^ enters here
    def _query():
        session.query(...)  # does a query to get some data

    try:
        _query()
    except PendingRollbackError:
        session.rollback()
        _query()
Instead of trying to execute your query, just try to get the connection:
def exception_handling_view():
    try:
        _ = session.connection()
    except PendingRollbackError:
        session.rollback()
    session.query(...)
session.rollback() only rolls back the innermost transaction, as is usually expected — assuming nested transactions are used intentionally via the explicit session.begin_nested().
You don't have to rollback parent transactions, but if you decide to do that, you can:
while session.registry().in_transaction():
    session.rollback()
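To make the savepoint caveat concrete, here is a small sketch in the spirit of the standard SQLAlchemy savepoint example. It assumes SQLAlchemy 1.4, an engine, and the mapped Foo class from the question:

from sqlalchemy.orm import Session

session = Session(engine)

session.add(Foo(id=1))

session.begin_nested()   # establish a SAVEPOINT
session.add(Foo(id=2))
session.rollback()       # rolls back to the SAVEPOINT: only Foo(id=2) is discarded

session.commit()         # the outer transaction survived, so Foo(id=1) is committed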
Is there a simple example of working with the neo4j python driver?
How do I just pass cypher query to the driver to run and return a cursor?
If I'm reading, for example, this, it seems the demo has a class wrapper with a private member function that I pass to session.write_transaction,
session.write_transaction(self._create_and_return_greeting, ...
That then gets called with a transaction as its first parameter...
def _create_and_return_greeting(tx, message):
which in turn runs the Cypher
result = tx.run("CREATE (a:Greeting) "
This seems 10X more complicated than it needs to be.
I did just try something simpler:
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        try:
            result = session.run(query, **kwargs)
            return result.data()
        except neo4j.exceptions.CypherSyntaxError:
            raise
But this results in a socket error on the query, probably because the session goes out of scope?
[dfcx/__init__] ERROR | Underlying socket connection gone (_ssl.c:2396)
[dfcx/__init__] ERROR | Failed to write data to connection IPv4Address(('neo4j-core-8afc8558-3.production-orch-0042.neo4j.io', 7687)) (IPv4Address(('34.82.120.138', 7687)))
Also I can't return a cursor/iterator, just the data()
When the session goes out of scope, the query result seems to die with it.
If I manually open and close a session, then I'd have the same problems?
Python must be the most popular language this DB is used with. Does everyone use a different driver?
Py2neo seems cute, but it's completely lacking ORM wrapper functions for most of the Cypher language features, so you have to drop down to raw Cypher anyway. And I'm not sure it supports **kwargs argument interpolation in the same way.
I guess that big raise should help iron out some kinks :D
Slightly longer version trying to get a working DB wrapper:
def neo_connect() -> Union[neo4j.BoltDriver, neo4j.Neo4jDriver]:
    global raw_driver
    if raw_driver:
        # print('reuse driver')
        return raw_driver
    neoconfig = NEOCONFIG
    raw_driver = neo4j.GraphDatabase.driver(
        neoconfig['url'], auth=(
            neoconfig['user'], neoconfig['pass']))
    if raw_driver is None:
        raise BaseException("cannot connect to neo4j")
    else:
        return raw_driver
def raw_query(query, **kwargs):
    # just get data, no cursor
    neodriver = neo_connect()
    session = neodriver.session()
    # logging.info('neoquery %s', query)
    # with neodriver.session() as session:
    try:
        result = session.run(query, **kwargs)
        data = result.data()
        return data
    except neo4j.exceptions.CypherSyntaxError as err:
        logging.error('neo error %s', err)
        logging.error('failed query: %s', query)
        raise err
    # finally:
    #     logging.info('close session')
    #     session.close()
Update: someone pointed me to this example, which is another way to use the tx wrapper.
https://github.com/neo4j-graph-examples/northwind/blob/main/code/python/example.py#L16-L21
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        try:
            result = session.run(query, **kwargs)
            return result.data()
        except neo4j.exceptions.CypherSyntaxError:
            raise
This is perfectly fine and works as intended on my end.
The error you're seeing states that there is a connection problem, so there must be something going on between the server and the driver that's outside the driver's control.
Also, please note that there is a difference between these ways of running a query:
with driver.session() as session:
    result = session.run("<SOME CYPHER>")
def work(tx):
    result = tx.run("<SOME CYPHER>")

with driver.session() as session:
    session.write_transaction(work)
The latter one might be 3 lines longer, and the team working on the drivers has collected some feedback regarding this. However, there are more things to consider here. Firstly, changing the API surface is something that needs careful planning and cannot be done in, say, a patch release. Secondly, there are technical hurdles to overcome. Here are the semantics, anyway:
Auto-commit transaction. Runs only that query as one unit of work.
If you run a new auto-commit transaction within the same session, the previous result will buffer all available records for you (depending on the query, this will consume a lot of memory). This can be avoided by calling result.consume(). However, if the session goes out of scope, the result will be consumed automatically. This means you cannot extract further records from it. Lastly, any error will be raised and needs handling in the application code.
Managed transaction. Runs whatever unit of work you want within that function. A transaction is implicitly started and committed (unless you rollback explicitly) around the function.
If the transaction ends (end of function or rollback), the result will be consumed and become invalid. You'll have to extract all records you need before that.
This is the recommended way of using the driver because it will not raise all errors but handle some internally (where appropriate) and retry the work function (e.g. if the server is only temporarily unavailable). Since the function might be executed multiple times, you must make sure it's idempotent.
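To illustrate the managed-transaction style, here is a minimal sketch. It assumes a 4.x driver object such as the one returned by neo_connect() above, and the helper name is made up; note that the records are materialized inside the work function, before the transaction ends:

def run_read_query(driver, query, **kwargs):
    # Work function: executed (and possibly retried) inside a managed
    # transaction, so it must be idempotent. Extract everything you need
    # before returning, because the result is consumed when the
    # transaction ends.
    def work(tx):
        result = tx.run(query, **kwargs)
        return result.data()  # materialize the records here

    with driver.session() as session:
        # read_transaction/write_transaction handle retries for
        # transient errors such as a briefly unavailable server.
        return session.read_transaction(work)

# rows = run_read_query(neo_connect(), "MATCH (n) RETURN n LIMIT $n", n=5)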
Closing thoughts:
Please remember that Stack Overflow is monitored on a best-effort basis, and what can be perceived as hasty comments may get in the way of getting helpful answers to your questions.
I was hoping someone could explain to me how to use offsets or cursors in App Engine. I'm using gcloud to remote access entities for a huge data migration, and would like to grab data in batches of 100.
I'm guessing there is a very simple way to do this, but the documentation doesn't dive into cursors all too much. Here is what I have so far:
client = datastore.Client(dataset_id=projectID)

# retrieve 100 Articles
query = client.query(kind='Article').fetch(100)

for article in query:
    print article
How could I mark the end of that batch of 100 and then move into the next one? Thanks so much!
Edit:
I should mention that I do not have access to the app engine environment, which is why I'm a bit lost at the moment... :(
I don't have any experience with gcloud, but I don't think this should be too different.
When you query, you will use fetch_page instead of the fetch function. The fetch_page function returns three things (results, cursor, more). The cursor is your bookmark for the query, and more is true if there are probably more results.
Once you've handled your 100 entities, you can pass on the cursor in urlsafe form to your request handler's URI, where you will continue the process starting at the new cursor.
from google.appengine.datastore.datastore_query import Cursor

class printArticles(webapp2.RequestHandler):
    def post(self):
        query = Client.query()

        # Retrieve the cursor.
        curs = Cursor(urlsafe=self.request.get('cursor'))

        # fetch_page returns three things
        articles, next_curs, more = query.fetch_page(100, start_cursor=curs)

        # do whatever you need to do
        for article in articles:
            print article

        # If there are more results to fetch
        if more == True and next_curs is not None:
            # then pass along the cursor
            self.redirect("/job_URI?cursor=" + next_curs.urlsafe())
We have around 1,500 SQLite DBs; each has 0 to 20,000,000 records in its violation table, and the total number of violation records is around 90,000,000.
We generate each file by running a crawler on the 1,500 servers. Along with the violation table we have some other tables too, which we use for further analysis.
To analyze the results, we push all these SQLite violation records into a Postgres violation table, along with other insertions and calculations.
Following is the code I use to transfer records:
class PolicyViolationService(object):
    def __init__(self, pg_dao, crawler_dao_s):
        self.pg_dao = pg_dao
        self.crawler_dao_s = crawler_dao_s
        self.user_violation_count = defaultdict(int)
        self.analyzer_time_id = self.pg_dao.get_latest_analyzer_tracker()

    def process(self):
        """
        transfer policy violation record from crawler db to analyzer db
        """
        for crawler_dao in self.crawler_dao_s:
            violations = self.get_violations(crawler_dao.get_violations())
            self.pg_dao.insert_rows(violations)

    def get_violations(self, violation_records):
        for violation in violation_records:
            violation = dict(violation.items())
            violation.pop('id')
            self.user_violation_count[violation.get('user_id')] += 1
            violation['analyzer_time_id'] = self.analyzer_time_id
            yield PolicyViolation(**violation)
in sqlite dao
=============
def get_violations(self):
    result_set = self.db.execute('select * from policyviolations;')
    return result_set

in pg dao
=========
def insert_rows(self, rows):
    self.session.add_all(rows)
    self.session.commit()
This code works but takes a very long time. What is the right way to approach this problem? We have been discussing parallel processing, skipping SQLAlchemy, and some other options. Please suggest the right way.
Thanks in advance!
The fastest way to get these to PostgreSQL is to use the COPY command, outside any SQLAlchemy.
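As a rough sketch of that idea with psycopg2's copy_expert (the column list and the helper are made up for illustration; only the violation table name comes from the question):

import csv
import io

def copy_violations(pg_conn, rows, columns):
    # Stream the rows through an in-memory CSV buffer into COPY.
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[col] for col in columns])
    buf.seek(0)

    with pg_conn.cursor() as cur:
        cur.copy_expert(
            "COPY violation ({}) FROM STDIN WITH (FORMAT csv)".format(
                ", ".join(columns)),
            buf)
    pg_conn.commit()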
Within SQLAlchemy, one must note that the ORM is very slow. It is doubly slow if you have lots of stuff in the ORM that you then flush. You could make it faster by doing flushes after 1000 items or so; that would also make sure the session does not grow too big. However, why not just use SQLAlchemy Core to generate inserts:
ins = violations.insert().values(col1='value', col2='value')
conn.execute(ins)
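Building on that, here is a sketch of feeding the records through Core in chunks; the helper name and chunk size are illustrative, and it assumes get_violations() is changed to yield plain dicts rather than PolicyViolation objects:

from itertools import islice

def insert_in_chunks(engine, violations_table, violation_dicts, chunk_size=1000):
    # Core "executemany": one multi-row INSERT per chunk instead of one
    # ORM object per row, which avoids most of the unit-of-work overhead.
    it = iter(violation_dicts)
    with engine.begin() as conn:  # single transaction, commits on success
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            conn.execute(violations_table.insert(), chunk)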
I have a table in a database, mapped with the SQLAlchemy ORM module (I have a scoped_session variable).
I want multiple instances of my program (not just threads, but also from several servers) to be able to work on the same table and NOT work on the same data.
So I have coded a manual "row-lock" mechanism to make sure each row is handled by only one instance. In this method I take a full lock on the table while I "row-lock" it:
def instance():
    s = scoped_session(sessionmaker(bind=engine))
    engine.execute("LOCK TABLES my_data WRITE")
    rows = s.query(Row_model).filter(Row_model.condition == 1).filter(Row_model.is_locked == 0).limit(10)
    for row in rows:
        row.is_locked = 1
        row.lock_time = datetime.now()
        s.commit()
    engine.execute("UNLOCK TABLES")
    for row in rows:
        manipulate_data(row)
        row.is_locked = 0
        s.commit()

for i in range(10):
    t = threading.Thread(target=instance)
    t.start()
The problem is that while running some instances, several threads crash and each produces this error:
sqlalchemy.exc.DatabaseError: (raised as a result of Query-invoked
autoflush; consider using a session.no_autoflush block if this flush
is occurring prematurely) (DatabaseError) 1205 (HY000): Lock wait
timeout exceeded; try restarting transaction 'UPDATE my_daya SET
row_var = 1}
Where is the catch? What makes my DB table not unlock successfully?
Thanks.
Locks are evil. Avoid them. Things go very wrong when errors occur, especially when you mix sessions with raw SQL statements, as you do.
The beauty of the scoped session is that it wraps a database transaction. This transaction makes the modifications to the database atomic, and also takes care of cleaning up when things go wrong.
Use scoped sessions as follows:
with scoped_session(sessionmaker(bind=engine)) as s:
    <ORM actions using s>
It may be some work to rewrite your code so that it becomes properly transactional, but it will be worth it! SQLAlchemy has tricks to help you with that.
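As one illustration of that last point, here is a sketch of the row-claiming step done as a single transaction with SELECT ... FOR UPDATE instead of LOCK TABLES and an is_locked flag. The use of skip_locked, the session handling, and dropping the manual bookkeeping are assumptions on my part, and SKIP LOCKED requires a backend that supports it (PostgreSQL 9.5+ or MySQL 8.0+):

from contextlib import closing

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine)

def instance():
    with closing(Session()) as s:
        # Claim up to 10 matching rows inside one transaction.
        # with_for_update(skip_locked=True) makes concurrent workers skip
        # rows that another transaction has already locked, so no manual
        # LOCK TABLES or is_locked flag is needed.
        rows = (s.query(Row_model)
                 .filter(Row_model.condition == 1)
                 .with_for_update(skip_locked=True)
                 .limit(10)
                 .all())
        for row in rows:
            manipulate_data(row)
        s.commit()  # releases the row locks; closing() rolls back on errors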