Fetching 100 Results at a Time in Google App Engine - python

I was hoping someone could explain to me how to use offsets or cursors in App Engine. I'm using gcloud to remote access entities for a huge data migration, and would like to grab data in batches of 100.
I'm guessing there is a very simple way to do this, but the documentation doesn't dive into cursors all too much. Here is what I have so far:
client = datastore.Client(dataset_id=projectID)
# retrieve 100 Articles
query = client.query(kind='Article').fetch(100)
for article in query:
    print(article)
How could I mark the end of that batch of 100 and then move into the next one? Thanks so much!
Edit:
I should mention that I do not have access to the app engine environment, which is why I'm a bit lost at the moment... :(

I don't have any experience with gcloud, but I don't think this should be too different.
When you query, you will use fetch_page instead of the fetch function. The fetch_page function returns three things (results, cursor, more). The cursor is your bookmark for the query, and more is true if there are probably more results.
Once you've handled your 100 entities, you can pass on the cursor in urlsafe form to your request handler's URI, where you will continue the process starting at the new cursor.
import webapp2
from google.appengine.datastore.datastore_query import Cursor

class printArticles(webapp2.RequestHandler):
    def post(self):
        # Article is assumed to be your ndb.Model subclass
        query = Article.query()
        # Retrieve the cursor passed along by the previous request, if any.
        curs = Cursor(urlsafe=self.request.get('cursor'))
        # fetch_page returns three things: results, next cursor, "more" flag
        articles, next_curs, more = query.fetch_page(100, start_cursor=curs)
        # do whatever you need to do
        for article in articles:
            print(article)
        # If there are more results to fetch, pass along the cursor
        if more and next_curs:
            self.redirect("/job_URI?cursor=" + next_curs.urlsafe())
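If you are not inside a request handler, the same cursor idea can be driven from a plain loop. Here is a minimal sketch of that loop. It assumes only that you have some callable returning (results, cursor, more) the way fetch_page does; iter_batches and fake_fetch_page are made-up names, and the fake uses an in-memory list instead of the datastore, so treat it as an illustration rather than the library's API.

```python
def iter_batches(fetch_page, batch_size=100):
    """Yield lists of results, following the cursor until none remain."""
    cursor = None
    while True:
        results, cursor, more = fetch_page(batch_size, start_cursor=cursor)
        if results:
            yield results
        if not more or cursor is None:
            break


# Hypothetical data standing in for the Article entities.
ARTICLES = ["article-%d" % i for i in range(250)]


def fake_fetch_page(page_size, start_cursor=None):
    """In-memory stand-in: the cursor is simply the next list index."""
    start = start_cursor or 0
    end = start + page_size
    more = end < len(ARTICLES)
    return ARTICLES[start:end], end, more


batch_sizes = [len(batch) for batch in iter_batches(fake_fetch_page)]
```

With 250 fake articles the loop produces two full batches and one partial batch, and it stops as soon as the "more" flag turns false.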


SQLAlchemy ORM complex filters

I wrote an API that used psycopg2 directly to interface with a PostgreSQL database, but decided to rewrite it to use the SQLAlchemy ORM. For the most part I'm very happy with the transition, but a few of the trickier things I did have been tough to translate. I was able to build a query to do what I wanted, but I'd much rather handle it with a HybridProperty/HybridMethod or perhaps a custom Comparator (I tried both but couldn't get them to work). I'm fairly new to the SQLAlchemy ORM, so I'd be happy to explore all options, but I'd prefer something in the database model rather than in the API code.
For background, there are several loosely coupled API consumers that have a few mandatory identifying columns that belong to the API and then a LargeBinary column that they can basically do whatever they want with. In the code below, the API consumers need to be able to select messages based on their own identifiers that are not parsed by the API (since each consumer is likely different).
Old code:
select_sql = sql.SQL("""
    SELECT {field1}
    FROM {table}
    WHERE {field2}={f2}
    AND {field3}={f3}
    AND {field4}={f4}
    AND left(encode({field5}, 'hex'), {numChar})={selectBytes} -- Relevant clause
    AND {field6}=false;
    """
).format(field1=sql.Identifier("key"),
         field2=sql.Identifier("f2"),
         field3=sql.Identifier("f3"),
         field4=sql.Identifier("f4"),
         field5=sql.Identifier("f5"),
         field6=sql.Identifier("f6"),
         numChar=sql.Literal(len(data['bytes'])),
         table=sql.Identifier("incoming"),
         f2=sql.Literal(data['2']),
         selectBytes=sql.Literal(data['bytes']),
         f3=sql.Literal(data['3']),
         f4=sql.Literal(data['4']))
try:
    cur = incoming_conn.cursor()
    cur.execute(select_sql)
    keys = [x[0] for x in cur.fetchall()]
    cur.close()
    return keys, 200
except psycopg2.DatabaseError as error:
    logging.error(error)
    incoming_conn.reset()
    return "Error reading from DB", 500
New code:
try:
    session = Session()
    messages = (
        session.query(IncomingMessage)
        .filter_by(deleted=False)
        .filter_by(f2=data['2'])
        .filter_by(f3=data['3'])
        .filter_by(f4=data['4'])
        .filter(func.left(func.encode(IncomingMessage.payload,  # Relevant clause
                                      'hex'),
                          len(data['bytes'])) == data['bytes'])
    )
    keys = [x.key for x in messages]
    session.close()
    return keys, 200
except exc.OperationalError as error:
    logging.error(error)
    session.close()
    return "Database failure", 500
The problem that I kept running into was how to limit the number of stored bytes that were being compared. I don't think it's really a problem in the Comparator, but I feel like there would be a performance cost if I were loading several megabytes just to compare the first eight or so bytes.
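To make the concern concrete, here is a small sketch of doing the prefix comparison inside the database, so the full payload is never sent to the client. It uses SQLite from the standard library as a stand-in for PostgreSQL, and the table and column names (incoming, msg_key, payload, deleted) plus the helper keys_with_prefix are made up for the example. PostgreSQL also has a substring function for bytea, so a similar shape may carry over, but that is untested here.

```python
import sqlite3


def keys_with_prefix(conn, prefix):
    """Return keys of non-deleted rows whose payload starts with `prefix` (bytes)."""
    rows = conn.execute(
        # substr() on a BLOB counts bytes, so only len(prefix) bytes are compared.
        "SELECT msg_key FROM incoming "
        "WHERE substr(payload, 1, ?) = ? AND deleted = 0 "
        "ORDER BY msg_key",
        (len(prefix), prefix),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE incoming (msg_key INTEGER PRIMARY KEY, payload BLOB, deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO incoming (msg_key, payload, deleted) VALUES (?, ?, ?)",
    [
        (1, b"\xde\xad\xbe\xef" + b"\x00" * 1024, 0),  # matches, large payload
        (2, b"\xca\xfe" + b"\x00" * 1024, 0),          # different prefix
        (3, b"\xde\xad\x00\x00", 1),                   # matches but deleted
    ],
)
conn.commit()

matching = keys_with_prefix(conn, b"\xde\xad")
```

Only the key column comes back to Python, so the size of the stored payload does not affect how much data is transferred.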

simple example of working with neo4j python driver?

Is there a simple example of working with the neo4j python driver?
How do I just pass cypher query to the driver to run and return a cursor?
If I'm reading for example this it seems the demo has a class wrapper, with a private member func I pass to the session.write,
session.write_transaction(self._create_and_return_greeting, ...
That then gets called with a transaction as its first parameter...
def _create_and_return_greeting(tx, message):
that in turn runs the cypher
result = tx.run("CREATE (a:Greeting) "
This seems 10X more complicated than it needs to be.
I did just try a simpler:
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        result = session.run(query, **kwargs)
        return result.data()
But this results in a socket error on the query, probably because the session goes out of scope?
[dfcx/__init__] ERROR | Underlying socket connection gone (_ssl.c:2396)
[dfcx/__init__] ERROR | Failed to write data to connection IPv4Address(('neo4j-core-8afc8558-3.production-orch-0042.neo4j.io', 7687)) (IPv4Address(('34.82.120.138', 7687)))
Also I can't return a cursor/iterator, just the data()
When the session goes out of scope, the query result seems to die with it.
If I manually open and close a session, then I'd have the same problems?
Python must be the most popular language this DB is used with, does everyone use a different driver?
Py2neo seems cute, but completely lacking in ORM wrapper function for most of the cypher language features, so you have to drop down to raw cypher anyway. And I'm not sure it supports **kwargs argument interpolation in the same way.
I guess that big raise should help iron out some kinks :D
Slightly longer version trying to get a working DB wrapper:
import logging
from typing import Union

import neo4j

raw_driver = None  # module-level cache for the driver

def neo_connect() -> Union[neo4j.BoltDriver, neo4j.Neo4jDriver]:
    global raw_driver
    if raw_driver:
        # print('reuse driver')
        return raw_driver
    neoconfig = NEOCONFIG
    raw_driver = neo4j.GraphDatabase.driver(
        neoconfig['url'], auth=(
            neoconfig['user'], neoconfig['pass']))
    if raw_driver is None:
        raise BaseException("cannot connect to neo4j")
    else:
        return raw_driver

def raw_query(query, **kwargs):
    # just get data, no cursor
    neodriver = neo_connect()
    session = neodriver.session()
    # logging.info('neoquery %s', query)
    # with neodriver.session() as session:
    try:
        result = session.run(query, **kwargs)
        data = result.data()
        return data
    except neo4j.exceptions.CypherSyntaxError as err:
        logging.error('neo error %s', err)
        logging.error('failed query: %s', query)
        raise err
    # finally:
    #     logging.info('close session')
    #     session.close()
update: someone pointed me to this example which is another way to use the tx wrapper.
https://github.com/neo4j-graph-examples/northwind/blob/main/code/python/example.py#L16-L21
def raw_query(query, **kwargs):
    neodriver = neo_connect()  # cached dbconn
    with neodriver.session() as session:
        result = session.run(query, **kwargs)
        return result.data()
This is perfectly fine and works as intended on my end.
The error you're seeing is stating that there is a connection problem. So there must be something going on between the server and the driver that's outside of its influence.
Also, please note, that there is a difference between all of these ways to run a query:
with driver.session() as session:
    result = session.run("<SOME CYPHER>")

def work(tx):
    result = tx.run("<SOME CYPHER>")

with driver.session() as session:
    session.write_transaction(work)
The latter one might be 3 lines longer and the team working on the drivers collected some feedback regarding this. However, there are more things to consider here. Firstly, changing the API surface is something that needs careful planning and cannot be done in say a patch release. Secondly, there are technical hurdles to overcome. Here are the semantics, anyway:
Auto-commit transaction. Runs only that query as one unit of work.
If you run a new auto-commit transaction within the same session, the previous result will buffer all available records for you (depending on the query, this will consume a lot of memory). This can be avoided by calling result.consume(). However, if the session goes out of scope, the result will be consumed automatically. This means you cannot extract further records from it. Lastly, any error will be raised and needs handling in the application code.
Managed transaction. Runs whatever unit of work you want within that function. A transaction is implicitly started and committed (unless you rollback explicitly) around the function.
If the transaction ends (end of function or rollback), the result will be consumed and become invalid. You'll have to extract all records you need before that.
This is the recommended way of using the driver because it will not raise all errors but handle some internally (where appropriate) and retry the work function (e.g. if the server is only temporarily unavailable). Since the function might be executed multiple times, you must make sure it's idempotent.
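To illustrate why the work function has to be idempotent, here is a toy sketch of the retry behaviour described above. It does not use the neo4j driver at all: run_managed, TransientError and the dict standing in for a transaction are invented for this example, and the real driver's retry logic is certainly more involved.

```python
class TransientError(Exception):
    """Stand-in for a temporary failure such as a dropped connection."""


def run_managed(work, max_attempts=3):
    """Call work(tx) until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        tx = {"attempt": attempt}  # placeholder for a real transaction object
        try:
            return work(tx)
        except TransientError as err:
            last_error = err  # pretend to roll back, then try again
    raise last_error


calls = []


def work(tx):
    calls.append(tx["attempt"])
    if tx["attempt"] < 3:
        raise TransientError("server temporarily unavailable")
    # Extract everything you need before returning: in the real driver the
    # result becomes invalid once the transaction ends.
    return ["record-1", "record-2"]


records = run_managed(work)
```

Here work runs three times before it succeeds, so any side effect inside it happens three times as well. That is the reason the function should produce the same outcome no matter how often it is called.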
Closing thoughts:
Please remember that Stack Overflow is monitored on a best-effort basis, and what can be perceived as hasty comments may get in the way of getting helpful answers to your questions.

How to use multiple functions in twisted deferToThread?

According to the documentation on twisted deferToThread (http://twistedmatrix.com/documents/current/api/twisted.internet.threads.deferToThread.html), I can give it one function and its arguments.
I want to limit the output of the documents I will find (say to 3 documents), so I also want to use the limit function from MongoDB driver (pymongo).
"find" creates a PyMongo Cursor, and does no more work. "find" does not send a message to the MongoDB server, and it does not retrieve any results. The work does not begin unless you iterate the cursor like this:
for doc in cursor:
    print(doc)
Or:
all_docs = list(cursor)
So the way you're doing it is already wrong: you're deferring to a thread the work of creating a Cursor, which does not need to be deferred because it doesn't do network I/O. But you're then using the cursor on the main thread, which you do need to defer.
So I propose something like:
def find_all():
    # find_one() actually does network I/O
    doc1 = self.mongo_pool.database[collection].find_one(self.my_id)
    # creating a cursor does no I/O
    cursor = self.mongo_pool.database[collection].find().limit(3)
    # calling list() on a cursor does network I/O
    return doc1, list(cursor)

stuff_deferred = deferToThread(find_all)
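The point about lazy cursors can be shown without MongoDB or Twisted. The sketch below uses a made-up LazyCursor class as a stand-in for a PyMongo cursor and the standard library's ThreadPoolExecutor in place of deferToThread, so it only illustrates where the work happens, not either library's real behaviour.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LazyCursor:
    """Stand-in for a lazy cursor: nothing happens until it is iterated."""

    def __init__(self, docs):
        self._docs = docs
        self._limit = None
        self.io_thread = None  # name of the thread that did the "I/O"

    def limit(self, n):
        self._limit = n  # only records the limit, no work yet
        return self

    def __iter__(self):
        # Iteration is where a real cursor would talk to the server.
        self.io_thread = threading.current_thread().name
        return iter(self._docs[: self._limit])


def find_limited(cursor):
    # Consume the cursor inside the worker so the "I/O" happens there.
    return list(cursor)


cursor = LazyCursor([{"n": i} for i in range(10)]).limit(3)
with ThreadPoolExecutor(max_workers=1, thread_name_prefix="worker") as pool:
    docs = pool.submit(find_limited, cursor).result()
```

Creating the cursor and calling limit() on the main thread costs nothing. The iteration, which is the expensive part, runs on the worker thread because list() is called inside the submitted function.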

How to get and delete record simultaneously in SqlAlchemy?

Several processes read the table at the same time, and each process takes one task. Is it possible to avoid using LOCK TABLE in this case?
db.session.execute('LOCK TABLE "Task"')
query = db.session.query(models.Task).order_by(models.Task.ordr).limit(1)
for row in query:
    task = row
    db.session.delete(row)
db.session.commit()
By locking the table, you take a pessimistic approach to concurrency.
Alternatively, instead of locking the table, you can be optimistic that things will go the right way. I would wrap the code that retrieves a task in a retry loop, with error handling in case the commit fails because some other process has already removed the very task this process tried to get.
Something like this, perhaps:
def get_next_task():
    session = ...
    task = None
    while not task:
        try:
            query = session.query(models.Task).order_by(models.Task.ordr).limit(1)
            for row in query:
                task = row
                session.delete(row)
            session.commit()
            if not task:
                return  # no more tasks found
        except TODO_FIND_PROPER_EXCEPTION_TO_HANDLE as _exc:
            # another process got there first: undo and retry (or log the statement)
            session.rollback()
            task = None
    # maybe need to make_transient
    return task
Whether this solution is better will depend on the use case, though.
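One way to see the optimistic idea in isolation is to let the DELETE itself tell you whether you won the race. The sketch below swaps in plain sqlite3 for SQLAlchemy and checks the affected row count instead of catching an exception. The task table layout and the claim_next_task helper are assumptions made for the example.

```python
import sqlite3


def claim_next_task(conn):
    """Return the id of a task this caller removed, or None if none are left."""
    while True:
        row = conn.execute(
            "SELECT id FROM task ORDER BY ordr LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # no more tasks
        deleted = conn.execute(
            "DELETE FROM task WHERE id = ?", (row[0],)
        ).rowcount
        conn.commit()
        if deleted == 1:
            return row[0]  # this caller removed the row, so it owns the task
        # rowcount was 0: another worker deleted it first, so look again


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, ordr INTEGER)")
conn.executemany("INSERT INTO task (id, ordr) VALUES (?, ?)", [(10, 2), (20, 1)])
conn.commit()

claimed = [claim_next_task(conn), claim_next_task(conn), claim_next_task(conn)]
```

Only one worker can delete a given row, so the row count acts as the tie-breaker and no table lock is needed.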

should I reuse the cursor in the python MySQLdb module

I'm writing a Python CGI script that will query a MySQL database. I'm using the MySQLdb module. Since the database will be queried repeatedly, I wrote this function...
def getDatabaseResult(sqlQuery, connectioninfohere):
    # connect to the database
    vDatabase = MySQLdb.connect(connectioninfohere)
    # create a cursor, execute an SQL statement and get the result as a tuple
    cursor = vDatabase.cursor()
    try:
        cursor.execute(sqlQuery)
    except MySQLdb.Error:
        cursor.close()
        return None
    result = cursor.fetchall()
    cursor.close()
    return result
My question is: is this the best practice? Or should I reuse my cursor within my functions? For example, which is better...
def callsANewCursorAndConnectionEachTime():
    result1 = getDatabaseResult(someQuery1)
    result2 = getDatabaseResult(someQuery2)
    result3 = getDatabaseResult(someQuery3)
    result4 = getDatabaseResult(someQuery4)
or do away with the getDatabaseResult function altogether and do something like...
def reusesTheSameCursor():
    vDatabase = MySQLdb.connect(connectionInfohere)
    cursor = vDatabase.cursor()
    cursor.execute(someQuery1)
    result1 = cursor.fetchall()
    cursor.execute(someQuery2)
    result2 = cursor.fetchall()
    cursor.execute(someQuery3)
    result3 = cursor.fetchall()
    cursor.execute(someQuery4)
    result4 = cursor.fetchall()
The MySQLdb developer recommends building an application-specific API that does the DB access for you, so that you don't have to worry about MySQL query strings in the application code. It will also make the code a bit more extendable.
As for cursors, my understanding is that the best thing is to create one cursor per operation/transaction. So a "check value -> update value -> read value" type of transaction could use the same cursor, but for the next transaction you would create a new one. This again points in the direction of building an internal API for DB access instead of having a generic executeSql method.
Also remember to close your cursors, and to commit changes on the connection after the queries are done.
Your getDatabaseResult function doesn't need to connect for every separate query, though. You can share the connection between queries as long as you act responsibly with the cursors.
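As a rough sketch of "one shared connection, one short-lived cursor per operation", here is the pattern using sqlite3 from the standard library, which follows the same DB-API 2.0 shape as MySQLdb. The fetch_all helper and the item table are invented for the example, and MySQLdb uses %s placeholders rather than ?, so this is not drop-in code.

```python
import sqlite3
from contextlib import closing


def fetch_all(conn, sql, params=()):
    """Run one query on a shared connection with its own short-lived cursor."""
    with closing(conn.cursor()) as cursor:  # cursor is closed even on error
        cursor.execute(sql, params)
        return cursor.fetchall()


# One connection, opened once and shared by every query below.
conn = sqlite3.connect(":memory:")
with closing(conn.cursor()) as cursor:
    cursor.execute("CREATE TABLE item (name TEXT, qty INTEGER)")
    cursor.executemany(
        "INSERT INTO item VALUES (?, ?)", [("bolt", 4), ("nut", 9)]
    )
conn.commit()  # commit changes on the connection once the writes are done

names = fetch_all(conn, "SELECT name FROM item ORDER BY name")
total = fetch_all(conn, "SELECT SUM(qty) FROM item")
```

Each call gets a fresh cursor that is always closed, while the connection cost is paid only once.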
