I'm trying to fetch more than 1200 records from a query in GAE, but it fails: it just keeps loading forever. I can see in the debug output that it keeps issuing GET requests to Google, but I never see any results.
This works perfectly:
for lcr in deactivation_list.fetch(1200,offset=0, batch_size=1000):
This keeps loading:
for lcr in deactivation_list.fetch(1201,offset=0, batch_size=1000):
I tried increasing the batch size, but it didn't help. I'm using NDB models.
The only solution I have found is to use cursors, as suggested in the comments earlier. The reason is that the Google Remote API has a 1 MB limit; with cursors you can query it multiple times.
recordQuery = model.query()
record, cursor, more = recordQuery.fetch_page(1000)
while more:
    # process `record` (the current page of results) before fetching the next page
    record, cursor, more = recordQuery.fetch_page(1000, start_cursor=cursor)
Related
I have a MongoDB pod with millions of records in Kubernetes, and I am using AsyncIOMotorClient to iterate over the cursor with a batch size of 1000. However, after a few iterations the fetching gets stuck, and I can see a "Slow query" message in the MongoDB logs.
The iteration looks like this after getting the collection:
cursor = collection.aggregate_raw_batches(pipeline=pipeline, batchSize=c['QUERY_BATCH_SIZE'])
async for batch in cursor:
    ...
I even tried to paginate the fetched records by adding $skip and $limit stages to the pipeline, but the same problem appears again.
I also tried tracking resource usage such as CPU and memory in Kubernetes, but everything looks fine except MongoDB itself.
Example log of MongoDB:
{"t":{"$date":"2022-03-24T13:02:41.042+00:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn1214224","msg":"Slow query","attr":{"type":"command","ns":"product.feed_1648113467","command":{"getMore":4946197083994343333,"collection":"feed_1648113467","batchSize":1000,"lsid":{"id":{"$uuid":"bec21ebb-f305-4438-8e1d-16badfc22bd3"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1648126959,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"product"},"originatingCommand":{"aggregate":"feed_1648113467","pipeline":[{"$match":{"_disapproved":{"$ne":true},"_gpla_feed_name":"detske-zbozi/kocarky","_delivery.direct":true,"$or":[{"_restricted":{"$ne":true}},{"_restriction_exception_for_google":true}]}},{"$limit":1000000},{"$project":{"_id":0,"g:description":1,"g:id":1,"title":"$g:name","g:price":{"$concat":[{"$toString":"$g:price_rounded"}," CZK"]},"g:brand":"$g:producer_name","g:image_link":"$g:image_url","g:additional_image_link":1,"link":"$g:url","g:availability":"in stock","g:condition":"new","g:material":1,"g:color":1,"g:gender":1,"g:size":1,"g:product_type":1,"g:google_product_category":1,"g:energy_efficiency_class":1,"g:unit_pricing_measure":1,"g:unit_pricing_base_measure":1,"g:gtin":"$g:ean","g:adult":"$_restricted","g:product_detail":1}}],"cursor":{"batchSize":0},"lsid":{"id":{"$uuid":"bec21ebb-f305-4438-8e1d-16badfc22bd3"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1648126959,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"product","$readPreference":{"mode":"primaryPreferred"}},"planSummary":"IXSCAN { _disapproved: 1, _gpla_feed_name: 1 }","cursorid":4946197083994343333,"keysExamined":861,"docsExamined":859,"cursorExhausted":true,"numYields":8,"nreturned":960,"reslen":642671,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":9}},"Global":{"acquireCount":{"r":9}},"Database":{"acquireCount":{"r":9}},"Collection":{"acquireCount":{"r":9}},"Mutex":{"acquireCount":{"r":1}}},"storage":{"data":{"bytesRead":21129204,"timeReadingMicros":112254}},"protocol":"op_msg","durationMillis":138}}
Thanks for any help
I am reading over 100 million records from MongoDB and creating nodes and relationships in Neo4j.
Whenever I run this, after processing a certain number of records I get pymongo.errors.CursorNotFound: cursor id "..." not found.
Earlier, when I was running it without no_cursor_timeout=True in the MongoDB query, I got the same error at every 64179 records. After looking this up on Stack Overflow I added no_cursor_timeout=True, but now I get the same error at record 2691734. How can I get rid of this error? I have also tried setting the batch size.
Per the ticket Belly Buster mentioned, you should try:
manually specifying the session to use with all your operations, and
periodically pinging the server using that session id to keep it alive on the server (a rough sketch of both points follows below).
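A minimal sketch of that approach, assuming a plain pymongo client (the connection string, database and collection names are placeholders, and the 5-minute refresh interval is an assumption chosen to stay well inside the server's default 30-minute session timeout):

import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
coll = client["mydb"]["mycollection"]              # placeholder database/collection names

REFRESH_INTERVAL = 5 * 60  # seconds; refresh well before the 30-minute session timeout

with client.start_session() as session:
    cursor = coll.find({}, no_cursor_timeout=True, batch_size=1000, session=session)
    last_refresh = time.time()
    try:
        for doc in cursor:
            # periodically ping the server with this session id so it is not reaped
            if time.time() - last_refresh > REFRESH_INTERVAL:
                client.admin.command("refreshSessions", [session.session_id])
                last_refresh = time.time()
            # ... process doc, e.g. create the Neo4j nodes and relationships ...
    finally:
        cursor.close()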
I have a caching problem when I use SQLAlchemy.
I use SQLAlchemy to insert data into a MySQL database. Then I have another application process this data and update it directly.
But SQLAlchemy always returns the old data rather than the updated data. I think SQLAlchemy is caching my request... so how do I disable it?
The usual cause for people thinking there's a "cache" at play, besides the usual SQLAlchemy identity map which is local to a transaction, is that they are observing the effects of transaction isolation. SQLAlchemy's session works by default in a transactional mode, meaning it waits until session.commit() is called in order to persist data to the database. During this time, other transactions in progress elsewhere will not see this data.
However, due to the isolated nature of transactions, there's an extra twist. Those other transactions in progress will not only not see your transaction's data until it is committed, they also can't see it in some cases until they are committed or rolled back also (which is the same effect your close() is having here). A transaction with an average degree of isolation will hold onto the state that it has loaded thus far, and keep giving you that same state local to the transaction even though the real data has changed - this is called repeatable reads in transaction isolation parlance.
http://en.wikipedia.org/wiki/Isolation_%28database_systems%29
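A minimal sketch of the effect, assuming an existing engine and a hypothetical mapped class User: ending the transaction (commit or rollback) lets the next query see rows that other connections committed in the meantime.

from sqlalchemy.orm import Session

with Session(engine) as session:                            # `engine` is assumed to exist
    rows = session.query(User).filter_by(name="x").all()
    # ... another process updates matching rows and commits ...
    session.commit()   # ends the transaction and expires objects loaded so far
    rows = session.query(User).filter_by(name="x").all()    # new transaction, fresh data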
This issue has been really frustrating for me, but I have finally figured it out.
I have a Flask/SQLAlchemy Application running alongside an older PHP site. The PHP site would write to the database and SQLAlchemy would not be aware of any changes.
I tried the sessionmaker setting autoflush=True, unsuccessfully.
I tried db_session.flush(), db_session.expire_all(), and db_session.commit() before querying, and none of them worked; the query still showed stale data.
Finally I came across this section of the SQLAlchemy docs: http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html#transaction-isolation-level
Setting the isolation_level worked great. Now my Flask app is "talking" to the PHP app. Here's the code:
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+pg8000://scott:tiger@localhost/test",
    isolation_level="READ UNCOMMITTED"
)
When the SQLAlchemy engine is started with the "READ UNCOMMITTED" isolation_level, it will perform "dirty reads", which means it will read uncommitted changes directly from the database.
Hope this helps
Here is a possible solution, courtesy of AaronD in the comments:
from flask_sqlalchemy import SQLAlchemy  # the old flask.ext.sqlalchemy import path no longer exists

class UnlockedAlchemy(SQLAlchemy):
    def apply_driver_hacks(self, app, info, options):
        if "isolation_level" not in options:
            options["isolation_level"] = "READ COMMITTED"
        return super(UnlockedAlchemy, self).apply_driver_hacks(app, info, options)
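A hypothetical usage sketch (the app setup and database URI are assumptions): construct the subclass instead of SQLAlchemy directly, and every engine it creates picks up the READ COMMITTED isolation level.

from flask import Flask

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://scott:tiger@localhost/test"  # assumed URI
db = UnlockedAlchemy(app)  # apply_driver_hacks runs when Flask-SQLAlchemy creates the engine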
In addition to zzzeek's excellent answer, I had a similar issue. I solved the problem by using short-lived sessions.
from contextlib import closing  # closing() guarantees sess.close() on exit

with closing(new_session()) as sess:
    ...  # do your stuff
I used a fresh session per task, task group, or request (in the case of a web app). That solved the "caching" problem for me.
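For completeness, one possible definition of the new_session() factory used above, with an assumed connection string:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")  # hypothetical URI
new_session = sessionmaker(bind=engine)  # each call returns a fresh, short-lived Session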
This material was very useful for me:
When do I construct a Session, when do I commit it, and when do I close it
This was happening in my Flask application, and my solution was to expire all objects in the session after every request.
from flask.signals import request_finished

def expire_session(sender, response, **extra):
    app.db.session.expire_all()

request_finished.connect(expire_session, flask_app)
Worked like a charm.
I tried session.commit() and session.flush(); neither worked for me.
After going through the SQLAlchemy source code, I found a way to disable caching.
Setting query_cache_size=0 in create_engine worked.
create_engine(connection_string, convert_unicode=True, echo=True, query_cache_size=0)
First, there is no cache for SQLAlchemy.
Based on the method you use to fetch data from the DB, you should run some tests after the database has been updated by others and see whether you can get the new data.
(1) use connection:
connection = engine.connect()
result = connection.execute("select username from users")
for row in result:
    print("username:", row['username'])
connection.close()
(2) use Engine ...
(3) use MetaData...
Please follow the steps in: http://docs.sqlalchemy.org/en/latest/core/connections.html
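For illustration only, a hedged sketch of the engine-level variant hinted at in (2), with an assumed connection string; a fresh connection starts a new transaction, so it sees data that other clients have already committed:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")  # assumed URI
with engine.connect() as connection:
    for row in connection.execute(text("select username from users")):
        print("username:", row.username)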
Another possible reason is that your MySQL DB is not actually being updated persistently. Restart the MySQL service and check.
As far as I know, SQLAlchemy does not cache results, so you should look at the logging output.
I need to save data for a session in Django and perform some action when the user clicks a button. I am storing data processed after a query in Django sessions. This was working well on my local server, even when I hit the server simultaneously from different sessions at the same time. However, when pushed to prod, it shows a KeyError at /url/ from the second time onward when I hit the site. The data serves fine on the first go.
I looked up some solutions and have tried adding SESSION_ENGINE as "django.contrib.sessions.backends.cached_db". I have added SESSION_SAVE_EVERY_REQUEST = True in settings.py. I also tried saving data for every session key separately, which did not work either.
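For reference, the settings described above would look roughly like this in settings.py:

# settings.py (values taken from the description above)
SESSION_ENGINE = "django.contrib.sessions.backends.cached_db"
SESSION_SAVE_EVERY_REQUEST = True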
I am saving the data to sessions like this:
request.session['varname'] = varname
and retrieving it the same way:
varname = request.session['varname']
Expected behavior would be successful retrieval of session data every time like on a local server. However, on prod, the data is not retrieved after the first time.
I have been writing an access log to a MySQL table, but recently it became too much for MySQL, so I decided to save it in Google BigQuery instead. I don't know if it is the best option, but it seems viable. Does anyone have comments about that? Okay...
I started integrating with Google BigQuery: I made a small application with Flask (a Python framework) and created endpoints to receive data and send it to BigQuery. Now my main application sends data to a URL that points to my Flask application, which in turn sends it to BigQuery. Any observations or suggestions here?
Finally, my problem: sometimes I lose data. I wrote a script to test my main application and check the results; I ran it many times and noticed that I lost some data, because sometimes the same data is saved and sometimes it is not. Does someone have an idea of what could be happening? And most importantly, how can I prevent losing data in that case? How can my application detect that data wasn't saved to Google BigQuery and then handle it, for example by retrying?
I am using google-cloud-python library (reference: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html#tables).
My code:
from google.cloud import bigquery
from google.cloud.bigquery import SchemaField

client = bigquery.Client(project=project_id)
table_ref = client.dataset(dataset_id).table(table_id)
SCHEMA = [SchemaField(**field) for field in schema]
errors = client.create_rows(table_ref, [row], SCHEMA)  # returns a list of per-row insert errors
That is all
As I expected, you don't handle the errors return value. Make sure you handle it and understand how streaming inserts work: if you stream 1000 rows and 56 fail, you get those 56 back and you need to retry only those 56 rows. The insertId is also important, because it lets BigQuery de-duplicate rows that you retry.
Streaming Data into BigQuery
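A minimal retry sketch under stated assumptions: it uses the current client's insert_rows_json call rather than the older create_rows from the question (both return per-row errors); project_id, dataset_id and table_id are the same names as in the question; the attempt count and backoff are arbitrary choices. Passing row_ids supplies an insertId per row so BigQuery can de-duplicate retried rows.

import time
from google.cloud import bigquery

client = bigquery.Client(project=project_id)                     # project_id as in the question
table = client.get_table("{}.{}".format(dataset_id, table_id))   # dataset/table as in the question

def stream_with_retry(rows, row_ids, max_attempts=3):
    """Retry only the rows that failed, re-using their insertIds for de-duplication."""
    for attempt in range(max_attempts):
        errors = client.insert_rows_json(table, rows, row_ids=row_ids)
        if not errors:
            return []                                   # every row was accepted
        failed = sorted({e["index"] for e in errors})   # indexes of the rejected rows
        rows = [rows[i] for i in failed]
        row_ids = [row_ids[i] for i in failed]
        time.sleep(2 ** attempt)                        # simple backoff before retrying
    return errors                                       # rows still failing after all attempts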