pymongo db.collection.update operationFailure - python

I have a large collection of documents which I'm trying to update using the pymongo.update function. I am finding all documents that fall within a certain polygon and updating all the
points found with "update_value".
for element in geomShapeCollection:
    db.collectionName.update({"coordinates":{"$geoWithin":{"$geometry":element["geometry_part"]}}}, {"$set":{"Update_key": update_value}}, multi = True, timeout=False)
For smaller collections this command works as expected. In the largest dataset
the command works for 70-80% of the data and then throws the error:
pymongo.errors.OperationFailure: cursor id '428737620678732339' not
valid at server
The pymongo documentation tells me that this is possibly due to a timeout issue.
Cursors in MongoDB can timeout on the server if they’ve been open for
a long time without any operations being performed on them.
Reading through the pymongo documentation, the find() function has a boolean flag for timeout.
find(spec=None, fields=None, skip=0, limit=0, timeout=True, snapshot=False, tailable=False, _sock=None, _must_use_master=False,_is_command=False)
However the update function appears not to have this:
update(spec, document, upsert=False, manipulate=False, safe=False, multi=False)
Is there any way to set this timeout flag for the update function? Is there any way I can change this so that I do not get this OperationFailure error? Am I correct in assuming this is a timeout error? The pymongo documentation says only that OperationFailure is:
Raised when a database operation fails.

After some research and lots of experimentation I found that it was the outer loop cursor that was causing the error.
for element in geomShapeCollection:
geomShapeCollection is a cursor over a MongoDB collection. Some of its elements contain a very large number of points, and the updates for those elements take so long that the geomShapeCollection cursor times out on the server before the loop asks it for the next element.
The problem was not with the update function at all. Adding a (timeout=False) to the outer cursor solves this problem.
for element in db.geomShapeCollectionName.find(timeout=False):
    db.collectionName.update({"coordinates":{"$geoWithin":{"$geometry":element["geometry_part"]}}}, {"$set":{"Update_key": update_value}}, multi = True, timeout=False)
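For readers on PyMongo 3 or later: there the `timeout` flag of `find()` is spelled `no_cursor_timeout`, and `update(..., multi=True)` is replaced by `update_many()`. A minimal sketch of the same fix under that newer API follows. The function name and its parameters are hypothetical; the field names are taken from the code above.

```python
def update_points_in_shapes(shapes, points, update_value):
    """Set Update_key on every point that lies inside each stored shape.

    `shapes` and `points` are assumed to be pymongo Collection objects
    (PyMongo 3+). The function name is illustrative only.
    """
    # Ask the server not to expire the outer cursor while the slow updates run.
    cursor = shapes.find(no_cursor_timeout=True)
    modified = 0
    try:
        for element in cursor:
            result = points.update_many(
                {"coordinates": {"$geoWithin": {"$geometry": element["geometry_part"]}}},
                {"$set": {"Update_key": update_value}},
            )
            modified += result.modified_count
    finally:
        # A cursor opened with no_cursor_timeout is not cleaned up for
        # idleness, so close it explicitly once the loop is done.
        cursor.close()
    return modified
```

It would be called as `update_points_in_shapes(db.geomShapeCollectionName, db.collectionName, update_value)`. If the shape collection is small, another option is to read it into a list first so that no cursor stays open during the updates.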

Related

How to insert data in redshift using either of boto3 or psycopg2 python libraries

Which library is best to use among "boto3" and "Psycopg2" for redshift operations in python lambda functions:
Lookup for a table in redshift cluster
Create a table in redshift cluster
Insert data in redshift cluster
I would appreciate an answer that includes the following:
python code for either of the library that addresses all of the above 3 needs.
Thanks in Advance!!
Connecting directly to Redshift from Lambda with psycopg2 is the simpler, more straight-forward way to go but comes with a significant limitation. Lambda functions have run-time limits and even if your SQL commands don't exceed the max run-time, you will be paying for the Lambda function to wait for Redshift to complete the SQL. For fast-running SQL commands things run quickly and this isn't a problem but inserting data can take some time depending on the amount of data.
If all your Redshift actions take less than a few seconds (and won't grow longer with time), then psycopg2 connecting directly to Redshift is likely the way to go. If the data insert takes a minute or two BUT this process doesn't run very often (daily), then psycopg2 may still be the way to go, as Lambda isn't very expensive when run infrequently. It is a process-simplicity vs. cost calculation.
Using the Redshift Data API is more complicated. This process lets you fire the SQL at Redshift and terminate the Lambda. A later-running Lambda checks whether the SQL has completed and inspects its results. If the SQL has not completed, a Lambda needs to be invoked again later to check. This polling is often done by a Step Function and a set of different Lambda functions. Not super difficult, but a level of complexity above a single Lambda. Since this is a polling process, there is a wait time between checks: too long adds latency, too short means over-polling and additional cost.
If you need the Data API for time-out reasons, you may want to use both: psycopg2 for short-running queries like 'does this table exist?', and the Data API for long-running steps like 'insert this 1TB set of data into Redshift'.
Sample basic python code for all three operations using boto3.
import json
import boto3

clientdata = boto3.client('redshift-data')

# looks up table and returns true if found
def lookup_table(table_name):
    response = clientdata.list_tables(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        TablePattern=table_name
    )
    print(response)
    if ( len(response['Tables']) == 0 ):
        return False
    else:
        return True

# creates table with one integer column
def create_table(table_name):
    sqlstmt = 'CREATE TABLE '+table_name+' (col1 integer);'
    print(sqlstmt)
    response = clientdata.execute_statement(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        Sql=sqlstmt,
        StatementName='CreateTable'
    )
    print(response)

# inserts one row with integer value for col1
def insert_data(table_name, dval):
    print(dval)
    sqlstmt = 'INSERT INTO '+table_name+'(col1) VALUES ('+str(dval)+');'
    response = clientdata.execute_statement(
        ClusterIdentifier='redshift-cluster-1',
        Database='dev',
        DbUser='awsuser',
        Sql=sqlstmt,
        StatementName='InsertData'
    )
    print(response)

result = lookup_table('date')
if ( result ):
    print("Table exists.")
else:
    print("Table does not exist!")

create_table("testtab")
insert_data("testtab", 11)
I am not using Lambda, instead executing it just from my shell. Hope this helps. Assuming credentials and default region are already set up for the client.
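Note that `execute_statement` only submits the SQL; the sample above never checks whether the statement finished. The polling step described earlier could be sketched roughly as below, using the Data API's `describe_statement` call. The helper name `run_and_wait` and its polling defaults are assumptions for illustration, not part of boto3. A blocking loop like this suits a shell script; inside Lambda the check would normally live in a separate, later invocation.

```python
import time


def run_and_wait(client, sql, poll_seconds=2.0, max_polls=30, **target):
    """Submit `sql` through the Redshift Data API and poll until it ends.

    `client` is assumed to be boto3.client('redshift-data'); `target` carries
    ClusterIdentifier, Database and DbUser as in the sample above. The helper
    name and its defaults are hypothetical.
    """
    statement_id = client.execute_statement(Sql=sql, **target)["Id"]
    for _ in range(max_polls):
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(
                f"statement {statement_id} {status}: {desc.get('Error')}"
            )
        time.sleep(poll_seconds)  # still SUBMITTED, PICKED or STARTED
    raise TimeoutError(
        f"statement {statement_id} still running after {max_polls} polls"
    )
```

For example: `run_and_wait(clientdata, sqlstmt, ClusterIdentifier='redshift-cluster-1', Database='dev', DbUser='awsuser')`.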

PyMongo max_time_ms

I would like to use the max_time_ms flag during a find on MongoDB, but I would like to understand how this flag works and how to verify that it is working.
pymongo find().max_time_ms(500)
Is there any way to verify?
I tried db.fsyncLock(), but as far as I can tell it applies only to inserts.
I thought a possible approach would be to insert a very large number of entries and reduce the limit to max_time_ms(1), so that the query does not have enough time to return results.
Any suggestions?
Tks
Passing the max_time_ms option this way
cursor = db.collection.find().max_time_ms(1)
or
cursor = db.collection.find(max_time_ms=1)
sets a time limit for the query and errors out with a pymongo.errors.ExecutionTimeout exception when the time limit specified is exceeded for the query.
Since cursors are lazy, this exception is raised when accessing results from the cursor e.g.
for doc in cursor:
    print(doc)
ExecutionTimeout: operation exceeded time limit
max_time_ms (optional): Specifies a time limit for a query
operation. If the specified time is exceeded, the operation will be
aborted and :exc:~pymongo.errors.ExecutionTimeout is raised. Pass
this as an alternative to calling
[Source: Docs]

Postgres SSL SYSCALL error: EOF detected with python and psycopg

Using psycopg2 package with python 2.7 I keep getting the titled error: psycopg2.DatabaseError: SSL SYSCALL error: EOF detected
It only occurs when I add a WHERE column LIKE ''%X%'' clause to my pgrouting query. An example:
SELECT id1 as node, cost FROM PGR_Driving_Distance(
'SELECT id, source, target, cost
FROM edge_table
WHERE cost IS NOT NULL and column LIKE ''%x%'' ',
1, 10, false, false)
Threads on the internet suggest it is an issue with SSL intuitively, but whenever I comment out the pattern matching side of things the query and connection to the database works fine.
This is on a local database running Xubuntu 13.10.
After further investigation: it looks like this may be caused by the pgrouting extension crashing the database, because it is a bad query and there are no links that match this pattern.
Will post an answer soon ...
The error: psycopg2.operationalerror: SSL SYSCALL error: EOF detected
The setup: Airflow + Redshift + psycopg2
When: Queries take a long time to execute (more than 300 seconds).
A socket timeout occurs in this instance. What solves this specific variant of the error is adding keepalive arguments to the connection string.
keepalive_kwargs = {
    "keepalives": 1,
    "keepalives_idle": 30,
    "keepalives_interval": 5,
    "keepalives_count": 5,
}
connection = psycopg2.connect(connection_string, **keepalive_kwargs)
Redshift requires a keepalives_idle of less than 300. A value of 30 worked for me, your mileage may vary. It is also possible that the keepalives_idle argument is the only one you need to set - but ensure keepalives is set to 1.
Link to docs on postgres keepalives.
Link to airflow doc advising on 300 timeout.
I ran into this problem when running a slow query in a Droplet on a Digital Ocean instance. All other SQL would run fine and it worked on my laptop. After scaling up to a 1 GB RAM instance instead of 512 MB it works fine so it seems that this error could occur if the process is running out of memory.
Very similar answer to what @FoxMulder900 did, except I could not get his first select to work. This works, though:
WITH long_running AS (
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '1 minutes'
and state = 'active'
)
SELECT * from long_running;
If you want to kill the processes from long_running just comment out the last line and insert SELECT pg_cancel_backend(long_running.pid) from long_running ;
This issue occurred for me when I had some rogue queries running causing tables to be locked indefinitely. I was able to see the queries by running:
SELECT * from STV_RECENTS where status='Running' order by starttime desc;
then kill them with:
SELECT pg_terminate_backend(<pid>);
I encountered the same error. CPU and RAM usage both looked fine, and the solution by @antonagestam didn't work for me.
Basically, the issue was at the step of engine creation. pool_pre_ping=True solved the problem:
engine = sqlalchemy.create_engine(connection_string, pool_pre_ping=True)
With this setting, each time a connection is checked out of the pool, SQLAlchemy first sends a lightweight test query such as SELECT 1. If the test fails, the connection is recycled and checked again; once it succeeds, your query is executed.
sqlalchemy docs on pool_pre_ping
In my case, I had the same error in python logs. I checked the log file in /var/log/postgresql/, and there were a lot of error messages could not receive data from client: Connection reset by peer and unexpected EOF on client connection with an open transaction. This can happen due to network issues.
In my case that was OOM killer (query is too heavy)
Check dmesg:
dmesg | grep -A2 Kill
In my case:
Out of memory: Kill process 28715 (postgres) score 150 or sacrifice child
I got this error running a large UPDATE statement on a 3 million row table. In my case it turned out the disk was full. Once I had added more space the UPDATE worked fine.
You may need to express % as %% because % is the placeholder marker. http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries

Python SQLite - How to manually BEGIN and END transactions?

Context
So I am trying to figure out how to properly override the auto-transaction when using SQLite in Python. When I try and run
cursor.execute("BEGIN;")
.....an assortment of insert statements...
cursor.execute("END;")
I get the following error:
OperationalError: cannot commit - no transaction is active
Which I understand is because SQLite in Python automatically opens a transaction on each modifying statement, which in this case is an INSERT.
Question:
I am trying to speed my insertion by doing one transaction per several thousand records.
How can I overcome the automatic opening of transactions?
As @CL. said, you have to set the isolation level to None. Code example:
s = sqlite3.connect("./data.db")
s.isolation_level = None
try:
    c = s.cursor()
    c.execute("begin")
    ...
    c.execute("commit")
except:
    c.execute("rollback")
The documentation says:
You can control which kind of BEGIN statements sqlite3 implicitly executes (or none at all) via the isolation_level parameter to the connect() call, or via the isolation_level property of connections.
If you want autocommit mode, then set isolation_level to None.
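Applied to the goal in the question (one transaction per several thousand inserts), this could look roughly like the sketch below. The table `items`, the helper name `insert_in_batches` and the batch size are assumptions chosen for illustration.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows into items(value), one explicit transaction per batch."""
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.execute("BEGIN")
        try:
            cur.executemany(
                "INSERT INTO items(value) VALUES (?)",
                rows[start:start + batch_size],
            )
            cur.execute("COMMIT")
        except sqlite3.Error:
            cur.execute("ROLLBACK")  # drop only the failed batch
            raise


# isolation_level=None stops sqlite3 from opening transactions implicitly,
# so the BEGIN and COMMIT above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE items (value INTEGER)")
insert_in_batches(conn, [(i,) for i in range(2500)])
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # prints 2500
```

With 2500 rows and the default batch size this runs three transactions (1000, 1000 and 500 rows).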

How do I efficiently do a bulk insert-or-update with SQLAlchemy?

I'm using SQLAlchemy with a Postgres backend to do a bulk insert-or-update. To try to improve performance, I'm attempting to commit only once every thousand rows or so:
trans = engine.begin()
for i, rec in enumerate(records):
    if i % 1000 == 0:
        trans.commit()
        trans = engine.begin()
    try:
        inserter.execute(...)
    except sa.exceptions.SQLError:
        my_table.update(...).execute()
trans.commit()
However, this isn't working. It seems that when the INSERT fails, it leaves things in a weird state that prevents the UPDATE from happening. Is it automatically rolling back the transaction? If so, can this be stopped? I don't want my entire transaction rolled back in the event of a problem, which is why I'm trying to catch the exception in the first place.
The error message I'm getting, BTW, is "sqlalchemy.exc.InternalError: (InternalError) current transaction is aborted, commands ignored until end of transaction block", and it happens on the update().execute() call.
You're hitting some weird Postgresql-specific behavior: if an error happens in a transaction, it forces the whole transaction to be rolled back. I consider this a Postgres design bug; it takes quite a bit of SQL contortionism to work around in some cases.
One workaround is to do the UPDATE first. Detect if it actually modified a row by looking at cursor.rowcount; if it didn't modify any rows, it didn't exist, so do the INSERT. (This will be faster if you update more frequently than you insert, of course.)
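A minimal sketch of the update-first approach, written against the standard-library sqlite3 driver only so that it runs on its own. The table layout and the helper name are assumptions; with psycopg2 the placeholders would be `%s` rather than `?`, but the `cursor.rowcount` check is the same.

```python
import sqlite3


def update_or_insert(cur, key, value):
    """Try the UPDATE first; INSERT only when no row was touched."""
    cur.execute("UPDATE my_table SET value = ? WHERE id = ?", (value, key))
    if cur.rowcount == 0:
        # No existing row matched this id, so create it.
        cur.execute("INSERT INTO my_table (id, value) VALUES (?, ?)", (key, value))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)")
cur = conn.cursor()
for key, value in [(1, "a"), (2, "b"), (1, "c")]:
    update_or_insert(cur, key, value)
conn.commit()
print(cur.execute("SELECT id, value FROM my_table ORDER BY id").fetchall())
# prints [(1, 'c'), (2, 'b')]
```

No statement fails on the normal path, so the transaction is never put into the aborted state. Under concurrent writers another session could still insert the same key between the UPDATE and the INSERT, so some error handling may remain necessary.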
Another workaround is to use savepoints:
SAVEPOINT a;
INSERT INTO ....;
-- on error:
ROLLBACK TO SAVEPOINT a;
UPDATE ...;
-- on success:
RELEASE SAVEPOINT a;
This has a serious problem for production-quality code: you have to detect the error accurately. Presumably you're expecting to hit a unique constraint check, but you may hit something unexpected, and it may be next to impossible to reliably distinguish the expected error from the unexpected one. If this hits the error condition incorrectly, it'll lead to obscure problems where nothing will be updated or inserted and no error will be seen. Be very careful with this. You can narrow down the error case by looking at Postgresql's error code to make sure it's the error type you're expecting, but the potential problem is still there.
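As a rough illustration of the savepoint pattern with a narrowed error check, here is a sketch using sqlite3 as a stand-in so it runs without a server. Table and helper names are assumptions. With psycopg2 one would catch its `IntegrityError` and could additionally compare the exception's `pgcode` with `'23505'` (unique_violation) before falling back to the UPDATE.

```python
import sqlite3


def insert_or_update(conn, key, value):
    """INSERT inside a savepoint; fall back to UPDATE on a key conflict."""
    conn.execute("SAVEPOINT a")
    try:
        conn.execute("INSERT INTO my_table (id, value) VALUES (?, ?)", (key, value))
    except sqlite3.IntegrityError:
        # Only constraint violations are read as "row already exists";
        # any other error propagates instead of being silently swallowed.
        conn.execute("ROLLBACK TO SAVEPOINT a")
        conn.execute("UPDATE my_table SET value = ? WHERE id = ?", (value, key))
    conn.execute("RELEASE SAVEPOINT a")


# isolation_level=None: transactions are controlled explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("BEGIN")
for key, value in [(1, "a"), (2, "b"), (1, "c")]:
    insert_or_update(conn, key, value)
conn.execute("COMMIT")
print(conn.execute("SELECT id, value FROM my_table ORDER BY id").fetchall())
# prints [(1, 'c'), (2, 'b')]
```

SQLite does not abort the surrounding transaction after a failed statement, so the savepoint is not strictly required there; it is kept to mirror the statement sequence PostgreSQL needs.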
Finally, if you really want to do batch-insert-or-update, you actually want to do many of them in a few commands, not one item per command. This requires trickier SQL: SELECT nested inside an INSERT, filtering out the right items to insert and update.
This error is from PostgreSQL. PostgreSQL doesn't allow you to execute further commands in the same transaction once one command has raised an error. To fix this you can use nested transactions (implemented using SQL savepoints) via conn.begin_nested(). Here's something that might work. I made the code use explicit connections, factored out the chunking part and made the code use the context manager to manage transactions correctly.
from itertools import chain, islice

def chunked(seq, chunksize):
    """Yields items from an iterator in chunks."""
    it = iter(seq)
    while True:
        try:
            first = next(it)
        except StopIteration:
            return  # input exhausted
        yield chain([first], islice(it, chunksize - 1))

conn = engine.connect()
for chunk in chunked(records, 1000):
    with conn.begin():
        for rec in chunk:
            try:
                with conn.begin_nested():
                    conn.execute(inserter, ...)
            except sa.exceptions.SQLError:
                conn.execute(my_table.update(...))
This still won't have stellar performance though due to nested transaction overhead. If you want better performance try to detect which rows will create errors beforehand with a select query and use executemany support (execute can take a list of dicts if all inserts use the same columns). If you need to handle concurrent updates, you'll still need to do error handling either via retrying or falling back to one by one inserts.
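The "select first, then executemany" idea might be sketched as follows. sqlite3 is used as a stand-in so the example is self-contained; the table, the helper name `bulk_upsert` and the assumption that ids within `records` are unique are all illustrative. A real version would likely restrict the SELECT to the ids in the current batch instead of reading every key.

```python
import sqlite3


def bulk_upsert(conn, records):
    """Split records by whether their id already exists, then batch each half.

    `records` is a list of (id, value) tuples with unique ids (an assumption).
    Returns (number inserted, number updated).
    """
    existing = {row[0] for row in conn.execute("SELECT id FROM my_table")}
    to_update = [(value, key) for key, value in records if key in existing]
    to_insert = [(key, value) for key, value in records if key not in existing]
    with conn:  # one transaction covering both batches
        conn.executemany("UPDATE my_table SET value = ? WHERE id = ?", to_update)
        conn.executemany("INSERT INTO my_table (id, value) VALUES (?, ?)", to_insert)
    return len(to_insert), len(to_update)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO my_table VALUES (1, 'old')")
conn.commit()
print(bulk_upsert(conn, [(1, "new"), (2, "b"), (3, "c")]))  # prints (2, 1)
```

This issues three statements per batch regardless of batch size, but it does not protect against another writer inserting one of the keys between the SELECT and the INSERT.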
