Reattempting failed bulk inserts in pymongo - Python

I am trying to do a bulk insert of documents into a MongoDB collection in Python, using pymongo. This is what the code looks like:
collection_name.insert_many([logs[i] for i in range(len(logs))])
where logs is a list of dictionaries of variable length.
This works fine when there are no issues with any of the logs. However, if any one of the logs has some kind of issue and pymongo refuses to save it (say, the document fails to match the validation schema set for that collection), the entire bulk insert is rolled back and no documents are inserted in the database.
Is there any way I can retry the bulk insert by ignoring only the defective log?

You can ignore those types of errors by passing ordered=False: collection.insert_many(logs, ordered=False). All operations are then attempted before an exception is raised, which you can catch.
See https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
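A minimal sketch of that pattern, using the logs list and collection from the question; with ordered=False, insert_many raises a BulkWriteError whose details list the documents that were rejected:

from pymongo.errors import BulkWriteError

try:
    collection_name.insert_many(logs, ordered=False)
except BulkWriteError as exc:
    # Every valid document has already been inserted at this point;
    # the rejected ones are reported in the exception details.
    for error in exc.details["writeErrors"]:
        print("skipped document at index", error["index"], ":", error["errmsg"])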

Related

How to partially rollback?

I currently populate a database from a third party API that involves downloading files containing multiple SQL INSERT/DELETE/UPDATE statements and then parsing them into SQLAlchemy ORM objects to load into my database.
These files can often contain errors, so I've tried to build in some integrity checks. The particular one I'm currently struggling with is duplicate records - basically receiving a file asking to insert a record that already exists. To avoid this I put a unique index on the fields that form a composite primary key. However, this means I get an error when a file contains an SQL statement that tries to duplicate a record and a flush or commit is subsequently issued.
I don't want to commit records to the database until all the SQL statements for a given file have been processed, so I can keep track of what's been processed. I was thinking that I could issue a flush at the end of processing every statement and then do some error handling if it fails because of a duplicate record, which would include bypassing the offending statement. However, as I understand the docs, issuing a rollback would cancel all the previous statements processed up to that point, when I only want to skip the duplicate one.
Is there a way to partially roll back, or do I need to build an up-front check that queries the database to see whether executing an SQL statement would create a duplicate record?
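One way to get the "skip only the duplicate" behaviour described above is SQLAlchemy's SAVEPOINT support via begin_nested(); a rough sketch, where session and parsed_objects are assumed from the surrounding code:

from sqlalchemy.exc import IntegrityError

for obj in parsed_objects:
    savepoint = session.begin_nested()   # SAVEPOINT around this single statement
    session.add(obj)
    try:
        session.flush()                  # attempt to write just this object
    except IntegrityError:
        savepoint.rollback()             # undo only the offending insert
# everything that survived is still pending in the outer transaction
session.commit()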

BigQuery data not getting inserted

I'm using the Python client library to insert data into a BigQuery table. The code is as follows:
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, ...]
What's the problem? There is no exception either.
The client.insert_rows_json method uses streaming inserts. Data inserted into BigQuery with streaming inserts can take a while to show up in the table preview in the BigQuery console; it does not appear immediately. So you need to query the table to confirm the data was inserted.
There are two possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
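Since the preview can lag behind, a quick way to confirm whether the rows actually landed is to query the table directly; a small sketch, assuming the same client and a fully qualified tablename such as project.dataset.table:

# The preview may lag, but a query will see the streamed rows.
rows = client.query(f"SELECT COUNT(*) AS n FROM `{tablename}`").result()
print(next(iter(rows)).n)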
I got the answer to my question. The problem was that I was inserting data for one more column that was not there. I found a workaround to find out why the data is not being inserted into the BigQuery table.
Convert the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for that particular column.
bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json. Run this command in your terminal and see if it throws any errors. If it throws an error, it's likely that something is wrong with your data or table schema.
Fix the data or the table schema as per the error and retry the insert via Python.
It would be helpful if the Python API raised an error or exception the way the command-line tool does.
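For reference, a small sketch of the conversion step above, writing the rows from the question (data_to_insert) out as newline-delimited JSON for bq load:

import json

# One JSON object per line is what --source_format=NEWLINE_DELIMITED_JSON expects.
with open("newline_delimited_json_file.json", "w") as fh:
    for row in data_to_insert:
        fh.write(json.dumps(row) + "\n")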

Rollback on elasticsearch bulk insert failure

Use case: Right now I am using the Python Elasticsearch API for bulk inserts.
I have a try/except wrapped around my bulk insert code, but whenever an exception occurs, Elasticsearch doesn't roll back the documents that were already inserted.
It also doesn't try to insert the remaining documents after the corrupted document that throws the exception.
There is no such thing; a bulk request contains a list of operations to insert, update, or delete documents.
Elasticsearch is not a transactional database (no ACID guarantees, no transactions, etc.), so you must build a rollback feature yourself.
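If the goal is simply to keep going past the bad documents and collect the failures, a sketch using the bulk helper from the official Python client (the index name, connection details, and docs iterable are assumptions):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
actions = ({"_index": "logs", "_source": doc} for doc in docs)

# raise_on_error=False continues past rejected documents and returns them
# instead of raising, so the valid documents still get indexed.
ok_count, errors = helpers.bulk(es, actions, raise_on_error=False)
for err in errors:
    print("failed:", err)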

Mongoengine: Check if document is already in DB

I am working on a kind of initialization routine for a MongoDB using mongoengine.
The documents we deliver to the user are read from several JSON files and written into the database at the start of the application using the above mentioned init routine.
Some of these documents have unique keys which would raise a mongoengine.errors.NotUniqueError error if a document with a duplicate key is passed to the DB. This is not a problem at all since I am able to catch those errors using try-except.
However, some other documents are something like a bunch of values or parameters, so there is no unique key which I can check in order to prevent them from being inserted into the DB twice.
I thought I could read all existing documents from the desired collection like this:
docs = MyCollection.objects()
and check whether the document to be inserted is already available in docs by using:
doc = MyCollection(parameter='foo')
print(doc in docs)
This prints False even if there is already a MyCollection(parameter='foo') document in the DB.
How can I achieve a duplicate detection without using unique keys?
You can check using an if statement:
if not MyCollection.objects(parameter='foo'):
# insert your documents
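Put together with the save, a minimal sketch of the same check (the parameter field is taken from the question):

doc = MyCollection(parameter='foo')

# An empty QuerySet is falsy, so this only saves when no matching document exists.
if not MyCollection.objects(parameter=doc.parameter):
    doc.save()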

Inconsistent results being returned by mongodb find query

I ran the following query on a collection in my MongoDB database:
db.coll.find({field_name: {$exists: true}}).count() and it returned 2437185. The total record count reported by db.coll.find({}).count() is 2437228.
Now when I run the query db.coll.find({field_name: {$exists: false}}).count(), instead of returning 43 it returned 0.
I have the following two questions :
Does the above scenario mean that the data in my collection has become corrupt?
I had posted a question earlier about this at (Updating records in MongoDB through pymongo leads to deletion of most of them). The person who replied said that updating data in MongoDB could blank out the data but not delete it. What does that mean?
Thank You
I believe you're running into the issue reported at SERVER-1587. What version of MongoDB are you using? If it is less than 1.9.0, you can use the following as a work-around:
db.coll.find({field_name: {$not: {$exists: true}}}).count()
As for the other question, "blanking out" in this case means that an update can change the value of, or unset, any or all fields in a document, but can't remove the document itself. The only way to remove a document is with remove().
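To illustrate the distinction in pymongo terms (the collection and _id are placeholders): an update can clear a field but the document stays; only an explicit delete removes it:

# "Blanking out": the document remains, but field_name is unset.
db.coll.update_one({"_id": doc_id}, {"$unset": {"field_name": ""}})

# Actually removing the document requires an explicit remove/delete call.
db.coll.delete_one({"_id": doc_id})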
