Python-Eve: Prevent inserting duplicates without using unique fields - python

I am trying to prevent inserting duplicate documents by the following approach:
Get a list of all documents from the desired endpoint which will contain all the documents in JSON-format. This list is called available_docs.
Use a pre_POST_<endpoint> hook to handle the request before it is inserted into the database. I am not using the on_insert hook since I need to do this before validation.
Since we can access the request object, use request.json to get the JSON-formatted payload.
Check if request.json is already contained in available_docs
Insert new document if it's not a duplicate only, abort otherwise.
Using this approach I got the following snippet:
def check_duplicate(request):
    if request.json not in available_docs:
        print('Not a duplicate')
    else:
        print('Duplicate')
        flask.abort(422, description='Document is a duplicate and already in database.')
The available_docs list looks like this:
available_docs = [{'foo': ObjectId('565e12c58b724d7884cd02bb'), 'bar': [ObjectId('565e12c58b724d7884cd02b9'), ObjectId('565e12c58b724d7884cd02ba')]}]
The payload request.json looks like this:
{'foo': '565e12c58b724d7884cd02bb', 'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba']}
As you can see, the only difference between the document passed to the API and the document already stored in the DB is the datatype of the IDs. Because of that, the if-statement in my snippet above evaluates to True and the document is judged not to be a duplicate, whereas it definitely is one.
Is there a way to check if a passed document is already in the database? I am not able to use unique fields, since only the combination of all document fields needs to be unique. There is a unique identifier (which I left out in this example), but it is not suitable for the desired comparison since it is a kind of timestamp.
I think something like casting the given IDs at the keys foo and bar to ObjectIds would do the trick, but I do not know how to do this since I do not know where to get the ObjectId datatype from.

Your approach would be much slower than setting a unique rule for the field.
Since, from your example, you are going to compare objectids, can't you simply use those as the _id field for the collection? In Mongo (and Eve of course) that field is unique by default. Actually, you typically don't even define it. You would not need to do anything at all, as a POST of a document with an already existing id would fail right away.
If you can't go that way (maybe you need to compare a different objectid field and still, for some reason, you can't simply set a unique rule for the field), I would look at querying the db for the field value rather than getting all the documents from the db and then scanning them sequentially in code. Something like db.find_one({db_field: new_document_field_value}). If that returns a document, the new document is a duplicate. Make sure db_field is indexed (which usually also holds true for fields tagged with the unique rule).
EDIT after the comments. A trivial implementation would probably be something like this:
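import flask
from eve import Eve

def pre_POST_callback(resource, request):
    # retrieve the MongoDB collection through Eve's data layer
    docs = app.data.driver.db['docs']
    if docs.find_one({'foo': <value>}):
        flask.abort(422, description='Document is a duplicate and already in database.')

app = Eve()
# register the callback so it runs before every POST request
app.on_pre_POST += pre_POST_callback
app.run()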
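If the payload carries its IDs as strings, as in the question, one option is to normalize both sides to plain strings before comparing them. The sketch below assumes that str() of a stored ID yields the same 24-character hex string the client sends, which holds for bson.ObjectId (the class ships with PyMongo and is imported with `from bson import ObjectId`). FakeObjectId, normalize and is_duplicate are hypothetical names used for illustration; FakeObjectId only stands in for the real class so that the snippet runs without a database.

```python
class FakeObjectId:
    """Stand-in for bson.ObjectId so the example runs without PyMongo."""

    def __init__(self, hex_string):
        self._hex = hex_string

    def __str__(self):
        return self._hex


def normalize(value):
    """Recursively turn every leaf value into a string so IDs compare equal."""
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item) for item in value]
    return str(value)


def is_duplicate(payload, stored_docs):
    """Return True if the payload matches any stored document after normalizing."""
    wanted = normalize(payload)
    return any(normalize(doc) == wanted for doc in stored_docs)


available_docs = [{
    'foo': FakeObjectId('565e12c58b724d7884cd02bb'),
    'bar': [FakeObjectId('565e12c58b724d7884cd02b9'),
            FakeObjectId('565e12c58b724d7884cd02ba')],
}]
payload = {
    'foo': '565e12c58b724d7884cd02bb',
    'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba'],
}
duplicate = is_duplicate(payload, available_docs)
```

Stringifying every leaf is deliberately crude: it would also treat the number 1 and the string '1' as equal, and any field that exists only on the stored side (such as _id or a timestamp) has to be stripped before comparing. Querying the database with the IDs cast to ObjectId avoids both issues and does not require loading every document.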

Here's my approach on preventing duplicate records:
def on_insert_subscription(items):
    c_subscription = app.data.driver.db['subscription']
    user = decode_token()
    if user:
        for item in items:
            if c_subscription.find_one({
                'topic': ObjectId(item['topic']),
                'client': ObjectId(user['user_id'])
            }):
                abort(422, description="Client already subscribed to this topic")
            else:
                item['client'] = ObjectId(user['user_id'])
    else:
        abort(401, description='Please provide proper credentials')
What I'm doing here is creating subscriptions for clients. If a client is already subscribed to a topic I throw 422.
Note: the client ID is decoded from the JWT token.

Related

Pymongo BulkWriteResult doesn't contain upserted_ids

Okay, so currently I'm trying to upsert something in a local MongoDB using pymongo. (I check to see if the document is in the db and if it is, update it; otherwise just insert it.)
I'm using bulk_write to do that, and everything is working OK. The data is inserted/updated.
However, I would need the ids of the newly inserted/updated documents, but "upserted_ids" in the BulkWriteResult object is empty, even though it states that it inserted 14 documents.
I've added this screenshot with the variable. Is it a bug, or is there something I'm not aware of?
Finally, is there a way of getting the ids of the documents without actually searching for them in the db? (If possible, I would prefer to use bulk_write.)
Thank you for your time.
EDIT:
As suggested, I added a part of the code so it's easier to get the general idea:
for name in input_list:
    if name not in stored_names:  # completely new entry (both name and package)
        operations.append(InsertOne({"name": name, "package": [package_name]}))
if len(operations) == 0:
    print("## No new permissions to insert")
    return
bulkWriteResult = _db_insert_bulk(collection_name, operations)
and the insert function:
def _db_insert_bulk(collection_name, operations_list):
    return db[collection_name].bulk_write(operations_list)
The upserted_ids field in the pymongo BulkWriteResult only contains the ids of the records that have been inserted as part of an upsert operation, e.g. an UpdateOne or ReplaceOne with the upsert=True parameter set.
As you are performing InsertOne which doesn't have an upsert option, the upserted_ids list will be empty.
The lack of an inserted_ids field in pymongo's BulkWriteResult is an omission in the driver; technically it conforms to the CRUD specification mentioned in D. SM's answer, which annotates the property with "Drivers may choose to not provide this property.".
But ... there is an answer. If you are only doing inserts as part of your bulk update (and not mixed bulk operations), just use insert_many(). It is just as efficient as a bulk write and, crucially, does provide the inserted_ids value in the InsertManyResult object.
from pymongo import MongoClient
db = MongoClient()['mydatabase']
inserts = [{'foo': 'bar'}]
result = db.test.insert_many(inserts, ordered=False)
print(result.inserted_ids)
Prints:
[ObjectId('5fb92cafbe8be8a43bd1bde0')]
This functionality is part of the CRUD specification and should be implemented by compliant drivers, including pymongo. Refer to the pymongo documentation for correct usage.
Example in Ruby:
irb(main):003:0> c.bulk_write([insert_one:{a:1}])
=> #<Mongo::BulkWrite::Result:0x00005579c42d7dd0 #results={"n_inserted"=>1, "n"=>1, "inserted_ids"=>[BSON::ObjectId('5fb7e4b12c97a60f255eb590')]}>
Your output shows that zero documents were upserted, therefore there wouldn't be any ids associated with the upserted documents.
Your code doesn't appear to show any upserts at all, which again means you won't see any upserted ids.

Unable to retrieve data from MongoDB if same collection has different elements from python

I have a set of JSON documents in a MongoDB collection which are received through webhooks. I don't have control over them, and the elements of one document won't be the same as those of another. I'm able to retrieve the elements whose key is present in all documents, but I need to retrieve data irrespective of whether it is present in the other documents or not. Attaching the pic of values present in MongoDB.
I'm using the code below to insert webhooks into MongoDB:
@app.route('/webhook', methods=['POST', 'GET'])
def respond():
    collection10 = db['webhooks']
    a = request.get_json()
    print(a)
    collection10.insert_many(a)
    return render_template("signin.html")
Suppose I try to retrieve "_id": I can easily retrieve it since both documents have "_id". But if I try to retrieve elements which are present in one document and not in another, I get an error.
I'm using this code to retrieve elements:
@app.route('/webhookdisplay', methods=['POST', 'GET'])
def webhooksdis():
    collection10 = db['webhooks']
    for i in collection10.find({}):
        posts = i['name']
        print(posts)
    return render_template("webhooks.html", posts = posts)
For the above code I get error
KeyError: 'name'
If I retrieve "_id" in the same fashion as mentioned above it'll be retrieved.
Expected outcome: I need to retrieve nested data irrespective of whether it is present in the other documents or not. It would be great if there are any other approaches to display particular data in the form of a table in an HTML page.
Purpose: once I get the individual data, I can render it in the frontend using Jinja in the form of a table.
If you're not sure whether the returned record will contain a particular key, then you should use the built-in .get() method, which returns None by default if the key isn't present, unlike square-bracket access. This will avoid the KeyError exception you are seeing.
posts = i.get('name')
if posts is None:
    # Handle logic if name doesn't exist
    pass
EDIT: If you need a nested field:
name = i.get('data', {}).get('geofence_metadata', {}).get('name')
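Chained .get() calls still fail with an AttributeError when an intermediate value exists but is not a dict (for example None). A small helper can guard against that, and collecting one row per document avoids overwriting a single variable inside the loop. This is only a sketch: get_nested, the sample documents and the 'n/a' placeholder are hypothetical, while the data.geofence_metadata.name path is the one used above.

```python
def get_nested(doc, path, default=None):
    """Walk a dotted path through nested dicts, returning default if any step is missing."""
    current = doc
    for key in path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Two documents with different shapes, as might arrive from different webhooks.
documents = [
    {'_id': 1, 'data': {'geofence_metadata': {'name': 'Office'}}},
    {'_id': 2, 'event': 'ping'},
]

# Build one row per document so a template can loop over them and render a table.
rows = [
    {'id': doc['_id'], 'name': get_nested(doc, 'data.geofence_metadata.name', 'n/a')}
    for doc in documents
]
```

The rows list can then be passed to render_template and iterated in Jinja with a for loop to produce one table row per document.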

API Get method to get all tweets with hashtag count greater than within MongoDB in JSON format

I have a MongoDB database that contains a number of tweets. I want to be able to get, through my API, a JSON list of all the tweets that contain a number of hashtags greater than that specified by the user in the URL (e.g. http://localhost:5000/tweets?morethan=5, which is 5 in this case).
The hashtags are contained inside the entities column in the database, along with other columns such as user_mentions, urls, symbols and media. Here is the code I've written so far, but it doesn't return anything.
#!flask/bin/python
app = Flask(__name__)

@app.route('/tweets', methods=['GET'])
def get_tweets():
    # Connect to database and pull back collections
    db = client['mongo']
    collection = db['collection']
    parameter = request.args.get('morethan')
    if parameter:
        # array indexes are zero-based, so index N exists only when there are more than N hashtags
        gt_parameter = int(parameter)
        key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
        cursor = collection.find({key_im_looking_for: {"$exists": True}})
EDIT: IT WORKS!
The code in question is this line
cursor = collection.find({"entities": {"hashtags": parameter}})
This answer explains why it is impossible to directly perform what you ask.
mongodb query: $size with $gt returns always 0
That answer also describes potential (but poor) ideas to get around it.
The best suggestion is to modify all your documents and put a "num_hashtags" key in somewhere, index that, and query against it.
Using the Twitter JSON API, you could update all your documents and put the num_hashtags key in the entities document.
Alternatively, you could solve your immediate problem by doing a very slow full collection scan for every query, abusing MongoDB dot notation to check whether a hashtag exists one position past your parameter. Array indexes in dot notation are zero-based, so that position is the index equal to the parameter itself.
gt_parameter = int(parameter)  # "more than N" hashtags means index N (the N+1th element) exists
key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
# py2.7 => key_im_looking_for = "entities.hashtags.%s" % (gt_parameter)
# in this example it would be "entities.hashtags.5"
cursor = collection.find({key_im_looking_for: {"$exists": True}})
The best answer (and the key reason to use a NoSQL database in the first place) is that you should modify your data to suit your retrieval. If possible, you should perform an inplace update adding the num_hashtags key.
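Both options can be sketched as plain dictionary builders, without a database connection. The function names more_than_filter, with_hashtag_count and count_filter are hypothetical, as is the sample tweet; the entities.hashtags layout and the num_hashtags key are the ones discussed above. Note that positions in MongoDB dot notation are zero-based.

```python
def more_than_filter(n):
    """Filter matching documents whose entities.hashtags array has more than n items.

    Array positions in dot notation are zero-based, so position n exists
    only when the array holds at least n + 1 elements.
    """
    return {"entities.hashtags.{}".format(n): {"$exists": True}}


def with_hashtag_count(tweet):
    """Return a copy of the tweet with a num_hashtags key added under entities."""
    entities = dict(tweet.get("entities", {}))
    entities["num_hashtags"] = len(entities.get("hashtags", []))
    updated = dict(tweet)
    updated["entities"] = entities
    return updated


def count_filter(n):
    """Filter for the precomputed counter: strictly more than n hashtags."""
    return {"entities.num_hashtags": {"$gt": n}}


tweet = {"text": "hi", "entities": {"hashtags": [{"text": "a"}, {"text": "b"}]}}
updated = with_hashtag_count(tweet)
```

Either filter dictionary would be passed to collection.find(). Only the count_filter variant can make use of an index, here one on entities.num_hashtags, which is why precomputing the counter is the preferred route.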

find_and_modify with upsert using Python-EVE

There is a common use case where you need to either update or insert. For instance:
obj = db['data'].find_and_modify(
    {
        'Name': data['Name'],
        'SourcePage': data['SourcePage'],
    },
    data,
    upsert=True
)
Of course I can split this request into a GET followed by a PATCH or an INSERT, but maybe there is a better way?
P.S. eve provides some nice features like document versions and meta data (_created, _updated etc.)
upsert support is now part of the upcoming release.
One doesn't have to do anything different. The feature is "turned on" by default. So if a user tries to PUT an item that does not exist, a new item will be created. The id field sent in the payload is ignored.
If a user does not want this feature, the user needs to explicitly set UPSERT_ON_PUT to False. Now, the user gets the "old" behaviour back. i.e when the user tries to PUT non-existing item, 404 is returned.
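As a rough illustration, switching the feature off could look like the settings below. Only the UPSERT_ON_PUT key comes from the answer; the endpoint name, schema and method list are assumptions built from the field names in the question.

```python
# Hypothetical Eve settings; only UPSERT_ON_PUT is taken from the answer above.
settings = {
    'UPSERT_ON_PUT': False,  # PUT on a missing item returns 404 instead of creating it
    'DOMAIN': {
        'data': {  # endpoint name borrowed from the question's collection
            'item_methods': ['GET', 'PATCH', 'PUT', 'DELETE'],
            'schema': {
                'Name': {'type': 'string'},
                'SourcePage': {'type': 'string'},
            },
        },
    },
}

# The dict would typically be handed to Eve with: app = Eve(settings=settings)
```

Leaving the key out keeps the default behaviour, in which a PUT to a non-existing item creates it.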

How to update the column name in a datastore using google app engine

I have an entity with a column named geocode which initially has a null value.
I want to update the value of the geocode column in the datastore.
I tried using the snippet of code below, but it didn't work. Please help. Thanks in advance.
The Datastore
class doctor(db.Model):
    doctorUser=db.UserProperty()
    geocode=db.GeoPtProperty()
The function to update the column of doctor
def post(self):
    user=users.get_current_user()
    q=doctor.all().filter("doctorUser =",user)
    if q.count()==1:
        lat=self.request.get('lat')
        lng=self.request.get('long')
        code=str(lat)+", "+str(lng)
        q[0].geocode=code
        q[0].put()
SOLVED
Here is the change that i made and it worked!
def post(self):
    user=users.get_current_user()
    q=doctor.all().filter("doctorUser =",user)
    qo=q.get()
    if q.count()==1:
        lat=self.request.get('lat')
        lng=self.request.get('long')
        code=str(lat)+", "+str(lng)
        qo.geocode=code
        qo.put()
You are trying to process a query object; you need to either fetch, iterate over the query, or use get() to obtain results. The query that you are using could return more than one result, unless you have logic elsewhere to ensure unique values for doctorUser.
Also grab your lat/long from the request before you enter the loop.
You should also stop thinking about the datastore and your model in terms of column names. These are attributes of entities; there are no columns in the App Engine datastore.
Whilst the solution you added to your original question might work, it is inefficient. For instance, you get an entity and then do a count(), which reruns the same query. I have rewritten it to be a bit more efficient; however, there are still potential problems (such as multiple records that match doctorUser = user).
def post(self):
    user=users.get_current_user()
    if user:  # if the user is not logged in, user will be None
        results=doctor.all().filter("doctorUser =",user).fetch(10)
        if len(results) > 1:
            raise ValueError('More than one doctorUser matches')
        if results:
            qo=results[0]  # fetch() returns a list, so take the single match
            lat=self.request.get('lat')
            lng=self.request.get('long')
            code=str(lat)+", "+str(lng)
            qo.geocode=code
            qo.put()
