Bulk insert MongoDB - Pymongo limit issue - python

I was trying to bulk insert documents into MongoDB with PyMongo, inserting only where there is no match on already existing ids (sourceID). However, most of my documents include quite long texts and therefore cannot be inserted - I am not sure whether this is due to a size limit or a character limit (it works for documents with little text). I tried insert_many() instead, but to my knowledge it will not do the required task of inserting documents only when there is no match on id (sourceID) while leaving existing documents with a match unchanged.
Are there any alternative solutions to reach my desired goal?
This is the code that I have written for the bulk insert:
bulk = pymongo.bulk.BulkOperationBuilder(events, ordered=False)
for doc in test:
    bulk.find({"sourceID": doc["sourceID"]}).upsert().update({
        "$setOnInsert": doc
    })
response = bulk.execute()
'events' is the database that I am inserting to and 'test' is the array that I am trying to insert.
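For reference, current PyMongo releases drop BulkOperationBuilder in favour of bulk_write. Below is a minimal sketch of the same insert-only-if-no-match pattern, treating 'events' as a collection object and using a placeholder for 'test'; note that MongoDB also caps individual BSON documents at 16 MB, which is the usual reason very long texts fail to insert:

from pymongo import MongoClient, UpdateOne

# Assumed setup: 'events' is treated as a collection here; 'test' stands in
# for the list of documents from the question.
events = MongoClient()["mydatabase"]["events"]
test = [{"sourceID": 1, "text": "some long text"}]

operations = [
    UpdateOne(
        {"sourceID": doc["sourceID"]},  # match on the existing id
        {"$setOnInsert": doc},          # written only when the upsert inserts a new document
        upsert=True,
    )
    for doc in test
]
result = events.bulk_write(operations, ordered=False)
print(result.upserted_count, "new documents inserted")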

Related

Finding document containing array of nested names in pymongo (CrossRef data)

I have a dataset of CrossRef works records stored in a collection called works in MongoDB and I am using a Python application to query this database.
I am trying to find documents based on one author's name. Removing extraneous details, a document might look like this:
{'DOI': 'some-doi',
 'author': [{'given': 'Albert', 'family': 'Einstein', 'affiliation': []},
            {'given': 'R.', 'family': 'Feynman', 'affiliation': []},
            {'given': 'Paul', 'family': 'Dirac', 'affiliation': ['University of Florida']}]
}
It isn't clear to me how to combine the queries to get just Albert Einstein's papers.
I have indexes on author.family and author.given, and I've tried:
cur = works.find({'author.family':'Einstein','author.given':'Albert'})
This returns all of the documents by people called 'Albert' and all of those by people called 'Einstein'. I can filter this manually, but it's obviously less than ideal.
I also tried:
cur = works.find({'author':{'given':'Albert','family':'Einstein','affiliation':[]}})
But this returns nothing (after a very long delay). I've tried this with and without 'affiliation'. There are a few questions on SO about querying nested fields, but none seem to concern the case where we're looking for 2 specific things in 1 nested field.
Your issue is that author is a list.
You can use an aggregate query to unwind this list to objects, and then your query would work:
cur = works.aggregate([
    {'$unwind': '$author'},
    {'$match': {'author.family': 'Einstein', 'author.given': 'Albert'}}
])
Alternatively, use $elemMatch which matches on arrays that match all the elements specified.
cur = works.find({"author": {'$elemMatch': {'family': 'Einstein', 'given': 'Albert'}}})
Also consider using multikey indexes.
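For example, a compound index across both embedded author fields can back the $elemMatch query; a sketch using PyMongo's create_index (MongoDB marks it multikey automatically because author is an array):

works.create_index([('author.family', 1), ('author.given', 1)])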

Pymongo BulkWriteResult doesn't contain upserted_ids

Okay, so currently I'm trying to upsert something in a local MongoDB using PyMongo (I check to see if the document is in the db and, if it is, update it; otherwise I just insert it).
I'm using bulk_write to do that, and everything is working ok. The data is inserted/updated.
However, I need the ids of the newly inserted/updated documents, but the upserted_ids field in the BulkWriteResult object is empty, even though it states that it inserted 14 documents.
I've added a screenshot of the variable. Is it a bug, or is there something I'm not aware of?
Finally, is there a way of getting the ids of the documents without actually searching for them in the db? (If possible, I would prefer to use bulk_write)
Thank you for your time.
EDIT:
As suggested, I added part of the code so it's easier to get the general idea:
for name in input_list:
    if name not in stored_names:  # completely new entry (both name and package)
        operations.append(InsertOne({"name": name, "package": [package_name]}))
if len(operations) == 0:
    print("## No new permissions to insert")
    return
bulkWriteResult = _db_insert_bulk(collection_name, operations)
and the insert function:
def _db_insert_bulk(collection_name, operations_list):
    return db[collection_name].bulk_write(operations_list)
The upserted_ids field in the pymongo BulkWriteResult only contains the ids of the records that have been inserted as part of an upsert operation, e.g. an UpdateOne or ReplaceOne with the upsert=True parameter set.
As you are performing InsertOne which doesn't have an upsert option, the upserted_ids list will be empty.
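For contrast, here is a minimal sketch (collection and field names are made up) where the bulk write really does upsert, so upserted_ids comes back populated, keyed by each operation's position in the list:

from pymongo import MongoClient, UpdateOne

coll = MongoClient()['mydatabase']['permissions']  # hypothetical collection

ops = [UpdateOne({'name': n}, {'$setOnInsert': {'name': n, 'package': []}}, upsert=True)
       for n in ['alpha', 'beta']]
result = coll.bulk_write(ops, ordered=False)

print(result.upserted_count)  # 2 on an empty collection
print(result.upserted_ids)    # {0: ObjectId(...), 1: ObjectId(...)}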
The lack of an inserted_ids field in pymongo's BulkWriteResult is an omission in the drivers; technically it conforms to the CRUD specification mentioned in D. SM's answer, as it is annotated with "Drivers may choose to not provide this property.".
But ... there is an answer. If you are only doing inserts as part of your bulk update (and not mixed bulk operations), just use insert_many(). It is just as efficient as a bulk write and, crucially, does provide the inserted_ids value in the InsertManyResult object.
from pymongo import MongoClient
db = MongoClient()['mydatabase']
inserts = [{'foo': 'bar'}]
result = db.test.insert_many(inserts, ordered=False)
print(result.inserted_ids)
Prints:
[ObjectId('5fb92cafbe8be8a43bd1bde0')]
This functionality is part of the CRUD specification and should be implemented by compliant drivers, including pymongo. Refer to the pymongo documentation for correct usage.
Example in Ruby:
irb(main):003:0> c.bulk_write([insert_one:{a:1}])
=> #<Mongo::BulkWrite::Result:0x00005579c42d7dd0 #results={"n_inserted"=>1, "n"=>1, "inserted_ids"=>[BSON::ObjectId('5fb7e4b12c97a60f255eb590')]}>
Your output shows that zero documents were upserted, therefore there wouldn't be any ids associated with the upserted documents.
Your code doesn't appear to show any upserts at all, which again means you won't see any upserted ids.

Firebase is only responding with a single document when I query it even though multiple meet the query criteria

I'm trying to write a cloud function that returns users near a specific location. get_nearby() returns a list of tuples containing upper and lower bounds for a geohash query, and then this loop should query firebase for users within those geohashes.
user_ref = db.collection(u'users')
db_response = []
for query_range in get_nearby(lat, long, radius):
    query = user_ref.where(u'geohash', u'>=', query_range[0]).where(u'geohash', u'<=', query_range[1]).get()
    for el in query:
        db_response.append(el.to_dict())
For some reason when I run this code, it returns only one document from my database, even though there are three other documents with the same geohash as that one. I know the documents are there, and they do get returned when I request the entire collection. What am I missing here?
edit:
The database currently has 4 records in it, 3 of which should be returned in this query:
{
    {name: "Trevor", geohash: "dnvtz"},  # this is the one that gets returned
    {name: "Test", geohash: "dnvtz"},
    {name: "Test", geohash: "dnvtz"}
}
query_range is a tuple with two values. A lower and upper bound geohash. In this case, it's ("dnvt0", "dnvtz").
I decided to clear all documents from my database and then generate a new set of sample data to work with (everything there was only for testing anyway, nothing important). After pushing the new data to Firestore, everything is working. My only assumption is that even though the strings matched up, I'd used the wrong encoding on some of them.
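If the symptom ever returns, a speculative debugging sketch (not part of the original fix) is to dump the repr() of every stored geohash so that stray whitespace or lookalike characters become visible next to the query bounds:

# Print each stored geohash verbatim; invisible differences show up in repr().
for snap in db.collection(u'users').stream():
    print(snap.id, repr(snap.to_dict().get('geohash')))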

API Get method to get all tweets with hashtag count greater than within MongoDB in JSON format

I have a MongoDB database that contains a number of tweets. I want to be able to get, as a JSON list through my API, all the tweets that contain a number of hashtags greater than that specified by the user in the URL (e.g. http://localhost:5000/tweets?morethan=5, which is 5 in this case).
The hashtags are contained inside the entities column in the database, along with other columns such as user_mentions, urls, symbols and media. Here is the code I've written so far, but it doesn't return anything.
#!flask/bin/python
from flask import Flask, request
from pymongo import MongoClient

client = MongoClient()
app = Flask(__name__)

@app.route('/tweets', methods=['GET'])
def get_tweets():
    # Connect to the database and pull back the collection
    db = client['mongo']
    collection = db['collection']
    parameter = request.args.get('morethan')
    if parameter:
        gt_parameter = int(parameter) + 1  # question said greater than, not greater or equal
        key_im_looking_for = "entities.hashtags.{}".format(gt_parameter)  # create the namespace
        cursor = collection.find({key_im_looking_for: {"$exists": True}})
EDIT: IT WORKS!
The line that was originally causing the problem was this one:
cursor = collection.find({"entities": {"hashtags": parameter}})
This answer explains why it is impossible to directly perform what you ask.
mongodb query: $size with $gt returns always 0
That answer also describes potential (but poor) ideas to get around it.
The best suggestion is to modify all your documents, put a "num_hashtags" key in each of them, index that, and query against it.
Using the Twitter JSON API you could update all your documents and put the num_hashtags key in the entities document.
Alternatively, you could solve your immediate problem with a very slow full collection scan on every query, abusing MongoDB dot notation to check whether the array element at the index one greater than your parameter exists.
gt_parameter = int(parameter) + 1 # question said greater than not greater or equal
key_im_looking_for = "entities.hashtags.{}".format(gt_parameter) #create the namespace#
# py2.7 => key_im_looking_for = "entities.hashtags.%s" %(gt_parameter)
# in this example it would be "entities.hashtags.6"
cursor = collection.find({key_im_looking_for: {"$exists": True}})
The best answer (and the key reason to use a NoSQL database in the first place) is that you should modify your data to suit your retrieval. If possible, perform an in-place update adding the num_hashtags key, as sketched below.
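A sketch of that in-place update, assuming MongoDB 4.2 or newer (which accepts an aggregation pipeline in update_many) and the entities.hashtags array from the question; the counter is stored at the top level here purely for simplicity:

from pymongo import MongoClient, ASCENDING

collection = MongoClient()['mongo']['collection']  # names taken from the question's code

# One-off migration: persist the hashtag count on every document.
collection.update_many(
    {},
    [{"$set": {"num_hashtags": {"$size": {"$ifNull": ["$entities.hashtags", []]}}}}]
)
collection.create_index([("num_hashtags", ASCENDING)])

# The API handler can then query the counter directly, e.g. for morethan=5:
cursor = collection.find({"num_hashtags": {"$gt": 5}})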

Bulk upsert not working pymongo

I am trying to update a document if it exists and insert it if it does not exist in a collection. I am inserting pandas data frame records as documents into the collection based on _id. The insert of new documents is working fine, but the update of fields in existing documents is not working.
bulk = pymongo.bulk.BulkOperationBuilder(pros_rides, ordered=False)
for doc in bookings_df:
    bulk.find({"_id": doc["_id"]}).upsert().update({
        "$setOnInsert": doc
    })
response = bulk.execute()
What am I missing?
An upsert can either update or insert a document; the "$setOnInsert" operation is only executed when the document is inserted, not when it is updated. In order to update the document if it exists, you must provide some operations that will be executed when the document is updated.
Try something like this instead:
bulk = pros_rides.initialize_unordered_bulk_op()
for doc in bookings_df:
    bulk.find({'_id': doc['_id']}).upsert().replace_one(doc)
bulk.execute()
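The initialize_unordered_bulk_op API has since been deprecated and removed from newer PyMongo releases; a sketch of the same replace-or-insert logic with bulk_write, reusing the collection and data frame names from the question:

from pymongo import ReplaceOne

operations = [
    ReplaceOne({'_id': doc['_id']}, doc, upsert=True)  # insert if missing, replace otherwise
    for doc in bookings_df
]
response = pros_rides.bulk_write(operations, ordered=False)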
