I currently have a dictionary with data being pulled from an API, where I have given each datapoint its own variable (job_id, jobtitle, company, etc.):
output = {
    'ID': job_id,
    'Title': jobtitle,
    'Employer': company,
    'Employment type': emptype,
    'Fulltime': tid,
    'Deadline': deadline,
    'Link': webpage
}
that I want to add to my database, easy enough:
db.jobs.insert_one(output)
but this is all in a for loop that will create 30-ish unique new documents, with names, titles, links and whatnot. This script will be run more than once, so what I would like is for it to insert the "output" as a document only if it doesn't already exist in the database. All of these new documents have their own unique IDs coming from the job_id variable; am I able to check against that?
You need to try two things:
1) Doing a .find() and then, if no document is found for the given job_id, writing to the DB takes two database calls. Instead, you can put a unique index on the job_id field, which will throw an error if your operation tries to insert a duplicate document. (A unique index is a much safer way to avoid duplicates, and it helps even if your code logic fails.)
2) If you have 30 dicts, you don't need to iterate 30 times and call insert_one for 30 database calls; instead, use insert_many, which takes an array of dicts and writes them to the database (see the sketch below).
Note: By default, all dicts are written in the order they appear in the array, so if one dict fails because of a duplicate-key error, insert_many stops at that point without inserting the rest. To overcome this, pass the option ordered=False; that way all documents will be inserted except the duplicates.
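Putting both points together, here is a minimal sketch, assuming the ~30 dicts from the loop are collected into a list called outputs (the jobs collection and the 'ID' field come from the question; the connection details are illustrative):
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

db = MongoClient()['mydatabase']  # assumed connection details

# One-time setup: the unique index makes the database itself reject any
# document whose 'ID' value already exists, even if the code logic fails.
db.jobs.create_index('ID', unique=True)

outputs = []  # fill with the ~30 dicts built in the for loop

try:
    # ordered=False keeps inserting past duplicate-key failures.
    db.jobs.insert_many(outputs, ordered=False)
except BulkWriteError:
    pass  # duplicates were skipped; everything else was inserted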
EDIT:
replace
db.jobs.insert_one(output)
with
db.jobs.replace_one({'ID': job_id}, output, upsert=True)
ORIGINAL ANSWER with worked example:
Use replace_one() with upsert=True. You can run this multiple times: it will insert if the ID isn't found, or replace the document if it is found. This isn't quite what you were asking, as the data is always updated (so newer data will overwrite any existing data).
from pymongo import MongoClient

db = MongoClient()['mydatabase']

for i in range(30):
    db.employer.replace_one(
        {'ID': i},
        {
            'ID': i,
            'Title': 'jobtitle',
            'Employer': 'company',
            'Employment type': 'emptype',
            'Fulltime': 'tid',
            'Deadline': 'deadline',
            'Link': 'webpage'
        },
        upsert=True)

# Should always print 30 regardless of number of times run.
print(db.employer.count_documents({}))
Well, as the title suggests, I want to query my DynamoDB table using a GSI with its partition key and sort key (both from the GSI). I tried several approaches, but without any success.
I have a table with a url-date-index GSI, where url is the partition key and date is the sort key.
I tried the following:
Using KeyConditionExpression with the & operator:
This one gave me the error: TypeError: expected string or bytes-like
boto3.resource('dynamodb').Table('table').query(
    IndexName='url-date-index',
    KeyConditionExpression=conditions.Key('url')).eq(url) & conditions.Key('date')).eq(date)
)
Using KeyConditionExpression and FilterExpression:
This retrieved the following error: Filter Expression can only contain non-primary key attributes
boto3.resource('dynamodb').Table('table').query(
    IndexName='url-date-index',
    KeyConditionExpression=conditions.Key('url')).eq(url),
    FilterExpression=conditions.Key('date')).eq(date)
)
Using ExpressionAttributeNames, ExpressionAttributeValues and KeyConditionExpression:
This returned nothing, not even the item that matches the url and date in the table.
boto3.resource('dynamodb').Table('table').query(
    IndexName='url-date-index',
    ExpressionAttributeNames={
        '#n0': 'url',
        '#n1': 'date'
    },
    ExpressionAttributeValues={
        ':v0': url,
        ':v1': date
    },
    KeyConditionExpression='(#n0 = :v0) AND (#n1 = :v1)'
)
Does anyone know what I'm doing wrong or what I can do to make this work?
In your particular use-case, you'll want to use ExpressionAttributeNames since your attribute names url and date are reserved words in DynamoDB.
The DynamoDB docs on querying secondary indexes give an example of a properly structured query, which we can apply to your situation.
Using that as a guide, we can construct what the arguments to your query operation should look like. For example:
{
    "TableName": "table",
    "IndexName": "url-date-index",
    "KeyConditionExpression": "#pk = :pk And #sk = :sk",
    "ExpressionAttributeNames": {"#pk": "url", "#sk": "date"},
    "ExpressionAttributeValues": {":pk": {"S": url}, ":sk": {"S": date}}
}
If this still doesn't work for you, consider checking out the NoSQL Workbench for DynamoDB. Among its many useful features, it has an Operation Builder that helps you construct DynamoDB operations using a graphical interface. You can even run the operation against your live database. Once you have the operation working as you want, the tool can translate it into a complete Python, JavaScript (Node) or Java code sample, which you can use to see how the operation is constructed.
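For reference, here is a sketch of the same query through boto3's resource interface (url and date below are illustrative values). Note that the resource-level Table.query takes plain Python values in ExpressionAttributeValues; the {"S": ...} type wrappers shown above are only needed with the low-level client:
import boto3

table = boto3.resource('dynamodb').Table('table')

url = 'https://example.com'  # illustrative values
date = '2020-01-01'

response = table.query(
    IndexName='url-date-index',
    KeyConditionExpression='#pk = :pk And #sk = :sk',
    ExpressionAttributeNames={'#pk': 'url', '#sk': 'date'},
    ExpressionAttributeValues={':pk': url, ':sk': date},
)
print(response['Items'])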
query is a function with some named arguments (IndexName, KeyConditionExpression, ...).
Look closely at the parentheses when you call it with those named arguments: the stray ) after conditions.Key('url') closes the query(...) call early, so .eq(url) and everything after it sit outside the argument list:
boto3.resource('dynamodb').Table('table').query(
    IndexName='url-date-index',
    KeyConditionExpression=conditions.Key('url')).eq(url) & conditions.Key('date')).eq(date)
)
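With the parentheses balanced, the call becomes a sketch like the following (url and date are illustrative values; the resource-level Key builder also handles the reserved words url and date for you):
import boto3
from boto3.dynamodb import conditions

table = boto3.resource('dynamodb').Table('table')

url = 'https://example.com'  # illustrative values
date = '2020-01-01'

response = table.query(
    IndexName='url-date-index',
    # Each Key(...).eq(...) is closed before & combines the two conditions.
    KeyConditionExpression=(
        conditions.Key('url').eq(url) & conditions.Key('date').eq(date)
    ),
)
print(response['Items'])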
I'm a beginner in MongoDB and PyMongo, and I'm working on a project where I have a students MongoDB collection. What I want is to add a new field, specifically the address of a student, to each element in my collection (the field is obviously added everywhere as null and will be filled in by me later).
However, when I try using this specific example to add a new field, I get the following syntax error:
client = MongoClient('mongodb://localhost:27017/') #connect to local mongodb
db = client['InfoSys'] #choose infosys database
students = db['Students']
students.update( { $set : {"address":1} } ) #set address field to every column (error happens here)
How can I fix this error?
You are using the update operation in the wrong manner. The update operation has the following syntax:
db.collection.update(
    <query>,
    <update>,
    <options>
)
The main parameter <query> is not mentioned at all. It has to be at least an empty document, {}. In your case the following query will work:
db.students.update(
    {},                         // Match all the documents.
    { $set: { "address": 1 } }, // Set the address field.
    { multi: true }             // Update multiple documents; otherwise Mongo will just update the first matching document.
)
So, in Python, you can use update_many to achieve this. It will look like:
students.update_many(
    {},
    { "$set": { "address": 1 } }
)
You can read more about this operation here.
The previous answer here is spot on, but it looks like your question may relate more to PyMongo and how it manages updates to collections. https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html
According to the docs, it looks like you may want to use the 'update_many()' function. You will still need to make your query (all documents, in this case) as the first argument, and the second argument is the operation to perform on all records.
client = MongoClient('mongodb://localhost:27017/') #connect to local mongodb
db = client['InfoSys'] #choose infosys database
students = db['Students']
students.update_many({}, {"$set": {"address": 1}})
I solved my problem by iterating through every element in my collection and adding the address field to each one.
cursor = students.find({})
for student in cursor:
    # Filter by _id rather than matching the whole document.
    students.update_one({'_id': student['_id']}, {'$set': {'address': '1'}})
Let's take this simple collection col with 2 documents:
{
    "_id" : ObjectId("5ca4bf475e7a8e4881ef9dd2"),
    "timestamp" : 1551736800,
    "score" : 10
}
{
    "_id" : ObjectId("5ca4bf475e7a8e4881ef9dd3"),
    "timestamp" : 1551737400,
    "score" : 12
}
To access the last timestamp (the one in the second document), I first did this request:
a = db['col'].find({}).sort("_id", -1)
and then a[0]['timestamp']
But as there will be a lot of documents in this collection, I think it would be more efficient to request only the last one with the limit function, like:
a = db['col'].find({}).sort("_id", -1).limit(1)
and then
for doc in a:
    lastTimestamp = doc['timestamp']
As there will be only one, I can declare the variable inside the loop.
So, three questions:
Do I have to worry about memory or speed issues if I keep using the first request and index the first element of the cursor?
Is there a smarter way to access the first element of the cursor than using a loop, when using limit?
Is there another way to get that timestamp that I don't know about?
Thanks!
Python 3.6 / Pymongo 3.7
If you are using a field with a unique index in the selection criteria, you should use the find_one method, which will return the only document that matches your query.
That being said, the find method returns a Cursor object and does not load the data into memory.
You might get better performance if you used a filter; your query as it is now will do a collection scan.
If you are not using a filter and want to retrieve the last document, the clean way is with the Python built-in next function (the cursor also has a next method):
cur = db["col"].find().sort({"_id": -1}).limit(1):
with cur:
doc = next(cur, None) # None when we have empty collection.
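If you'd rather skip the cursor entirely, find_one accepts the same sort option and returns the document (or None) directly; a small sketch:
# find_one applies the sort server-side and returns a dict or None,
# so no loop or cursor handling is needed.
doc = db["col"].find_one(sort=[("_id", -1)])
lastTimestamp = doc["timestamp"] if doc else None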
find().sort() with limit(1) is fast, so don't worry about the speed; it's a fine way to access the first element of the cursor.
Let's say I have a mongodb collection of the following layout:
{'number':1, '_id':...}
{'number':2, '_id':...}
{'number':4, '_id':...}
and so on. As demonstrated, not all the numbers currently present have to be consecutive.
I want to write code which (a) determines the highest value of number found in the collection and then (b) inserts a new document whose value for number is 1 higher than the current largest.
So if this is the only code that operates on the collection, no particular value for number should be duplicated. The issue is that, done naively, this creates a race condition where two threads of this code running in parallel might find the same highest value and then insert the same next highest number twice.
So how would I do this atomically? I'm working in Python, so I would prefer a solution in that language, but I will accept an answer that explains the concept in a way that can be adapted to any language.
MongoEngine does what you're looking for in its SequenceField.
Create a new collection called indexes. This collection will look like this:
[
    { '_id': 'mydata.number', 'next': 5 }
]
Whenever you'd like to get and set the next index, you simply use the following statement:
from pymongo import ReturnDocument

# find_one_and_update is the modern replacement for the removed
# find_and_modify helper; the operation itself is unchanged.
counter = collection.find_one_and_update(
    { '_id': 'mydata.number' },
    { '$inc': { 'next': 1 } },
    return_document=ReturnDocument.AFTER,
    upsert=True)
This finds and updates the sequence atomically in MongoDB and retrieves the next number. If the sequence doesn't exist, it is generated.
Thus, whenever you want to insert a new value into your collection, call the code above. If you want to maintain multiple indexes across different collections and their fields, simply change mydata.number to another string referencing your "index."
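Putting it together, a sketch of the full insert path under the names used above (the connection details and the mydata data collection are assumptions for illustration):
from pymongo import MongoClient, ReturnDocument

db = MongoClient()['mydatabase']  # assumed connection details
indexes = db['indexes']
mydata = db['mydata']

# Reserve the next number atomically; $inc inside find_one_and_update
# guarantees two threads never receive the same value.
seq = indexes.find_one_and_update(
    {'_id': 'mydata.number'},
    {'$inc': {'next': 1}},
    return_document=ReturnDocument.AFTER,
    upsert=True,
)
mydata.insert_one({'number': seq['next']})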
There is no clean transactional way to do this in MongoDB. This is why the ObjectId datatype exists: http://api.mongodb.org/python/current/api/bson/objectid.html
Or you can generate a unique key inside Python using something like UUID: https://docs.python.org/2/library/uuid.html
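For instance, a minimal sketch of the UUID approach (the connection details and collection name col are illustrative):
import uuid
from pymongo import MongoClient

col = MongoClient()['mydatabase']['col']  # illustrative names

# A client-generated key is unique without any coordination between
# writers, which sidesteps the read-then-insert race entirely.
col.insert_one({'_id': str(uuid.uuid4()), 'number': 5})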
I am using PyMongo to insert documents into MongoDB.
Here is the code from my router.py file:
temp = db.admin_collection.find().sort([("_id", -1)]).limit(1)
for doc in temp:
    admin_id = str(int(doc['_id']) + 1)

admin_doc = {
    '_id': admin_id,
    'question': ques,
    'answer': ans,
}
collection.insert(admin_doc)
What should I do so that on every insert the _id is incremented by 1?
It doesn't seem like a very good idea, but if you really want to go through with it, you can try a setup like the one below.
It should work well enough in a low-traffic application on a single server, but I wouldn't try anything like this in a replicated or sharded environment, or if you perform a large number of inserts.
Create a separate collection to handle id sequences:
db.seqs.insert_one({
    'collection': 'admin_collection',
    'id': 0
})
Whenever you need to insert a new document, use something similar to this:
from pymongo import ReturnDocument
from pymongo.errors import DuplicateKeyError

def insert_doc(doc):
    # Atomically reserve the next id; find_one_and_update is the modern
    # replacement for the removed find_and_modify helper.
    doc['_id'] = str(db.seqs.find_one_and_update(
        {'collection': 'admin_collection'},
        {'$inc': {'id': 1}},
        projection={'id': 1, '_id': 0},
        return_document=ReturnDocument.AFTER,
    ).get('id'))
    try:
        db.admin_collection.insert_one(doc)
    except DuplicateKeyError:
        insert_doc(doc)
If you want to manually change the "_id" value, you can do so by changing the _id value in the returned document, in a similar manner to the one you proposed in your question. However, I do not think this approach is advisable.
curs = db.admin_collection.find().sort([("_id", -1)]).limit(1)
for document in curs:
    document['_id'] = str(int(document['_id']) + 1)
    collection.insert_one(document)
It is generally not a good idea to manually make your own ids. These values have to be unique, and there is no guarantee that str(int(document['_id']) + 1) will always be unique.
Instead, if you want to duplicate the document, you can delete the '_id' key and insert the document:
curs = db.admin_collection.find().sort([("_id", -1)]).limit(1)
for document in curs:
    document.pop('_id', None)
    collection.insert_one(document)
This inserts the document and allows mongo to generate the unique id.
Way late to this, but what about leaving the ObjectId alone and still adding a sequential id to use as a reference for calling the particular document (or props thereof) from a frontend API? I've been struggling to get the frontend to drop the "" on the ObjectId when fetching from the API.