I have several Mongo databases (some populated with collections and documents, some empty) and I am trying to parse through them and create a graph of their contents. I am planning on making nodes for each db, each collection, and each key in the collection, and then an edge from each key to its value (so skipping the pages). Here is my code for building the graph.
for db in dbs:
    G.add_node(db)
    for col in c[db].collection_names():
        G.add_node(col)
        G.add_edge(db, col, weight=0.9)
        for page in c[db][col].find():
            if (u'_id' in page.viewvalues()):
                pprint.pprint(page)
                G.add_node(page[u'_id'])
                G.add_edge(col, page[u'_id'], weight=0.4)
                for key, value in page.items():
                    G.add_node(key)
                    G.add_edge(col, key, weight=0.1)
                    G.add_node(value)
                    G.add_edge(key, value)
My problem is that I never pass the if statement if (u'_id' in page.viewvalues()):. I know I am getting pages (if I print the pages before the if statement, a few thousand are printed), but the if statement is always false. What have I done wrong in accessing the dictionary returned from the find() query? Thanks.
EDIT:
I should probably also mention that when I do something like this
for i in page:
instead of the if statement, it works for a bit and then breaks with TypeError: unhashable type: 'dict', and I figured this happened when it hit an empty page or when find() returned no pages.
This works for me:
import pymongo

c = pymongo.Connection()
dbs = c.database_names()
for db in dbs:
    for col in c[db].collection_names():
        for page in c[db][col].find():
            if '_id' in page:
                for key, value in page.iteritems():
                    print key, value
Iterating over a pymongo cursor (which is what find() returns) always yields dictionaries, so you can just check whether the dictionary has an _id key. Your original check never passes because page.viewvalues() is a view of the values, while the in test you want is against the keys: '_id' in page.
By the way, you can restrict which fields appear in the results by passing the fields argument to find().
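For example, a minimal sketch restricting the result to a couple of fields ('name' here is just an illustrative field name; in recent PyMongo versions the argument is called projection rather than fields):

# Only _id and name come back; every other field is omitted.
for page in c[db][col].find({}, {'_id': 1, 'name': 1}):
    print page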
Problem: I want to pick a field of an index in Elasticsearch and look up all the values against it. That is, if I give a key, I should get the value for that key, and if that key occurs more than once, I want each of its values. Even getting just one of the values would work for me.
How I am trying to work through it: query Elasticsearch. I am querying my data from Elasticsearch like this:
r = es.search(index="test", body={'query': {'wildcard': {'task_name': "*"}}})
I thought to load the data into a Python object (a dictionary) to read the key values. However, when I try json.loads(r.json) it gives me an error: AttributeError: 'dict' object has no attribute 'json'. I even tried json.load(r), but the error remains the same.
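The AttributeError is actually the hint: the Elasticsearch Python client returns the search response already deserialized into a dict, so there is nothing to json.loads(). A minimal sketch of reading key/values from it, assuming the standard response layout with matches under hits.hits:

# r is already a dict; no json parsing needed.
for hit in r['hits']['hits']:
    doc = hit['_source']            # the indexed document as a dict
    print(doc.get('task_name'))     # read whichever key you are after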
I am trying to prevent inserting duplicate documents by the following approach:
1. Get a list of all documents from the desired endpoint; it will contain all the documents in JSON format. This list is called available_docs.
2. Use a pre_POST_<endpoint> hook in order to handle the request before inserting into the database. (I am not using the on_insert hook since I need to do this before validation.)
3. Since we can access the request object, use request.json to get the payload in JSON format.
4. Check if request.json is already contained in available_docs.
5. Insert the new document only if it is not a duplicate; abort otherwise.
Using this approach I got the following snippet:
def check_duplicate(request):
    if request.json not in available_docs:
        print('Not a duplicate')
    else:
        print('Duplicate')
        flask.abort(422, description='Document is a duplicate and already in database.')
The available_docs list looks like this:
available_docs = [{'foo': ObjectId('565e12c58b724d7884cd02bb'), 'bar': [ObjectId('565e12c58b724d7884cd02b9'), ObjectId('565e12c58b724d7884cd02ba')]}]
The payload request.json looks like this:
{'foo': '565e12c58b724d7884cd02bb', 'bar': ['565e12c58b724d7884cd02b9', '565e12c58b724d7884cd02ba']}
As you can see, the only difference between the document passed to the API and the document already stored in the DB is the datatype of the IDs. Because of that, the if statement in the snippet above evaluates to True and treats the document to be inserted as not being a duplicate, although it definitely is one.
Is there a way to check whether a passed document is already in the database? I cannot use unique fields, since only the combination of all document fields needs to be unique. There is a unique identifier (which I left out in this example), but it is not suitable for the desired comparison since it is essentially a timestamp.
I think something like casting the given IDs at the keys foo and bar to ObjectIds would do the trick, but I do not know how to do this since I do not know where to get the ObjectId datatype from.
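For what it's worth, ObjectId lives in the bson package that ships with PyMongo. A minimal sketch of casting the payload's IDs before the comparison (field names foo and bar taken from the example above):

from bson.objectid import ObjectId

def normalize(payload):
    # Cast the string ids to ObjectIds so they compare equal
    # to the values stored in the database.
    return {
        'foo': ObjectId(payload['foo']),
        'bar': [ObjectId(i) for i in payload['bar']],
    }

# normalize(request.json) in available_docs would now detect the duplicate.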
Your approach would be much slower than setting a unique rule for the field.
Since, from your example, you are going to compare objectids, can't you simply use those as the _id field for the collection? In Mongo (and Eve of course) that field is unique by default. Actually, you typically don't even define it. You would not need to do anything at all, as a POST of a document with an already existing id would fail right away.
If you can't go that way (maybe you need to compare a different objectid field and still, for some reason, can't simply set a unique rule for it), I would look at querying the db for the field value instead of getting all the documents from the db and scanning them sequentially in code, something like db.find({db_field: new_document_field_value}). If that returns a document, the new one is a duplicate. Make sure db_field is indexed (which is usually the case anyway for fields tagged with the unique rule).
EDIT after the comments. A trivial implementation would probably be something like this:
import flask
from eve import Eve

def pre_POST_callback(resource, request):
    # retrieve the mongodb collection using the eve connection
    docs = app.data.driver.db['docs']
    if docs.find_one({'foo': <value>}):  # <value> left as a placeholder
        flask.abort(422, description='Document is a duplicate and already in database.')

app = Eve()
app.on_pre_POST += pre_POST_callback  # register the hook, or it never fires
app.run()
Here's my approach on preventing duplicate records:
from flask import abort
from bson.objectid import ObjectId

def on_insert_subscription(items):
    c_subscription = app.data.driver.db['subscription']
    user = decode_token()  # the poster's JWT helper, see note below
    if user:
        for item in items:
            if c_subscription.find_one({
                'topic': ObjectId(item['topic']),
                'client': ObjectId(user['user_id'])
            }):
                abort(422, description="Client already subscribed to this topic")
            else:
                item['client'] = ObjectId(user['user_id'])
    else:
        abort(401, description='Please provide proper credentials')
What I'm doing here is creating subscriptions for clients. If a client is already subscribed to a topic, I return a 422.
Note: the client ID is decoded from the JWT token.
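One detail worth noting: in Eve a hook named on_insert_subscription does not run by itself; it has to be attached to the app's insert event for the subscription resource. A minimal sketch:

app = Eve()
# Wire the callback to the insert event of the "subscription" resource.
app.on_insert_subscription += on_insert_subscription
app.run()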
I have created several documents and inserted them into MongoDB. I am using Python for this. Is there a way I can get the _id of a particular record?
I know that we get the _id during insertion, but if I need to use it at a later point, is there a way I can get it by, say, using the find() command?
You can use projection to get a particular field from the document, like this; it will return only the _id:
db.collection.find({query}, {_id: 1})
http://docs.mongodb.org/manual/reference/method/db.collection.find/
In Python, you can use the find_one() method to get the document and access its _id property as follows:
def get_id(value):
    # Get the document matching the record; find_one() returns None
    # when nothing matches, so guard before indexing into it.
    document = client.db.collection.find_one({'field': value}, {'_id': True})
    return document['_id'] if document else None
When you insert the record, you can specify the _id value yourself. This has added benefits when you're using replica sets as well.
from bson.objectid import ObjectId  # was pymongo.objectid in very old PyMongo releases

client.db.collection.insert({'_id': ObjectId(), 'key1': value, ...})
You could store those ids in a list and use them later if you need the _id immediately after the insert.
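A minimal sketch of that idea, using the modern PyMongo API (insert_one exposes the generated id on its result) and the same illustrative db/collection names as above:

from pymongo import MongoClient

client = MongoClient()
collection = client.db.collection            # illustrative names, as above

inserted_ids = []
for doc in [{'key1': 'a'}, {'key1': 'b'}]:   # example payloads
    result = collection.insert_one(doc)
    inserted_ids.append(result.inserted_id)  # remember the _id for later

# Later, fetch a document back by its remembered _id.
page = collection.find_one({'_id': inserted_ids[0]})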
I'm working on parsing a file and inserting it into a database, using sqlalchemy core. I had it set up with the orm originally but that doesn't meet the speed requirements for the project.
My database has 2 tables: Objects and Attributes. The Objects table has a primary key of obj_id. The primary key for Attributes is composite: attr_name, attr_class, and obj_id, which is also a foreign key from Objects.
The attributes are stored, after parsing the file, in a list of dictionaries, like so:
[
    {'obj_id': obj_id, 'attr_name': name, 'attr_class': attr_class, ...},
    ...
]
The data is being inserted by first bulk inserting the objects, then the attributes. The object insert works perfectly. When inserting the attributes however, I get an integrity error, saying I tried to insert a duplicate primary key.
Here is my insert code for attributes:
self.engine.execute(
    Attributes.__table__.insert(),
    [{'obj_id': attr['obj_id'],
      'attr_name': attr['attr_name'],
      'attr_class': attr['attr_class'],
      'attr_type': attr['attr_type'],
      'attr_size': attr['attr_size']} for attr in attrList])
While trying to work this error out, I printed the id, name, and class of each attribute in the list to a file to find the duplicate key. Nowhere in the list is there actually an identical primary key, so this leads me to believe it is a problem with the structure of my query.
Can anyone figure this out with the info I've given, or give me somewhere to look for more information? I've already checked the documentation pretty thoroughly and couldn't find anything helpful.
Edit:
I also tried executing each insert statement separately, as suggested by someone on sqlalchemy's google group. The results were the same. The code I used:
insert = Attributes.__table__.insert()
for attr in attrList:
    stmt = insert.values({'obj_id': attr['obj_id'], ...})
    self.engine.execute(stmt)
where ... was the rest of the values.
Edit 2:
The Integrity error is thrown as soon as I try to insert an attribute with the same name/class but a different object id. So for example:
In the format name-class-id, by iteration 4 I've got:
    Attr1-Class1-0
    Attr2-Class2-0
    Attr3-Class3-0
    Attr4-Class4-0
On the next iteration, I try to insert Attr1-Class1-1, which fails.
I found the problem, and it was completely unrelated to the insert code. When storing the data in the list, I was storing an Object instance as obj_id rather than the plain id value, which sqlalchemy didn't like. By fixing that I fixed the insertions.
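A minimal sketch of that kind of fix (the variable names obj, name, and attr_class are illustrative, not from the question):

# Wrong: stores the mapped Objects instance itself under obj_id;
# SQLAlchemy Core cannot bind an ORM object as a column parameter.
attrList.append({'obj_id': obj, 'attr_name': name, 'attr_class': attr_class})

# Right: store the scalar primary-key value instead.
attrList.append({'obj_id': obj.obj_id, 'attr_name': name, 'attr_class': attr_class})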
I am trying to sort a collection called user_score using the key position and get the very first document of the result. In this case the collection user_score doesn't exist, and I was hoping to get None as the result, but I was getting a cursor back.
1.
result = db.user_score.find({'score': {'$lt': score}}).sort("position,pymongo.DESCENDING").limit(1)
Now I changed my query as below and, as expected, did not get anything back.
2.
result = db.user_score.find_one({'score': {'$lt': score}}, sort=[("position", pymongo.DESCENDING)])
What's the problem with my first query?
Thanks
A little late in my response, but it appears that the current version of PyMongo does support a sort operation on a find_one call.
From the documentation page here:
All arguments to find() are also valid arguments for find_one(), although any limit argument will be ignored. Returns a single document, or None if no matching document is found.
Example usage is as follows:
filterdict = {'email': 'this.is@me.com'}
collection.find_one(filterdict, sort=[('lastseen', 1)])
Hope this helps more recent searchers!
In your first query, in the sort function, you're passing one argument ("position,pymongo.DESCENDING") when you should be passing two arguments: "position" and pymongo.DESCENDING.
Be sure to mind your quotation marks.
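In other words, the corrected first query would look something like this (same filter as in the question):

result = db.user_score.find({'score': {'$lt': score}}) \
                      .sort("position", pymongo.DESCENDING) \
                      .limit(1)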
This is the default MongoDB behavior on find. Whenever you use find you get a list of results (in this case an iterable cursor). Only findOne, or its PyMongo equivalent find_one, will return None if the query has no matches.
Use list() to materialize the cursor, then take the first element to get a dict:
list(db.user_score.find({'score': {'$lt': score}}).sort("position", pymongo.DESCENDING).limit(1))[0]
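Note that indexing with [0] raises an IndexError when there is no match; a small variant that yields None instead, which is what the question was after, is to pull from the cursor with next():

cursor = db.user_score.find({'score': {'$lt': score}}) \
                      .sort("position", pymongo.DESCENDING) \
                      .limit(1)
result = next(cursor, None)  # first document, or None if nothing matched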