Finding document containing array of nested names in pymongo (CrossRef data) - python

I have a dataset of CrossRef works records stored in a collection called works in MongoDB and I am using a Python application to query this database.
I am trying to find documents based on one author's name. Removing extraneous details, a document might look like this:
{'DOI': 'some-doi',
 'author': [{'given': 'Albert', 'family': 'Einstein', 'affiliation': []},
            {'given': 'R.', 'family': 'Feynman', 'affiliation': []},
            {'given': 'Paul', 'family': 'Dirac', 'affiliation': ['University of Florida']}]}
It isn't clear to me how to combine the queries to get just Albert Einstein's papers.
I have indexes on author.family and author.given, I've tried:
cur = works.find({'author.family':'Einstein','author.given':'Albert'})
This returns every document that has some author with the given name 'Albert' and some author (not necessarily the same one) with the family name 'Einstein'. I can filter this manually, but it's obviously less than ideal.
I also tried:
cur = works.find({'author':{'given':'Albert','family':'Einstein','affiliation':[]}})
But this returns nothing (after a very long delay). I've tried this with and without 'affiliation'. There are a few questions on SO about querying nested fields, but none seem to concern the case where we're looking for 2 specific things in 1 nested field.

Your issue is that author is a list.
You can use an aggregate query to unwind this list to objects, and then your query would work:
cur = works.aggregate([{'$unwind': '$author'},
{'$match': {'author.family':'Einstein', 'author.given':'Albert'}}])
Alternatively, use $elemMatch, which matches documents whose array contains at least one element satisfying all the specified criteria:
cur = works.find({"author": {'$elemMatch': {'family': 'Einstein', 'given': 'Albert'}}})
Also consider using multikey indexes.
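Since author is an array, a compound index over both sub-fields is automatically treated as a multikey index and can support the $elemMatch query above. A minimal sketch, assuming the collection handle is called works; the 'crossref' database name is an assumption:

from pymongo import MongoClient, ASCENDING

works = MongoClient()['crossref']['works']  # database/collection names assumed

# MongoDB marks this compound index as multikey because 'author' is an array
works.create_index([('author.family', ASCENDING), ('author.given', ASCENDING)])

cur = works.find({'author': {'$elemMatch': {'family': 'Einstein', 'given': 'Albert'}}})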

Related

Pymongo BulkWriteResult doesn't contain upserted_ids

Okay, so currently I'm trying to upsert something in a local MongoDB using PyMongo (I check whether the document is in the db; if it is, update it, otherwise just insert it).
I'm using bulk_write to do that, and everything is working ok. The data is inserted/updated.
However, I need the ids of the newly inserted/updated documents, but the upserted_ids field in the BulkWriteResult object is empty, even though it states that it inserted 14 documents.
I've added a screenshot with the variable. Is this a bug, or is there something I'm not aware of?
Finally, is there a way of getting the ids of the documents without actually searching for them in the db? (If possible, I would prefer to use bulk_write)
Thank you for your time.
EDIT:
As suggested, I've added part of the code so it's easier to get the general idea:
for name in input_list:
    if name not in stored_names:  # completely new entry (both name and package)
        operations.append(InsertOne({"name": name, "package": [package_name]}))

if len(operations) == 0:
    print("## No new permissions to insert")
    return

bulkWriteResult = _db_insert_bulk(collection_name, operations)
and the insert function:
def _db_insert_bulk(collection_name, operations_list):
    return db[collection_name].bulk_write(operations_list)
The upserted_ids field in the pymongo BulkWriteResult only contains the ids of the records that have been inserted as part of an upsert operation, e.g. an UpdateOne or ReplaceOne with the upsert=True parameter set.
As you are performing InsertOne which doesn't have an upsert option, the upserted_ids list will be empty.
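If you do want upserted_ids populated, the operations have to be genuine upserts. A rough sketch, reusing the names from the question (the 'permissions' collection name and the $setOnInsert update are assumptions):

from pymongo import MongoClient, UpdateOne

db = MongoClient()['mydatabase']
operations = [
    UpdateOne({'name': name},
              {'$setOnInsert': {'name': name, 'package': [package_name]}},
              upsert=True)
    for name in input_list
]
result = db['permissions'].bulk_write(operations, ordered=False)
# upserted_ids maps each operation's index to the _id of the document it inserted
print(result.upserted_ids)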
The lack of an inserted_ids field in pymongo's BulkWriteResult is an omission in the drivers; technically it conforms to the CRUD specification mentioned in D. SM's answer, as it is annotated with "Drivers may choose to not provide this property."
But ... there is an answer. If you are only doing inserts as part of your bulk update (and not mixed bulk operations), just use insert_many(). It is just as efficient as a bulk write and, crucially, does provide the inserted_ids value in the InsertManyResult object.
from pymongo import MongoClient
db = MongoClient()['mydatabase']
inserts = [{'foo': 'bar'}]
result = db.test.insert_many(inserts, ordered=False)
print(result.inserted_ids)
Prints:
[ObjectId('5fb92cafbe8be8a43bd1bde0')]
This functionality is part of the CRUD specification and should be implemented by compliant drivers, including pymongo. See the pymongo documentation for correct usage.
Example in Ruby:
irb(main):003:0> c.bulk_write([insert_one:{a:1}])
=> #<Mongo::BulkWrite::Result:0x00005579c42d7dd0 #results={"n_inserted"=>1, "n"=>1, "inserted_ids"=>[BSON::ObjectId('5fb7e4b12c97a60f255eb590')]}>
Your output shows that zero documents were upserted, therefore there wouldn't be any ids associated with the upserted documents.
Your code doesn't appear to show any upserts at all, which again means you won't see any upserted ids.

Firebase is only responding with a single document when I query it even though multiple meet the query criteria

I'm trying to write a cloud function that returns users near a specific location. get_nearby() returns a list of tuples containing upper and lower bounds for a geohash query, and then this loop should query firebase for users within those geohashes.
user_ref = db.collection(u'users')
db_response = []
for query_range in get_nearby(lat, long, radius):
    query = user_ref.where(u'geohash', u'>=', query_range[0]).where(u'geohash', u'<=', query_range[1]).get()
    for el in query:
        db_response.append(el.to_dict())
For some reason when I run this code, it returns only one document from my database, even though there are three other documents with the same geohash as that one. I know the documents are there, and they do get returned when I request the entire collection. What am I missing here?
edit:
The database currently has 4 records in it, 3 of which should be returned in this query:
[
    {name: "Trevor", geohash: "dnvtz"},  # this is the one that gets returned
    {name: "Test", geohash: "dnvtz"},
    {name: "Test", geohash: "dnvtz"}
]
query_range is a tuple with two values. A lower and upper bound geohash. In this case, it's ("dnvt0", "dnvtz").
I decided to clear all documents from my database and then generate a new set of sample data to work with (everything there was only for testing anyway, nothing important). After pushing the new data to Firestore, everything is working. My only assumption is that even though the strings matched up, I'd used the wrong encoding on some of them.

meaning of distinct and how to use it in python(pymongo)

I don't understand what distinct means or how to use it. I have searched for related answers, but it seems like distinct is somehow related to a list. I really appreciate the help.
list_of_stocks = db.stocks.distinct("symbol")
As the OP confirmed, this is a PyMongo call to a MongoDB database, which allows for a distinct find, in the form of:
db.collection_name.distinct("property_name")
This returns all distinct values for a given property in a collection.
Optionally, if you specify a document filter (effectively a find() filter) as a second parameter, your query will first be reduced by that filter, then the distinct will be applied. For example:
list_of_stocks = db.stocks.distinct("symbol", {"exchange": "NASDAQ"})
The distinct keyword is used in databases to return a record set containing only the distinct (unique) values for a particular column or field.
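A small illustration of what distinct collapses (the sample documents are made up):

from pymongo import MongoClient

db = MongoClient()['mydatabase']
db.stocks.insert_many([
    {"symbol": "AAPL", "exchange": "NASDAQ"},
    {"symbol": "AAPL", "exchange": "NASDAQ"},
    {"symbol": "IBM", "exchange": "NYSE"},
])

print(db.stocks.distinct("symbol"))                        # ['AAPL', 'IBM']
print(db.stocks.distinct("symbol", {"exchange": "NYSE"}))  # ['IBM']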

Multiple linked list with SQLAlchemy and MySQL

I want to have multiple linked lists in a SQL table, using MySQL and SQLAlchemy (0.7). Each list starts with a node whose parent is 0 and ends with a node whose child is 0. The id represents the list, not the individual element; the element is identified by the PK.
With some omitted syntax (not relevant to the problem) it should look something like this:
id(INT, PK)
content (TEXT)
parent(INT, FK(id), PK)
child(INT, FK(id), PK)
As the table contains multiple linked lists, how can I return an entire list from the database when I select a specific id where parent is 0?
For example:
SELECT * FROM ... WHERE id = 3 AND parent = 0
Given that you have multiple linked lists stored in the same table, I assume that you store either the HEAD and/or the TAIL of those in some other tables. A few ideas:
1) Keep the linked list:
The first big improvement (also proposed in the comments) from the data-querying perspective would be to have some common identifier (let's call it ListID) for all the nodes in the same list. Here there are a few options:
If each list is referenced only from one object (data row) [I would even phrase the question as "Does the list belong to a single object?"], then this ListID could simply be the (primary) identifier of the holder object, with a ForeignKey on top to ensure data integrity.
In this case, querying the whole list is very simple. In fact, you can define the relationship and navigate it like my_object.my_list_items.
If the list is used/referenced by multiple objects, then one could create another table which will consist only of one column ListID (PK), and each Node/Item will again have a ForeignKey to it, or something similar
Else, large lists can be loaded in two queries/SQL statements:
query the HEAD/TAIL by its ID
query the whole list based on received ListID of the HEAD/TAIL
In fact, this can be done with one query like the one below (Single-query example), which is more efficient from the IO perspective, but doing it in two steps has the advantage that you immediately have a reference to the HEAD (or TAIL) node.
Single-query example:
# single-query using join (not tested)
from sqlalchemy.orm import aliased
Head = aliased(Node)
qry = session.query(Node).join(Head, Node.ListID == Head.ListID).filter(Head.ID == head_node_id)
In any case, in order to traverse the linked list, you would have to get the HEAD/TAIL by its ID, then traverse as usual.
Note: here I am not certain whether SA would recognize that the referenced objects are already loaded into the session, or would issue additional SQL statements for each of them, which would defeat the purpose of bulk loading.
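A rough sketch of what option 1 could look like (the Node model, the ListID column and the in-memory traversal below are assumptions, not taken from your schema):

from sqlalchemy import Column, Integer, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Node(Base):
    __tablename__ = 'node'
    ID = Column(Integer, primary_key=True)
    content = Column(Text)
    parent = Column(Integer)               # 0 marks the head of a list
    child = Column(Integer)                # 0 marks the tail of a list
    ListID = Column(Integer, index=True)   # shared by all nodes of one list

def load_list(session, list_id):
    # one query for the whole list, then rebuild the order in memory
    nodes = session.query(Node).filter(Node.ListID == list_id).all()
    by_id = {n.ID: n for n in nodes}
    current = next(n for n in nodes if n.parent == 0)
    ordered = []
    while current is not None:
        ordered.append(current)
        current = by_id.get(current.child)  # child == 0 ends the traversal
    return ordered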
2) Replace the linked list with the Ordering List extension:
Please read the Ordering List documentation. It may well be that the Ordering List implementation is good enough for you to use instead of a linked list.
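For reference, a minimal sketch of the Ordering List approach (the model names are assumptions):

from sqlalchemy import Column, Integer, Text, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.orderinglist import ordering_list
from sqlalchemy.orm import relationship

Base = declarative_base()

class TodoList(Base):
    __tablename__ = 'todo_list'
    id = Column(Integer, primary_key=True)
    # 'position' is maintained automatically as items are appended/inserted/removed
    items = relationship('TodoItem', order_by='TodoItem.position',
                         collection_class=ordering_list('position'))

class TodoItem(Base):
    __tablename__ = 'todo_item'
    id = Column(Integer, primary_key=True)
    list_id = Column(Integer, ForeignKey('todo_list.id'))
    position = Column(Integer)
    content = Column(Text)

todo = TodoList()
todo.items.append(TodoItem(content='first'))
todo.items.insert(0, TodoItem(content='now first'))  # positions are renumbered for you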

Djapian - filtering results

I use Djapian to search for object by keywords, but I want to be able to filter results. It would be nice to use Django's QuerySet API for this, for example:
if query.strip():
    results = Model.indexer.search(query).prefetch()
else:
    results = Model.objects.all()
results = results.filter(somefield__lt=somevalue)
return results
But Djapian returns a ResultSet of Hit objects, not Model objects. I can of course filter the objects "by hand", in Python, but it's not realistic in case of filtering all objects (when query is empty) - I would have to retrieve the whole table from database.
Am I out of luck with using Djapian for this?
I went through its source and found that Djapian has a filter method that can be applied to its results. I have just tried the below code and it seems to be working.
My indexer is as follows:
class MarketIndexer(djapian.Indexer):
    fields = ['name', 'description', 'tags_string', 'state']
    tags = [('state', 'state')]
Here is how I filter results (never mind the first line that does stuff for wildcard usage):
objects = model.indexer.search(q_wc).flags(djapian.resultset.xapian.QueryParser.FLAG_WILDCARD).prefetch()
objects = objects.filter(state=1)
When executed, it now brings Markets that have their state equal to "1".
I don't know Djapian, but I am familiar with Xapian. In Xapian you can filter the results with a MatchDecider.
The decision function of the match decider gets called on every document which matches the search criteria so it's not a good idea to do a database query for every document here, but you can of course access the values of the document.
For example at ubuntuusers.de we have a xapian database which contains blog posts, forum posts, planet entries, wiki entries and so on and each document in the xapian database has some additional access information stored as value. After the query, an AuthMatchDecider filters the potential documents and returns the filtered MSet which are then displayed to the user.
If the decision procedure is as simple as somefield < somevalue, you could also simply add the value of somefield to the document's values (using the sortable_serialise function provided by Xapian) and attach a value-range query to the original query using OP_FILTER.
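Both variants in raw Xapian could look roughly like this (the value slot number, index path and cutoff are assumptions, and OP_VALUE_LE is used instead of OP_VALUE_RANGE since only an upper bound is needed):

import xapian

SOMEFIELD_SLOT = 0   # value slot where somefield was stored with sortable_serialise
somevalue = 10.0     # hypothetical cutoff

class SomeFieldDecider(xapian.MatchDecider):
    # accept only documents whose stored somefield value is below the cutoff
    def __init__(self, cutoff):
        xapian.MatchDecider.__init__(self)
        self.cutoff = cutoff

    def __call__(self, doc):
        return xapian.sortable_unserialise(doc.get_value(SOMEFIELD_SLOT)) < self.cutoff

db = xapian.Database('path/to/index')
enquire = xapian.Enquire(db)
query = xapian.QueryParser().parse_query('some keywords')

# Variant 1: MatchDecider, called for every candidate document
enquire.set_query(query)
mset = enquire.get_mset(0, 20, 0, None, SomeFieldDecider(somevalue))

# Variant 2: fold the restriction into the query with OP_FILTER + OP_VALUE_LE
value_filter = xapian.Query(xapian.Query.OP_VALUE_LE, SOMEFIELD_SLOT,
                            xapian.sortable_serialise(somevalue))
enquire.set_query(xapian.Query(xapian.Query.OP_FILTER, query, value_filter))
mset = enquire.get_mset(0, 20)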
