Slow MongoDB/pymongo query - python

I am submitting a pretty simple query to MongoDB (version 2.6) using the pymongo library for python:
query = {"type": "prime"}
logging.info("Querying the DB")
docs = usaspending.get_records_from_db(query)
logging.info("Done querying. Sorting the results")
docs.sort("timestamp", pymongo.ASCENDING)
logging.info("Done sorting the results, getting count")
count = docs.count(True)
logging.info("Done counting: %s records", count)
pprint(docs[0])
raise Exception("End the script right here")
The get_records_from_db() function is quite simple:
def get_records_from_db(query=None):
    return db.raws.find(query, batch_size=50)
Note that I will actually need to work with all the documents, not just docs[0]. I am just trying to get docs[0] as an example.
When I run this query the output I get is:
2015-01-28 10:11:05,945 Querying the DB
2015-01-28 10:11:05,946 Done querying. Sorting the results
2015-01-28 10:11:05,946 Done sorting the results, getting count
2015-01-28 10:11:06,617 Done counting: 559952 records
However, I never get back docs[0]. I have indexes on {"timestamp": 1} and {"type": 1}, and queries seem to work reasonably well (the count is returned quite fast), but I am not sure why I never get the actual document back (the docs are quite small [under 50K]).

PyMongo does no actual work on the server when you execute these lines:
query = {"type": "prime"}
docs = usaspending.get_records_from_db(query)
docs.sort("timestamp", pymongo.ASCENDING)
At this point "docs" is just a PyMongo Cursor, but it has not executed the query on the server. If you run "count" on the Cursor, then PyMongo does a "count" command on the server and returns the result, but the Cursor itself still hasn't been executed.
However, when you run this:
docs[0]
Then in order to get the first result, PyMongo runs the query on the server. The query is filtered on "type" and sorted by "timestamp", so try this on the mongo shell to see what's wrong with the query:
> db.collection.find({type: "prime"}).sort({timestamp: 1}).limit(1).explain()
If you see a very large "nscanned" or "nscannedObjects", that's the problem. You probably need a compound index on type and timestamp (order matters):
> db.collection.createIndex({type: 1, timestamp: 1})
See my article on compound indexes.
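If you'd rather create that index from Python, a rough PyMongo sketch (the database name below is only a guess based on the usaspending module in the question; the raws collection name comes from get_records_from_db()):
import pymongo

client = pymongo.MongoClient()  # assumes a locally reachable mongod; adjust the URI as needed
db = client["usaspending"]      # hypothetical database name
db.raws.create_index([("type", pymongo.ASCENDING),
                      ("timestamp", pymongo.ASCENDING)])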

The reason you never get back the actual documents is that Mongo batches these chained calls together into one query, so it ends up looking at it this way:
find the records, then sort the records, and then count the records.
You need to build two totally separate queries:
find the records, sort the records, then give me the records
count the records
If you chain them, Mongo will chain them and treat them as one command.
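In PyMongo terms, that separation might look roughly like this (a sketch reusing the collection and field names from the question; Cursor.count() as available in PyMongo 2.x/3.x):
query = {"type": "prime"}

# Query 1: the documents, filtered and sorted; the query only runs once you read from the cursor
docs = db.raws.find(query).sort("timestamp", pymongo.ASCENDING)
first_doc = docs[0]

# Query 2: an independent count
count = db.raws.find(query).count()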

Related

Finding document containing array of nested names in pymongo (CrossRef data)

I have a dataset of CrossRef works records stored in a collection called works in MongoDB and I am using a Python application to query this database.
I am trying to find documents based on one author's name. Removing extraneous details, a document might look like this:
{'DOI': 'some-doi',
 'author': [{'given': 'Albert', 'family': 'Einstein', 'affiliation': []},
            {'given': 'R.', 'family': 'Feynman', 'affiliation': []},
            {'given': 'Paul', 'family': 'Dirac', 'affiliation': ['University of Florida']}]
}
It isn't clear to me how to combine the queries to get just Albert Einstein's papers.
I have indexes on author.family and author.given, I've tried:
cur = works.find({'author.family':'Einstein','author.given':'Albert'})
This returns all of the documents by people called 'Albert' and all of those by people called 'Einstein'. I can filter this manually, but it's obviously less than ideal.
I also tried:
cur = works.find({'author':{'given':'Albert','family':'Einstein','affiliation':[]}})
But this returns nothing (after a very long delay). I've tried this with and without 'affiliation'. There are a few questions on SO about querying nested fields, but none seem to concern the case where we're looking for 2 specific things in 1 nested field.
Your issue is that author is a list.
You can use an aggregate query to unwind this list into separate documents, after which your match works as intended:
cur = works.aggregate([{'$unwind': '$author'},
                       {'$match': {'author.family': 'Einstein', 'author.given': 'Albert'}}])
Alternatively, use $elemMatch, which matches documents where at least one array element satisfies all of the specified criteria.
cur = works.find({"author": {'$elemMatch': {'family': 'Einstein', 'given': 'Albert'}}})
Also consider using multikey indexes.
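For example, a compound multikey index on the embedded author fields could be created from PyMongo roughly like this (a sketch assuming works is the collection object from the question):
import pymongo

# Indexing fields inside the author array automatically makes this a multikey index
works.create_index([("author.family", pymongo.ASCENDING),
                    ("author.given", pymongo.ASCENDING)])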

Firebase is only responding with a single document when I query it even though multiple meet the query criteria

I'm trying to write a cloud function that returns users near a specific location. get_nearby() returns a list of tuples containing upper and lower bounds for a geohash query, and then this loop should query firebase for users within those geohashes.
user_ref = db.collection(u'users')
db_response = []
for query_range in get_nearby(lat, long, radius):
    query = user_ref.where(u'geohash', u'>=', query_range[0]).where(u'geohash', u'<=', query_range[1]).get()
    for el in query:
        db_response.append(el.to_dict())
For some reason when I run this code, it returns only one document from my database, even though there are three other documents with the same geohash as that one. I know the documents are there, and they do get returned when I request the entire collection. What am I missing here?
edit:
The database currently has 4 records in it, 3 of which should be returned in this query:
{
    {name: "Trevor", geohash: "dnvtz"},  # this is the one that gets returned
    {name: "Test", geohash: "dnvtz"},
    {name: "Test", geohash: "dnvtz"}
}
query_range is a tuple with two values: a lower-bound and an upper-bound geohash. In this case, it's ("dnvt0", "dnvtz").
I decided to clear all documents from my database and then generate a new set of sample data to work with (everything there was only for testing anyway, nothing important). After pushing the new data to Firestore, everything is working. My only assumption is that even though the strings matched up, I'd used the wrong encoding on some of them.

psycopg2 each execute yields different table and cursor not scrolling to the beginning

I am querying a postgresql database through python's psycopg2 package.
In short: the problem is that psycopg2's fetchmany() yields a different table every time I run a psycopg2 cursor.execute() command.
import psycopg2 as ps

conn = ps.connect(database='database', user='user')
crs = conn.cursor()  # cursor creation (omitted in the original snippet)

nlines = 1000
tab = "some_table"
statement = """ SELECT * FROM """ + tab + " LIMIT %d;" % (nlines)
crs.execute(statement)
Then I fetch the data in pieces. Running the following works just fine, and each time I scroll back to the beginning I get the same results:
rec=crs.fetchmany(10)
crs.scroll(0, mode='absolute')
print rec[-1][-2]
However, if I run crs.execute(statement) again and then fetch the data, it yields completely different output. I tried running ps.connect again, conn.rollback(), conn.reset(), crs.close(), and nothing ever resulted in consistent output from the table. I also tried a named cursor with scrolling enabled:
crs= conn.cursor(name="cur1")
crs.scrollable=1
...
crs.scroll(0, mode= 'absolute')
still no luck.
You don't have any ORDER BY clause in your query, and Postgres does not guarantee any particular ordering without one. It's particularly likely to change ordering for tables which have lots of churn (i.e. lots of inserts/updates/deletes).
See the Postgres SELECT doc for more details, but the most salient snippet here is this:
If the ORDER BY clause is specified, the returned rows are sorted in the specified order. If ORDER BY is not given, the rows are returned in whatever order the system finds fastest to produce.
I wouldn't expect a query of this type, regardless of the kind of cursor used, to return rows in the same order every time.
What happens when you add an explicit ORDER BY?
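For example, ordering by the primary key should make repeated fetches consistent. A minimal sketch (assuming some_table has an id primary key column; crs and nlines as in the question):
statement = "SELECT * FROM some_table ORDER BY id LIMIT %s;"
crs.execute(statement, (nlines,))   # let psycopg2 bind the limit parameter
rec = crs.fetchmany(10)
crs.scroll(0, mode='absolute')      # scrolling back now revisits the same, stable ordering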

PYMONGO - How do I use the query $in operator with MongoIDs?

So I am trying to use the $in operator in Pymongo where I want to search with a bunch of MongoIDs.
First I have this query to find an array of MongoIDs:
findUsers = db.users.find_one({'_id':user_id},{'_id':0, 'f':1})
If I print the findUsers['f'] it looks like this:
[ObjectId('53b2dc0b24c4310292e6def5'), ObjectId('53b6dbb654a7820416a12767')]
These ObjectIds are user ids, and what I want to do is find all the users in the users collection whose _id is in this array. So my thought was this:
foundUsers = db.users.find({'_id':{'$in':findUsers['f']}})
However when I print the foundUsers the outcome is this:
<pymongo.cursor.Cursor object at 0x10d972c50>
which is not what I normally get when I print a query out :(
What am I doing wrong here?
Many thanks.
Also, just for your reference, I have queried in the mongo shell and it works as expected:
db.users.find({_id: {$in:[ObjectId('53b2dc0b24c4310292e6def5'), ObjectId('53b6dbb654a7820416a12767')]}})
You are encountering the difference between findOne() and find() in MongoDB. findOne() returns a single document; find() returns a MongoDB cursor, and normally you have to iterate over the cursor to see the results. The reason your code works in the mongo shell is that the shell treats cursors differently: if a cursor returns 20 documents or fewer, the shell handles iterating over it for you:
Cursors
In the mongo shell, the primary method for the read operation is the db.collection.find() method. This method queries a collection and returns a cursor to the returning documents.
To access the documents, you need to iterate the cursor. However, in the mongo shell, if the returned cursor is not assigned to a variable using the var keyword, then the cursor is automatically iterated up to 20 times [1] to print up to the first 20 documents in the results.
http://docs.mongodb.org/manual/core/cursors/
The pymongo manual page on iterating over cursors would probably be a good place to start:
http://api.mongodb.org/python/current/api/pymongo/cursor.html
but here's a piece of code that should illustrate the basics for you. After your call to find() run this:
for doc in foundUsers:
    print(doc)
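Alternatively, if you just want all the matching users in memory at once (fine for a small result set), a quick sketch: materialize the cursor into a list before iterating it, since a cursor can only be consumed once.
foundUsers = db.users.find({'_id': {'$in': findUsers['f']}})
users = list(foundUsers)  # pulls every matching document from the cursor
print(users)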

Memory-efficient way of fetching unique PostgreSQL dates?

I have a database with roughly 30 million entries, which is a lot, and I expect nothing but trouble working with large result sets.
But using py-postgresql and the .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'

db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
result = db.prepare("SELECT time FROM mytable")

uniqueue_days = []
with db.xact():
    for row in result():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering result() probably fetches all the results before looping through them?
Is there a way to get the postgresql library to "page" or batch the results, say 60k per round, or perhaps even rework the query to do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a named (server-side) cursor and either loop over it or call fetchone()/fetchmany() to pull rows progressively; a named cursor is backed by a server-side portal, so the whole result set is never held in client memory at once.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing it will take a few more changes. I still recommend switching.
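As a concrete sketch of the psycopg2 route (connection details copied from the question; itersize just controls how many rows each network round trip pulls, and the timestamp-to-%Y-%m-%d conversion follows the question's edit):
from datetime import datetime, timezone

import psycopg2

conn = psycopg2.connect(dbname='mydb', user='test', password='test',
                        host='192.168.1.1')
cur = conn.cursor(name='day_scan')   # a named cursor lives server-side
cur.itersize = 60000                 # rows fetched per network round trip
cur.execute("SELECT time FROM mytable")

unique_days = set()
for (ts,) in cur:                    # streams rows; never loads the whole table at once
    unique_days.add(datetime.fromtimestamp(float(ts), tz=timezone.utc).strftime('%Y-%m-%d'))

print(sorted(unique_days))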
You could let the database do all the heavy lifting.
For example: instead of reading all the data into Python and then computing the unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce a sort order on the returned unique dates, do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
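To run the DISTINCT query from Python with py-postgresql itself, a minimal sketch (reusing the db handle and the prepared-statement style from the question; since DISTINCT collapses the output to a small set of dates, materializing it client-side is cheap):
get_unique_dates = db.prepare(
    "SELECT DISTINCT DATE(to_timestamp(time)) AS unique_date FROM mytable ORDER BY 1"
)
with db.xact():
    unique_dates = [row['unique_date'] for row in get_unique_dates()]
print(unique_dates)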
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line. For example:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert dates into Unix timestamps.
