PyMongo cursor iteration (Python)

I've recently started testing MongoDB via shell and via PyMongo. I've noticed that returning a cursor and trying to iterate over it seems to bottleneck in the actual iteration. Is there a way to return more than one document during iteration?
Pseudo code:
for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for entry in cursor:
        # deal with a single entry each time
What I'm hoping to do is something like this:
for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    for all_entries in cursor:
        # deal with all entries at once rather than iterating each time
I've tried using batch_size() as per this question and changing the value all the way up to 1000000, but it doesn't seem to have any effect (or I'm doing it wrong).
Any help is greatly appreciated. Please be easy on this Mongo newbie!
--- EDIT ---
Thank you Caleb. I think you've pointed out what I was really trying to ask, which is this: is there any way to do a sort-of collection.findAll() or maybe cursor.fetchAll() command, as there is with the cx_Oracle module? The problem isn't storing the data, but retrieving it from MongoDB as fast as possible.
As far as I can tell, the speed at which the data is returned to me is dictated by my network since Mongo has to single-fetch each record, correct?

Have you considered an approach like:
for line in file:
    value = line[a:b]
    cursor = collection.find({"field": value})
    entries = cursor[:]  # or pull them out with a loop or comprehension -- just get all the docs
    # then process entries as a list, either singly or in batch
Alternately, something like:
# same loop start as above
    entries[value] = cursor[:]

# after the loop, all the cursors are out of scope and closed
for value in entries:
    # process entries[value], either singly or in batch
Basically, as long as you have RAM enough to store your result sets, you should be able to pull them off the cursors and hold onto them before processing. This isn't likely to be significantly faster, but it will mitigate any slowdown specifically of the cursors, and free you to process your data in parallel if you're set up for that.

You could also try:
results = list(collection.find({'field':value}))
That should load everything right into RAM.
Or this perhaps, if your file is not too huge:
values = list()
for line in file:
    values.append(line[a:b])

results = list(collection.find({'field': {'$in': values}}))

toArray() might be a solution.
Based on the docs, it first iterates over the whole cursor on the Mongo server and only returns the results once, as an array.
http://docs.mongodb.org/manual/reference/method/cursor.toArray/
This is unlike list(coll.find()) or [doc for doc in coll.find()], which fetch one document into Python at a time and go back to Mongo to fetch the next batch.
However, this method is not implemented in PyMongo... strange.
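For what it's worth, a rough PyMongo-side workaround (a sketch assuming the result set fits in RAM, not a true toArray() equivalent) is to exhaust the cursor with list() and raise the batch size so fewer round trips are needed:
cursor = collection.find({"field": value}).batch_size(1000)  # 1000 is an arbitrary tuning choice
docs = list(cursor)  # exhausts the cursor; all documents are now client-side
for doc in docs:
    pass  # do the per-document work here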

As mentioned above by @jmelesky, I always follow the same kind of method. Here is my sample code. To store my cursor twts_result, I declare a list to copy the documents into. Make use of RAM if you can to store the data. This also solves the cursor-timeout problem, provided no processing or updates are needed on the collection you fetched the data from.
Here I am fetching tweets from a collection.
twts_result = maindb.economy_geolocation.find({}, {'_id': False})
print "Tweets for processing -> %d" % (twts_result.count())

tweets_sentiment = []
batch_tweets = []

# Copy the cursor data into a list
tweets_collection = list(twts_result[:])

for twt in tweets_collection:
    # do stuff here with the twt data


Inserting rows while looping over result set

I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values, and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

cursor = con.cursor(prepared=True)

user_map = {}

cursor.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', cursor.column_names)

for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid

cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems if I do not fetch the full result set then MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
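To make that concrete, here is a minimal sketch of the two-connection variant of the code from the question (the table layout and placeholder values come from the question; the connection arguments and the single commit at the end are assumptions):
import mysql.connector
from collections import namedtuple

read_con = mysql.connector.connect(host='127.0.0.1')   # connection for reading
write_con = mysql.connector.connect(host='127.0.0.1')  # connection for writing

selector = read_con.cursor(prepared=True)
insertor = write_con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56, ))
Row = namedtuple('users', selector.column_names)

for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

write_con.commit()  # flush the inserts on the writing connection
selector.close()
insertor.close()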

Adding REST resources to database. Struggling to loop through enumerated list and commit changes.

I'm trying to extract data from an API that gives me data back in JSON format. I'm using SQLalchemy and simplejson in a python script to achieve this. The database is PostgreSQL.
I have a class called Harvest; it specifies the details for the table harvest.
Here is the code I suspect is incorrect.
def process(self):
    req = urllib2.Request('https://api.fulcrumapp.com/api/v2/records/', headers={"X-ApiToken": "****************************"})
    resp = urllib2.urlopen(req)
    data = simplejson.load(resp)
    for i, m in enumerate(data['harvest']):
        harvest = Harvest(m)
        self.session.add(harvest)
        self.session.commit()
Is there something wrong with this loop? Nothing is going through to the database.
I suspect that if there is anything wrong with the loop, it is that the loop is getting skipped. One thing you can do to verify this is:
ALTER USER application_user SET log_statement = 'all';
Then the statements will show up in your logs. When you are done:
ALTER USER application_user RESET log_statement;
This being said, one thing I see in your code that may cause trouble later is that you are committing once per record. This will cause extra disk I/O. You probably want to commit after the loop.
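For illustration, a sketch of the same loop with the commit moved outside it (names taken from the code in the question):
for i, m in enumerate(data['harvest']):
    harvest = Harvest(m)
    self.session.add(harvest)
self.session.commit()  # one commit after the loop instead of one per record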

How do I tell if the returned cursor is the last cursor in App Engine

I apologize if I am missing something really obvious.
I'm making successive calls to App Engine using cursors. How do I tell if I'm on the last cursor? The way I'm doing it now is to save the last cursor and then test whether that cursor equals the currently returned cursor. This requires an extra call to the datastore, which is probably unnecessary, though.
Is there a better way to do this?
Thanks!
I don't think there's a way to do this with ext.db in a single datastore call, but with ndb it is possible. Example:
query = Person.query(Person.name == 'Guido')
result, cursor, more = query.fetch_page(10)
If using the returned cursor will result in more records, more will be True. This is done smartly, in a single RPC call.
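For illustration, a paging loop built on fetch_page() could look like the sketch below (the page size of 10 and the process() helper are placeholders, not part of the original answer):
results, cursor, more = query.fetch_page(10)
while True:
    for person in results:
        process(person)  # placeholder for the real per-entity work
    if not more:
        break
    results, cursor, more = query.fetch_page(10, start_cursor=cursor)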
Since you say 'last cursor' I assume you are using cursors for some kind of pagination, which implies you will be fetching results in batches with a limit.
In this case, you know you are on the last cursor when you get fewer results back than your limit.
limit = 100
results = Entity.all().with_cursor('x').fetch(limit)
if len(results) < limit:
    # then there's no point trying to fetch another batch after this one
If you mean "has this cursor hit the end of the search results", then no, not without picking the cursor up and trying it again. If more entities are added that match the original search criteria, such that they logically land "after" the cursor (e.g., a query that sorts by an ascending timestamp), then reusing that saved cursor will let you retrieve those new entities.
I use the same technique Chris Familoe describes, but set the limit one more than I wish to return. So, in Chris's example, I would fetch 101 entities; 101 returned means there is another page with at least one entity on it.
recs = db_query.fetch(limit + 1, offset)

# if fewer records were returned than requested, we've reached the end
if len(recs) < limit + 1:
    lastpage = True
    entries = recs
else:
    lastpage = False
    entries = recs[:-1]
I know this post is kind of old but I was looking for a solution to the same problem. I found it in this excellent book:
http://shop.oreilly.com/product/0636920017547.do
Here is the tip:
results = query.fetch(RESULTS_FOR_PAGE)
new_cursor = query.cursor()
query.with_cursor(new_cursor)
has_more_results = query.count(1) == 1

Speed up calling lots of entities and getting unique values, Google App Engine Python

OK, this is a two-part question. I've seen and searched for several methods to get a list of unique values for a class and haven't been particularly happy with any of them so far.
So, does anyone have simple example code for getting unique values, for instance for this model? Here is my super-slow example.
class LinkRating2(db.Model):
    user = db.StringProperty()
    link = db.StringProperty()
    rating2 = db.FloatProperty()

def uniqueLinkGet(tabl):
    start = time.time()
    dic = {}
    query = tabl.all()
    for obj in query:
        dic[obj.link] = 1
    end = time.time()
    print end - start
    return dic
My second question: is iterating over a query, for instance, slower than calling fetch()? Is there a faster way to do what the code below does, especially if the number of elements involved can be larger than 1,000?
query = LinkRating2.all()
link1 = 'some random string'
a = query.filter('link = ', link1)
adic = {}
for itema in a:
    adic[itema.user] = itema.rating2
1) One trick to make this query fast is to denormalize your data. Specifically, create another model which simply stores a link as the key. Then you can get a list of unique links by simply reading everything in that table. Assuming that you have many LinkRating2 entities for each link, then this will save you a lot of time. Example:
class Link(db.Model):
    pass  # the only data in this model will be stored in its key

# Whenever a link is added, you can try to add it to the datastore. If it already
# exists, then this is functionally a no-op - it will just overwrite the old copy of
# the same link. Using link as the key_name ensures there will be no duplicates.
Link(key_name=link).put()

# Get all the unique links by simply retrieving all of its entities and extracting
# the link field. You'll need to use cursors if you have >1,000 entities.
unique_links = [x.key().name() for x in Link.all().fetch(1000)]
Another idea: If you need to do this query frequently, then keep a copy of the results in memcache so you don't have to read all of this data from the datastore all the time. A single memcache entry can only store 1MB of data, so you may have to split your links data into chunks to store it in memcache.
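A sketch of that caching idea, assuming the App Engine Python memcache API (the key name and ten-minute expiry are arbitrary choices):
from google.appengine.api import memcache

def get_unique_links():
    # Try the cache first; fall back to the datastore and repopulate the cache.
    links = memcache.get('unique_links')
    if links is None:
        links = [x.key().name() for x in Link.all().fetch(1000)]
        memcache.set('unique_links', links, time=600)  # cache for ten minutes
    return links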
2) It is faster to use fetch() instead of using the iterator. The iterator causes entities to be fetched in "small batches" - each "small batch" results in a round-trip to the datastore to get more data. If you use fetch(), then you'll get all the data at once with just one round-trip to the datastore. In short, use fetch() if you know you are going to need lots of results.
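To make the contrast concrete, a sketch using the db API from the question (the 1,000 limit mirrors the old fetch() cap):
query = LinkRating2.all().filter('link = ', link1)

# Iterator form: entities arrive in small batches, one datastore round trip per batch.
for itema in query:
    adic[itema.user] = itema.rating2

# fetch() form: up to 1,000 entities in a single round trip.
for itema in query.fetch(1000):
    adic[itema.user] = itema.rating2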

SQLAlchemy: Scan huge tables using ORM?

I am currently playing around with SQLAlchemy a bit, which is really quite neat.
For testing I created a huge table containing my picture archive, indexed by SHA1 hashes (to remove duplicates :-)). That was impressively fast...
For fun I did the equivalent of a select * over the resulting SQLite database:
session = Session()
for p in session.query(Picture):
    print(p)
I expected to see hashes scrolling by, but instead it just kept scanning the disk. At the same time, memory usage was skyrocketing, reaching 1GB after a few seconds. This seems to come from the identity map feature of SQLAlchemy, which I thought was only keeping weak references.
Can somebody explain this to me? I thought that each Picture p would be collected after the hash is written out!?
Okay, I just found a way to do this myself. Changing the code to
session = Session()
for p in session.query(Picture).yield_per(5):
    print(p)
loads only 5 pictures at a time. It seems the query loads all rows at once by default. However, I don't yet understand the disclaimer on that method. Quote from the SQLAlchemy docs:
WARNING: use this method with caution; if the same instance is present in more than one batch of rows, end-user changes to attributes will be overwritten.
In particular, it’s usually impossible to use this setting with eagerly loaded collections (i.e. any lazy=False) since those collections will be cleared for a new load when encountered in a subsequent result batch.
So if using yield_per is actually the right way (tm) to scan over copious amounts of SQL data while using the ORM, when is it safe to use it?
Here's what I usually do in this situation:
def page_query(q):
    offset = 0
    while True:
        r = False
        for elem in q.limit(1000).offset(offset):
            r = True
            yield elem
        offset += 1000
        if not r:
            break

for item in page_query(Session.query(Picture)):
    print item
This avoids the various buffering that DBAPIs do as well (such as psycopg2 and MySQLdb). It still needs to be used appropriately if your query has explicit JOINs, although eagerly loaded collections are guaranteed to load fully since they are applied to a subquery which has the actual LIMIT/OFFSET supplied.
I have noticed that Postgresql takes almost as long to return the last 100 rows of a large result set as it does to return the entire result (minus the actual row-fetching overhead) since OFFSET just does a simple scan of the whole thing.
You can defer the picture column so that it is only retrieved on access. You can do it on a query-by-query basis, like this:
session = Session()
for p in session.query(Picture).options(sqlalchemy.orm.defer("picture")):
    print(p)
Or you can do it in the mapper:
mapper(Picture, pictures, properties={
    'picture': deferred(pictures.c.picture)
})
How to do it is described in the documentation here.
Doing it either way will make sure that the picture is only loaded when you access the attribute.
