I have been working with the couchdb module in Python to meet some project needs. I was happily using the view method from couchdb to retrieve result sets from my database until recently.
for row in db.view(mapping_function):
    print row.key
However, lately I have been needing to work with much larger databases than before (~15-20 GB). This is when I ran into an unfortunate issue.
The db.view() method loads all rows into memory before you can do anything with them. This is not an issue with small databases, but it is a big problem with large ones.
That is when I came across the iterview function. It looks promising, but I couldn't find an example usage of it. Can someone share or point me to an example usage of the iterview function in python-couchdb?
Thanks - A
Doing this is almost working for me:
import couchdb.client

server = couchdb.client.Server()
db = server['db_name']

for row in db.iterview('my_view', 10, group=True):
    print row.key + ': ' + row.value
I say it almost works because it does return all of the data and all the rows are printed. However, at the end of the batch, it throws a KeyError exception inside couchdb/client.py (line 884) in iterview
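One workaround, assuming the KeyError really is only raised after the final batch as described above, is to catch it around the loop. This masks the library bug rather than fixing it:

rows = db.iterview('my_view', 10, group=True)
try:
    for row in rows:
        print row.key + ': ' + row.value
except KeyError:
    pass  # some python-couchdb versions raise this after the last batch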
This worked for me. You need to add include_docs=True to the iterview call, and then you will get a doc attribute on each row which can be passed to the database delete method:
import couchdb

server = couchdb.Server("http://127.0.0.1:5984")
db = server['your_db']  # the database name (the original snippet mistakenly used the view name here)

for row in db.iterview('your_view/your_view', 10, include_docs=True):
    # print(type(row))
    # print(type(row.doc))
    # print(dir(row))
    # print(row.id)
    # print(row.keys())
    db.delete(row.doc)
Related
I am trying to implement the pull model to query the change feed using the Azure Cosmos Python SDK. To parallelise the querying process, the official documentation mentions obtaining FeedRange values and creating a FeedIterator to iterate through each range of partition key values obtained from the FeedRange.
Currently my code snippet to query the change feed looks like this, and it is pretty straightforward:
# function to get items from the change feed based on a condition
def get_response(container_client, condition):
    # historical data read
    if condition:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=True,
            # partition_key_range_id=0
        )
    # reading from a checkpoint
    else:
        response = container_client.query_items_change_feed(
            is_start_from_beginning=False,
            continuation=last_continuation_token
        )
    return response
The problem with this approach is efficiency when getting all the items from the beginning (historical data read). I tried this method with a pretty small dataset of 500 items and the response took around 60 seconds. When dealing with millions or even billions of items, the response might take too long to return.
Would querying the change feed in parallel for each partition key range save time?
If yes, how do I get the PartitionKeyRangeId in the Python SDK?
Are there any problems I need to consider when implementing this?
I hope I make sense!
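For what it's worth, here is a hedged sketch of what per-feed-range parallelism could look like. It assumes a recent azure-cosmos version in which ContainerProxy exposes read_feed_ranges() and query_items_change_feed() accepts feed_range and start_time keywords; whether these combine exactly as shown depends on the SDK version, so treat this as a sketch, not a verified implementation:

from concurrent.futures import ThreadPoolExecutor

def read_feed_range(container, feed_range):
    # drain the change feed for a single feed range, from the beginning;
    # start_time="Beginning" is assumed to replace is_start_from_beginning
    # in SDK versions that support feed_range
    return list(container.query_items_change_feed(
        feed_range=feed_range,
        start_time="Beginning",
    ))

def read_change_feed_parallel(container):
    # one worker per feed range (roughly, per physical partition)
    feed_ranges = list(container.read_feed_ranges())
    with ThreadPoolExecutor(max_workers=len(feed_ranges)) as pool:
        batches = pool.map(lambda fr: read_feed_range(container, fr),
                           feed_ranges)
    return [item for batch in batches for item in batch]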
I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all the documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1,000 lines of data took 70 seconds, but the view has about 85,000 lines, so getting all the data this way would take far too long, especially since manually exporting everything to CSV via File->Export in Lotus Notes takes only about 2 minutes.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using GetFirstDocument and GetNextDocument. Using the NotesViewNavigator, as mentioned in the comments, will be slightly better, but not significantly.
You might get a little more performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
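In the Python/COM setup from the question, that tweak would look roughly like this (a minimal sketch, assuming noteslib's GetView hands back the underlying COM NotesView object):

import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
view.AutoUpdate = False  # skip view refreshes while we only read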
My suggestion: identify the REAL bottleneck of your code by commenting out individual sections to find out when it starts to get slower:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java, maybe). Even a basic LotusScript agent can export thousands of docs per minute. A further alternative may be to look at the Notes C API (not an easy option, and it requires API calls from Python).
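Whichever navigation method wins, the final CSV step is cheap by comparison. A minimal Python 2 sketch, assuming data is the list of ColumnValues tuples built in the question's first snippet:

import csv

with open('export.csv', 'wb') as f:  # 'wb' is what the Python 2 csv module expects
    writer = csv.writer(f)
    for row in data:
        # coerce each column value to a UTF-8 string before writing
        writer.writerow([unicode(v).encode('utf-8') for v in row])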
If I do something like this:
from py2neo import Graph

graph = Graph()
stuff = graph.cypher.execute("""
    match (a:Article)-[p]-n return a, n, p.weight
""")
on a database with lots of articles and links, the query takes a long time and uses all my system's memory, presumably because it's copying the entire result set into memory in one go. Is there some kind of cursor-based version where I could iterate through the results one at a time without having to have them all in memory at once?
EDIT
I found the stream function:
stuff = graph.cypher.stream("""
    match (a:Article)-[p]-n return a, n, p.weight
""")
which seems to be what I want according to the documentation, but now I get a timeout error (py2neo.packages.httpstream.http.SocketError: timed out), followed by the server becoming unresponsive until I kill it with kill -9.
Have you tried implementing a paging mechanism? Perhaps with the skip keyword: http://neo4j.com/docs/stable/query-skip.html
Similar to using limit / offset in a postgres / mysql query.
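A minimal paging sketch along those lines, assuming the same py2neo 2.x graph.cypher.execute API used in the question (the page size and query are illustrative):

page_size = 1000
skip = 0
while True:
    # {skip} / {limit} are Cypher 2.x parameter placeholders
    page = graph.cypher.execute(
        "match (a:Article)-[p]-n return a, n, p.weight "
        "skip {skip} limit {limit}",
        skip=skip, limit=page_size)
    if len(page) == 0:
        break
    for record in page:
        pass  # process each record here
    skip += page_size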
EDIT: I previously said that the entire result set was stored in memory, but it appears this is not the case when using API streaming, per Nigel's (a Neo4j engineer) comment below.
I'm writing a method to update several fields in multiple instances in my database. For now, I'm trying to get it to work just for one.
My user uploads a CSV file with all the information to change (including the pk). I've written the function that parses all the information, and this all works fine. I can even assign the data to an item, and if I print it from that function, it comes out correctly. However, when I save the updates (using item.save()) nothing seems to change in the database.
Here's a very stripped down version of the method. I really don't know why it isn't working. I've done something very similar in other spots (getting data through a form, setting the field, calling save, and then displaying the changed information), and I've used a very similar CSV uploading technique to create new entries.
Small piece of relevant code:
reader = csv.reader(f)
for row in reader:
    pk = row[0]
    print pk
    item = POObject.objects.get(pk=pk)
    p2 = item.purchase2
    print item.purchase.requested_start_date
    print p2.requested_start_date
    requested_start_date = row[6]
    requested_start_date = datetime.datetime.strptime(requested_start_date, "%d %b %y")
    print requested_start_date
    p2.requested_start_date = requested_start_date
    p2.save()
    print p2.requested_start_date
    item.purchase2 = p2
    item.save()
    print item.purchase.requested_start_date
    return pk
Obviously I have lots of prints in there to find where stuff went wrong. Basically what I find is that if I look at item, it looks fine, but if I query the server again (after saving), i.e. doing item2 = POObject.objects.get(pk=pk), it won't have had any updates. Does anyone have any idea why save() isn't doing anything?
UPDATE:
The mystery continues.
If I update a field that isn't contained within an FK relation (say, a text field or something), everything seems to work fine. However, what I really need to do is update an item, and then set that item as the fk relation to the main item in question. I'm not sure why this isn't working in the normal way (updating the internal item, saving it, and then setting the fk to that new, updated item).
Whoa. I feel a little ashamed that I didn't figure this out. I had forgotten exactly how I had designed one of my models, and there was another object within it that needed to be updated, but I wasn't saving it.
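In other words, Django's save() does not cascade to related objects; every modified instance in the chain must be saved itself. A hedged sketch of the failure mode, with a hypothetical second-level relation standing in for the forgotten object:

item = POObject.objects.get(pk=pk)
p2 = item.purchase2
inner = p2.schedule              # hypothetical second-level related object
inner.requested_start_date = requested_start_date
inner.save()                     # the forgotten save: without it, the change is lost
p2.save()                        # persists p2's own fields only
item.save()                      # likewise persists only item's own fields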
I'm trying to use GeoModel python module to quickly access geospatial data for my Google App Engine app.
I just have a few general questions for issues I'm running into.
There are two main methods, proximity_fetch and bounding_box_fetch, that you can use to run queries. They actually return a result set, not a filtered query, which means you need to fully prepare a filtered query before passing it in. It also stops you from iterating over the query set lazily, since the results are already fetched, and you don't have the option of passing an offset into the fetch.
Short of modifying the code, can anyone recommend a solution for specifying an offset into the query? My problem is that I need to check each result against a variable to see if I can use it, otherwise throw it away and test the next. I may run into cases where I need to do an additional fetch, but starting with an offset.
You can also work directly with the location_geocells of your model.
from geospatial import geomodel, geocell, geomath

# query is a db.GqlQuery
# location is a db.GeoPt
# a resolution of 4 gives a box of roughly 150 km
bbox = geocell.compute_box(geocell.compute(geo_point.location, resolution=4))
cell = geocell.best_bbox_search_cells(bbox, geomodel.default_cost_function)
query.filter('location_geocells IN', cell)

# I want only results within 100 km
FETCHED = 200
DISTANCE = 100

def _func(x):
    x.dist = geomath.distance(geo_point.location, x.location)
    return x.dist

results = sorted(query.fetch(FETCHED), key=_func)
results = [x for x in results if x.dist <= DISTANCE]
There's no practical way to do this, because a call to geoquery devolves into multiple datastore queries, which it merges together into a single result set. If you were able to specify an offset, geoquery would still have to fetch and discard all the first n results before returning the ones you requested.
A better option might be to modify geoquery to support cursors, but each query would have to return a set of cursors, not a single one.
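For reference, this is what plain datastore cursors look like on a single query under the old db API; geoquery would need to track one such cursor per underlying subquery, which is why a single opaque offset or cursor doesn't fit:

# fetch a first page from one ordinary datastore query
results = query.fetch(FETCHED)
cursor = query.cursor()          # opaque resume point for this one query

# ...later, resume exactly where that query left off
query.with_cursor(cursor)
more_results = query.fetch(FETCHED)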