Neo4J / py2neo -- cursor-based query? - python

If I do something like this:
from py2neo import Graph
graph = Graph()
stuff = graph.cypher.execute("""
match (a:Article)-[p]-n return a, n, p.weight
""")
on a database with lots of articles and links, the query takes a long time and uses all my system's memory, presumably because it's copying the entire result set into memory in one go. Is there some kind of cursor-based version where I could iterate through the results one at a time without having to have them all in memory at once?
EDIT
I found the stream function:
stuff = graph.cypher.stream("""
match (a:Article)-[p]-n return a, n, p.weight
""")
which seems to be what I want according to the documentation, but now I get a timeout error (py2neo.packages.httpstream.http.SocketError: timed out), followed by the server becoming unresponsive until I kill it with kill -9.

Have you tried implementing a paging mechanism? Perhaps with the SKIP keyword: http://neo4j.com/docs/stable/query-skip.html
Similar to using LIMIT / OFFSET in a Postgres / MySQL query.
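A minimal paging sketch, assuming the py2neo 2.x graph.cypher.execute API used in the question and a Neo4j version that accepts parameters for SKIP/LIMIT; page_size is an arbitrary figure to tune:
from py2neo import Graph

graph = Graph()
page_size = 1000
skip = 0
while True:
    # Without an ORDER BY, page boundaries are not guaranteed to be stable;
    # add one if exact partitioning matters.
    page = graph.cypher.execute(
        "MATCH (a:Article)-[p]-(n) "
        "RETURN a, n, p.weight "
        "SKIP {skip} LIMIT {limit}",
        skip=skip, limit=page_size)
    records = list(page)
    if not records:
        break
    for record in records:
        pass  # process one record at a time here
    skip += page_size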
EDIT: I previously said that the entire result set was stored in memory, but it appears this is not the case when using API streaming, per Nigel's (Neo4j engineer) comment below.

Related

Efficient way to get data from lotus notes view

I am trying to get all data from a view (Lotus Notes) with LotusScript and Python (noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1000 lines of data took me 70 seconds, but the view has about 85000 lines, so getting all the data will take far too long; by comparison, when I manually use File->Export in Lotus Notes it takes about 2 minutes to export everything to CSV.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using "GetFirstDocument" and "GetNextDocument". Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something changes in the backend. But as you only read data and do not change view data, that will not give you much of a performance boost.
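As a rough sketch of that, assuming the view object returned by noteslib passes AutoUpdate straight through to the underlying COM NotesView:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
view.AutoUpdate = False  # don't refresh the view while we only read from it

data = []
doc = view.GetFirstDocument()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)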
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java, maybe). Even a basic LotusScript agent can export thousands of documents per minute. A further alternative may be to look at the Notes C-API (not an easy option, and it requires API calls from Python).

Is updating an array in Elasticsearch document thread safe?

I have a document which has an array of strings as one of its properties.
There is a case where multiple update queries are executed on the index from two threads and potentially can update the same document.
For example given a document where the value of my_array is:
my_array = [1,2,3,4]
The two threads execute the following update query, which runs a Painless script on the index with different params:
thread_0 -> my_item=3
thread_1 -> my_item=2
int index_of_my_item = ctx._source.my_array.indexOf(params.my_item);
if (-1 != index_of_my_item) {
    ctx._source.my_array.remove(index_of_my_item);
}
Is the execution thread safe, so that the value of the property in the document will end up as:
my_array = [1,4]
Or are there race conditions to consider?
Thanks,
ES uses optimistic concurrency control, implemented with a version number, so there is no concept of thread safety (from the ES perspective).
Just go through these links and they will give you a fair idea of what an update actually means for ES:
https://www.elastic.co/guide/en/elasticsearch/guide/master/version-control.html
https://www.elastic.co/guide/en/elasticsearch/guide/master/optimistic-concurrency-control.html
When updating a document with the index API, we read the original document, make our changes, and then reindex the whole document in one go. The most recent indexing request wins: whichever document was indexed last is the one stored in Elasticsearch. If somebody else had changed the document in the meantime, their changes would be lost.
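So if both removals must survive a concurrent write, ask the update API to retry when the document's version changed underneath it. A rough sketch with the elasticsearch-py client; the index name, document id and the 7.x-style body argument are assumptions, not your actual setup:
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.update(
    index="my_index",   # made-up index name
    id="doc_1",         # made-up document id
    retry_on_conflict=3,  # re-read and re-apply the script on version conflict
    body={
        "script": {
            "lang": "painless",
            "source": """
                int i = ctx._source.my_array.indexOf(params.my_item);
                if (i != -1) {
                    ctx._source.my_array.remove(i);
                }
            """,
            "params": {"my_item": 3},
        }
    },
)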

How to use iterview function from python couchdb

I have been working with the couchdb module in Python to meet some project needs. I was happily using the view method from couchdb to retrieve result sets from my database until recently.
for row in db.view(mapping_function):
    print row.key
However, lately I have been needing to work with databases a lot bigger than before (~15-20 GB). This is when I ran into an unfortunate issue.
The db.view() method loads all rows into memory before you can do anything with them. This is not an issue with small databases, but a big problem with large ones.
That is when I came across the iterview function. It looks promising, but I couldn't find any example usage of it. Can someone share or point me to example usage of the iterview function in python-couchdb?
Thanks - A
Doing this is almost working for me:
import couchdb.client

server = couchdb.client.Server()
db = server['db_name']
for row in db.iterview('my_view', 10, group=True):
    print row.key + ': ' + row.value
I say it almost works because it does return all of the data and all the rows are printed. However, at the end of the batch, it throws a KeyError exception inside couchdb/client.py (line 884) in iterview.
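One pragmatic workaround sketch, which only papers over the symptom rather than fixing whatever client.py is tripping on, is to catch the exception after the rows have been consumed:
rows = db.iterview('my_view', 10, group=True)
try:
    for row in rows:
        print row.key + ': ' + row.value
except KeyError:
    pass  # raised after the final batch in some python-couchdb versions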
This worked for me. You need to add include_docs=True to the iterview call, and then you will get a doc attribute on each row which can be passed to the database delete method:
import couchdb

server = couchdb.Server("http://127.0.0.1:5984")
db = server['your_db']  # the database name, not the view name
for row in db.iterview('your_view/your_view', 10, include_docs=True):
    # print(type(row))
    # print(type(row.doc))
    # print(dir(row))
    # print(row.id)
    # print(row.keys())
    db.delete(row.doc)

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40kB and is a large (ha) source of the problem. I've some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)

    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of GB size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection again has a cache for the last 100 statements, as you can see in the function description:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, it is not likely that you will benefit from actively caching statements or pages at application start.
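If you do want to trade memory for a warm start anyway, the closest approximation is to enlarge the page cache and touch the relevant pages once up front. A sketch; the cache_size figure is arbitrary, and how much the warm-up helps depends on which pages the scan actually reads:
import sqlite3

conn = sqlite3.connect('/some/file/path.db')
conn.execute("PRAGMA cache_size = 200000")  # pages of the default size; tune to your memory budget

# Touch the index and the join columns once so those pages land in the cache.
conn.execute("""SELECT COUNT(*)
                FROM data INNER JOIN main ON main.uid = data.ref_uid""").fetchone()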

GeoModel with Google App Engine - queries

I'm trying to use GeoModel python module to quickly access geospatial data for my Google App Engine app.
I just have a few general questions for issues I'm running into.
There are two main methods, proximity_fetch and bounding_box_fetch, that you can use to run queries. They actually return a result set, not a filtered query, which means you need to fully prepare a filtered query before passing it in. It also prevents you from iterating over the query set, since the results are already fetched, and you don't have the option of passing an offset into the fetch.
Short of modifying the code, can anyone recommend a solution for specifying an offset into the query? My problem is that I need to check each result against a variable to see if I can use it, otherwise throw it away and test the next. I may run into cases where I need to do an additional fetch, but starting with an offset.
You can also work directly with the location_geocells of your model.
from geospatial import geomodel, geocell, geomath

# query is a db.GqlQuery
# location is a db.GeoPt
# A resolution of 4 is a box of roughly 150 km
bbox = geocell.compute_box(geocell.compute(geo_point.location, resolution=4))
cells = geocell.best_bbox_search_cells(bbox, geomodel.default_cost_function)
query.filter('location_geocells IN', cells)

# I only want results within 100 km
FETCHED = 200
DISTANCE = 100

def _func(x):
    x.dist = geomath.distance(geo_point.location, x.location)
    return x.dist

results = sorted(query.fetch(FETCHED), key=_func)
results = [x for x in results if x.dist <= DISTANCE]
There's no practical way to do this, because a geoquery call devolves into multiple datastore queries that it merges together into a single result set. Even if you could specify an offset, geoquery would still have to fetch and discard the first n results before returning the ones you requested.
A better option might be to modify geoquery to support cursors, but each query would have to return a set of cursors, not a single one.
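If you only need to resume past results you have already processed, the workaround is client-side: over-fetch and skip yourself. A sketch; the proximity_fetch signature is from memory of the GeoModel source, and is_usable is a hypothetical stand-in for whatever per-result check you are doing:
from geospatial import geomodel

def fetch_with_offset(query, center, offset, count, max_results=500):
    # Over-fetch, apply your own predicate, then slice out the "page".
    results = geomodel.proximity_fetch(query, center, max_results=max_results)
    usable = [r for r in results if is_usable(r)]  # is_usable() is hypothetical
    return usable[offset:offset + count]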
