I use SPARQLWrapper in Python to query a web endpoint with many different queries in a loop.
So I tried to make it work like this (let queries hold all the different queries and result the results):
sparql = SPARQLWrapper("url")
prefix = "prefix..."
for i in range(len(queries)):
    sparql.setQuery(prefix + queries[i])
    result[i] = sparql.query().convert()
But this does not work. The first query I pick from the list returns the expected result, but none of the other queries do.
Instead of that, I now use this:
for i in range(len(queries)):
    sparql, prefix = initializeSPARQL()
    sparql.setQuery(prefix + queries[i])
    result[i] = sparql.query().convert()
and also
def initializeSPARQL():
    sparql = SPARQLWrapper("url")
    prefix = "prefix..."
    return sparql, prefix
That works, and performance is not an issue either, since the querying itself is the bottleneck. But is there a better solution? This approach just seems wrong...
It is strange, because I've been checking the code, and the query() method is completely stateless, so I have no idea why it's failing.
With i > 1, what does result[i] contain?
May I suggest that you try the following?
sparql = SPARQLWrapper("url")
prefix = "prefix..."
results = []
for i in range(len(queries)):
    sparql.resetQuery()
    sparql.setQuery(prefix + queries[i])
    results.append(sparql.query().convert())
I'm one of the developers of the library.
Your first attempt triggers a bug. I'll check what internal data structure is kept from the previous usage, so that the library can be used in that way.
Your second solution, even if it works, is not the right way to do it.
As I said, I'll take a look at how to fix this.
For the future, please submit a proper bug report to the project or an email to the mailing list.
I have around 500 lines of sqlite code and I am trying to reduce that number, so I tried making a function with variables.
I tried using this code:
x = "c.execute("
o = "'"
y = "'INSERT INTO "
unio = x+y
tabl = ["d","d_2","n_d","n_d_2"]
val = "VALUES ("
w = ") "
k = "?,?"
mark = 369
ek = 963
def db_up(table):
    aa = y+table+"("+tabl[0] + ", " + tabl[1]+w+val+k+")"+"',"+"(mark,ek,)"
    bb = unio+table+"("+tabl[1]+ ", " + tabl[1]+w+val+k+")"+"',"+"(mark,ek,))"
    print(aa)           # 'INSERT INTO avg_dt(d, d_2) VALUES (?,?)',(mark,ek,)
    print(bb)           # c.execute('INSERT INTO avg_dt(d_2, d_2) VALUES (?,?)',(mark,ek,))
    c.execute(str(aa))  # no success
    c.execute(aa)       # no success
    bb                  # no success
When I run the c.execute(aa) line, it throws this error:
sqlite3.OperationalError: near "'INSERT INTO avg_dt(d, d_2) VALUES (?,?)'": syntax error
So... how can I write sqlite3 code using variables and functions?
Thanks for taking the time ;)
I can't post a comment with my current rep, so let me try to provide a full answer.
Your approach seems to be that you have too much SQL code, which you consider a problem, so you're trying to "compress" the code by reducing common statements into shorter variable names. There are a number of reasons you should avoid doing this, and a number of better ways to achieve a similar result.
First, let's talk about why this is a bad idea:
Your code doesn't become more maintainable. Interspersing cryptic variable names like x, o, y, or unio doesn't make the code any easier to read, even for the author, assuming a few days have passed since you wrote it.
Using this kind of method doesn't make your code any more performant, and most likely makes it less performant: your program now has to worry about allocating and reallocating memory when doing string interpolation or concatenation, which takes cycles.
Doing string interpolation or concatenation in SQL should be done with extreme caution: this is essentially a homebrew version of prepared statements, which are usually authored by people with loads of experience in SQL programming. Rolling this on your own risks your program being targeted by SQL injection (or at least column/value type mismatches).
Now, let's talk about mitigating this issue for you:
Lots of SQL code need not be unmanageable: typically, if you must maintain large amounts of raw SQL in your project, you dump that into a separate SQL file which you can then either execute directly from the database CLI or run from your program via a database driver.
ORMs are (usually) your friend: with the exception of exotic or outstandingly performance-sensitive SQL queries, modern ORMs can get rid of most raw SQL code in your program. The more naturally programmatic structure of ORM code also means you can break it up into different delegate functions to avoid code duplication, to an extent.
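To make the prepared-statement point concrete, here is a minimal sketch of how your INSERT could be written with sqlite3 placeholders instead of string concatenation. The table and column names are taken from the output printed in your question, and the connection setup and an existing avg_dt table are assumed:

import sqlite3

conn = sqlite3.connect("example.db")   # assumed database file
c = conn.cursor()

def db_up(table, columns, values):
    # Identifiers (table/column names) cannot be bound as parameters,
    # so they are interpolated here; the values themselves are bound by sqlite3.
    placeholders = ", ".join("?" for _ in values)
    sql = "INSERT INTO %s(%s) VALUES (%s)" % (table, ", ".join(columns), placeholders)
    c.execute(sql, values)

# Based on the data in your question:
db_up("avg_dt", ["d", "d_2"], (369, 963))
conn.commit()

Only the values go through parameter binding, so keep the table and column names out of user input.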
Please feel free to add details to your question; as it stands, it's not totally clear whether your concerns can be addressed with this answer.
I am trying to get all data from a view (Lotus Notes) with LotusScript and Python (noteslib module) and export it to csv, but the problem is that this takes too much time. I have tried two ways, looping through all documents:
import noteslib
db = noteslib.Database('database','file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1000 lines of data took me 70 seconds, but the view has about 85000 lines, so getting all the data would take far too long, considering that exporting everything to csv manually via File->Export in Lotus Notes takes about 2 minutes.
I also tried a second way, with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.Getfirstentry()
while ent:
    row = []
    for v in ent.Columnvalues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using GetFirstDocument and GetNextDocument. Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But since you only read data and do not change view data, that will not give you much of a performance boost.
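For illustration, a minimal sketch of where that setting could go in the Python code from the question; this assumes the view object returned by noteslib exposes the underlying NotesView AutoUpdate property through COM:

import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
view.AutoUpdate = False   # assumption: COM NotesView.AutoUpdate is exposed; stops view refreshes while reading

data = []
doc = view.GetFirstDocument()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)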
My suggestion: identify the REAL bottleneck in your code by commenting out individual sections to find out where it starts to get slow:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java, maybe). Even a basic LotusScript agent can export thousands of docs per minute. A further alternative may be to look at the Notes C API (not an easy option, and it requires API calls from Python).
I use py2neo (v 1.9.2) to write data to a neo4j db.
batch = neo4j.WriteBatch(graph_db)
current_relationship_index = graph_db.get_or_create_index(neo4j.Relationship, "Current_Relationship")
touched_relationship_index = graph_db.get_or_create_index(neo4j.Relationship, "Touched_Relationship")
get_rel = current_relationship_index.get(some_key1, some_value1)
if len(get_rel) == 1:
    batch.add_indexed_relationship(touched_relationship_index, some_key2, some_value2, get_rel[0])
elif len(get_rel) == 0:
    created_rel = current_relationship_index.create(some_key3, some_value3, (my_start_node, "KNOWS", my_end_node))
    batch.add_indexed_relationship(touched_relationship_index, some_key4, "touched", created_rel)
batch.submit()
Is there a way to replace current_relationship_index.get(..) and current_relationship_index.create(...) with a batch command? I know that there is one, but the problem is that I need to act depending on the return value of these commands. And I would like to have all statements in a batch for performance reasons.
I have read that it is rather uncommon to index relationships, but the reason I do it is the following: I need to parse some (text) file every day and then check whether any of the relations have changed compared to the previous day, i.e. if a relation no longer exists in the text file, I want to mark it with a "replaced" property in neo4j. Therefore, I add all "touched" relationships to the appropriate index, so I know that these did not change. All relations that are not in the touched_relationship_index obviously do not exist anymore, so I can mark them.
I can't think of an easier way to do so, even though I'm sure that py2neo offers one.
EDIT: Considering Nigel's comment I tried this:
my_rel = batch.get_or_create_indexed_relationship(current_relationship_index, some_key, some_value, my_start_node, my_type, my_end_node)
batch.add_indexed_relationship(touched_relationship_index, some_key2, some_value2, my_rel)
batch.submit()
This obviously does not work, because I can't refer to "my_rel" in the batch. How can I solve this? Refer to the result of the previous batch statement with "0"? But consider that the whole thing is supposed to run in a loop, so the numbers are not fixed. Maybe use some variable "batch_counter" that refers to the current batch statement and is incremented whenever a statement is added to the batch?
Have a look at WriteBatch.get_or_create_indexed_relationship. It can conditionally create a relationship based on whether or not one currently exists, and it operates atomically. Documentation link below:
http://book.py2neo.org/en/latest/batches/#py2neo.neo4j.WriteBatch.get_or_create_indexed_relationship
There are a few similar uniqueness management facilities in py2neo that I recently blogged about here that you might want to read about.
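For illustration, a minimal sketch of how the edited code from the question might look inside a single batch. It assumes (please verify against the batch documentation for your py2neo version) that the object returned by one batch call can be passed to a later call in the same batch as a reference to its result; otherwise a numeric job reference would be needed:

batch = neo4j.WriteBatch(graph_db)

# Assumption: the return value is a reference to this job within the batch,
# not a concrete Relationship, and later batch calls accept it as an argument.
my_rel = batch.get_or_create_indexed_relationship(current_relationship_index, some_key, some_value, my_start_node, my_type, my_end_node)
batch.add_indexed_relationship(touched_relationship_index, some_key2, some_value2, my_rel)

results = batch.submit()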
I have a very long list of emails that I would like to process to:
separate good emails from bad emails, and
remove duplicates but keep all the non-duplicates in the same order.
This is what I have so far:
email_list = ["joe@example.com", "invalid_email", ...]
email_set = set()
bad_emails = []
good_emails = []
dups = False
for email in email_list:
    if email in email_set:
        dups = True
        continue
    email_set.add(email)
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)
I would like this chunk of code to be as fast as possible, and of less importance, to minimize memory requirements. Is there a way to improve this in Python? Maybe using list comprehensions or iterators?
EDIT: Sorry! Forgot to mention that this is Python 2.5, since this is for GAE.
email_re is from django.core.validators
Look at: Does Python have an ordered set? , and select an implementation you like.
So just:
email_list = OrderedSet(["joe@example.com", "invalid_email", ...])
bad_emails = []
good_emails = []
for email in email_list:
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)
This is probably the fastest and simplest solution you can achieve.
I can't think of any way to speed up what you have. It's fast to use a set to keep track of things, and it's fast to use a list to store a list.
I like the OrderedSet solution, but I doubt a Python implementation of OrderedSet would be faster than what you wrote.
You could use an OrderedDict to solve this problem, but that was added in Python 2.7. You could use a recipe (like http://code.activestate.com/recipes/576693/) to add OrderedDict, but again I don't think it would be any faster than what you have.
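For illustration, a minimal sketch of the OrderedDict approach (it assumes Python 2.7+, or the linked recipe on 2.5, and reuses email_re from the question):

from collections import OrderedDict   # or the ActiveState recipe on Python 2.5

deduped = OrderedDict.fromkeys(email_list)   # drops duplicates, keeps first-seen order
dups = len(deduped) < len(email_list)

good_emails = []
bad_emails = []
for email in deduped:
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)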
I'm trying to think of a Python module that is implemented in C to solve this problem. I think that's the only hope of beating your code. But I haven't thought of anything.
If you can get rid of the dups flag, it will be faster simply by running less Python code.
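For example, a minimal sketch of that idea, derived from the code in the question, where the flag is computed once after the loop instead of inside it:

email_set = set()
good_emails = []
bad_emails = []
for email in email_list:
    if email in email_set:
        continue
    email_set.add(email)
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)

# A duplicate was skipped if fewer emails came out than went in.
dups = len(good_emails) + len(bad_emails) < len(email_list)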
Interesting question. Good luck.
I'm trying to use GeoModel python module to quickly access geospatial data for my Google App Engine app.
I just have a few general questions for issues I'm running into.
There are two main methods, proximity_fetch and bounding_box_fetch, that you can use to return queries. They actually return a result set, not a filtered query, which means you need to fully prepare a filtered query before passing it in. It also prevents you from iterating over the query set, since the results are already fetched, and you don't have the option to pass an offset into the fetch.
Short of modifying the code, can anyone recommend a solution for specifying an offset into the query? My problem is that I need to check each result against a variable to see if I can use it, otherwise throw it away and test the next. I may run into cases where I need to do an additional fetch, but starting with an offset.
You can also work directly with the location_geocells of your model.
from geospatial import geomodel, geocell, geomath

# query is a db.GqlQuery
# location is a db.GeoPt
# A resolution of 4 is a box of roughly 150 km
bbox = geocell.compute_box(geocell.compute(geo_point.location, resolution=4))
cell = geocell.best_bbox_search_cells(bbox, geomodel.default_cost_function)
query.filter('location_geocells IN', cell)

# I want only results within 100 km.
FETCHED = 200
DISTANCE = 100

def _func(x):
    x.dist = geomath.distance(geo_point.location, x.location)
    return x.dist

results = sorted(query.fetch(FETCHED), key=_func)
results = [x for x in results if x.dist <= DISTANCE]
There's no practical way to do this, because a call to geoquery devolves into multiple datastore queries, which it merges together into a single result set. If you were able to specify an offset, geoquery would still have to fetch and discard all the first n results before returning the ones you requested.
A better option might be to modify geoquery to support cursors, but each query would have to return a set of cursors, not a single one.