Performance: look up in a list or run a SQL query - Python

I developed a piece of software with PyQt and SQLite to manage scientific articles. Each article is stored in the SQLite database and comes from a particular journal.
Sometimes I need to perform some verifications on the articles of a journal, so I build two lists: one containing the DOIs of the articles (a DOI is just a unique ID for an article), and one containing booleans, True if the article is ok and False if it is not:
def listDoi(self, journal_abb):
    """Function to get the doi from the database.
    Also returns a list of booleans to check if the data are complete"""
    list_doi = []
    list_ok = []
    query = QtSql.QSqlQuery(self.bdd)
    query.prepare("SELECT * FROM papers WHERE journal=?")
    query.addBindValue(journal_abb)
    query.exec_()
    while query.next():
        record = query.record()
        list_doi.append(record.value('doi'))
        if record.value('graphical_abstract') != "Empty":
            list_ok.append(True)
        else:
            list_ok.append(False)
    return list_doi, list_ok
This function returns the two lists. The lists can contain ~2000 items each. After that, to check whether an article is ok, I just check whether it is present in both lists.
EDIT: I also need to check whether an article is only in list_doi.
So I wonder, because performance matters here: what is faster/better/more economical:
build the two lists, and check whether the article is present in both lists, or
rewrite the function as checkArticle(doi_article), so that it performs a SQL query for each article?
What about speed and RAM usage? Will the results be different if there are few items rather than many?

Use time.perf_counter() to determine how long this process currently takes.
import time

time_start = time.perf_counter()
# your code here
print(time.perf_counter() - time_start)
Based on that, if it is going too slow, you can try each of your options and time them as well, looking for an improvement in performance. As for checking the RAM usage, a simple way is this:
import os
import psutil
process = psutil.Process(os.getpid())
print(process.memory_info().rss / float(2 ** 20))  # the resident memory usage in MB
For a more in-depth memory usage check, look here: https://stackoverflow.com/a/110826/3841261 Always have a way to objectively measure when looking to improve speed/RAM usage/etc.

I would execute one SQL query that finds the articles that are OK all at once (perhaps in a function called find_articles() or something).
Think of it this way: why do something twice (copy all those rows and work with them) when you could do it once?
You basically want to execute something like this:
SELECT * FROM papers WHERE (PAPERID IN OTHERTABLE AND OTHER_RESTRAINT = "WHATEVER")
That's obviously just pseudocode, but I think you can figure it out.
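A minimal sketch of that idea adapted to the question's own papers table (my adaptation, not the answer's exact query; it reuses the QtSql connection from the question): one query returns each DOI together with its "ok" flag, so both membership checks become O(1) dictionary lookups instead of list scans.
def doiStatus(self, journal_abb):
    """Sketch: map each DOI of a journal to True (ok) or False in a single query."""
    status = {}
    query = QtSql.QSqlQuery(self.bdd)
    query.prepare("SELECT doi, graphical_abstract FROM papers WHERE journal=?")
    query.addBindValue(journal_abb)
    query.exec_()
    while query.next():
        record = query.record()
        status[record.value('doi')] = record.value('graphical_abstract') != "Empty"
    return status
Usage: doi in status tells you the article exists for that journal, and status.get(doi, False) tells you it exists and is ok.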

Related

Efficient way to get data from a Lotus Notes view

I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that this takes too much time. I have tried two ways of looping through all documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1,000 lines of data took 70 seconds, but the view has about 85,000 lines, so fetching everything this way will take far too long; by comparison, a manual File->Export in Lotus Notes takes about 2 minutes to export all the data to CSV.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.Getfirstentry()
while ent:
    row = []
    for v in ent.Columnvalues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using GetFirstDocument and GetNextDocument. Using a NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly so.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False, to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
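For reference, a minimal sketch of that setting applied to the question's own noteslib objects (AutoUpdate is a standard NotesView property; whether the COM wrapper passes the assignment through exactly like this is an assumption):
view.AutoUpdate = False   # assumption: property assignment works through the COM wrapper
doc = view.GetFirstDocument()
data = []
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)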
My suggestion: identify the REAL bottleneck in your code by commenting out individual sections to find out when it starts to get slow:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not, then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not, then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes agent to do this for you instead (in LotusScript or Java, perhaps). Even a basic LotusScript agent can export thousands of documents per minute. A further alternative may be to look at the Notes C API (not an easy option, and it requires making API calls from Python).

Query multiple values at a time with pymongo

Currently I have a Mongo document that looks like this:
{'_id': id, 'title': title, 'date': date}
What I'm trying to do is search within this collection by title. The database has about 5k items, which is not much, but my file has 1 million titles to search for.
I have ensured an index on title within the collection, but the performance is still quite slow (about 40 seconds per 1,000 titles, which is not surprising since I'm doing one query per title). Here is my code so far:
Work repository creation:
class WorkRepository(GenericRepository, Repository):
    def __init__(self, url_root):
        super(WorkRepository, self).__init__(url_root, 'works')
        self._db[self.collection].ensure_index('title')
The entry point of the program (it is a REST API):
start = time.clock()
for work in json_works:  # 1000 titles per request
    result = work_repository.find_works_by_title(work['title'])
    if result:
        works[work['id']] = result
end = time.clock()
print(end - start)
return json_encoder(request, works)
and find_works_by_title code:
def find_works_by_title(self, work_title):
    works = list(self._db[self.collection].find({'title': work_title}))
    return works
I'm new to Mongo and have probably made some mistake. Any recommendations?
You're making one call to the DB for each of your titles. The roundtrip is going to significantly slow the process down (the program and the DB will spend most of their time doing network communications instead of actually working).
Try the following (adapt it to your program's structure, of course):
# Build a list of the 1000 titles you're searching for.
titles = [w["title"] for w in json_works]
# Make exactly one call to the DB, asking for all of the matching documents.
return collection.find({"title": {"$in": titles}})
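To plug that back into the question's request handler, a sketch (names are taken from the question; `collection` here stands for `work_repository`'s underlying pymongo collection, which is an assumption about its internals):
titles = [w["title"] for w in json_works]                  # the 1000 titles of this request
by_title = {}
for doc in collection.find({"title": {"$in": titles}}):    # one round trip to the DB
    by_title.setdefault(doc["title"], []).append(doc)
works = {w["id"]: by_title[w["title"]] for w in json_works if w["title"] in by_title}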
Further reference on how the $in operator works: http://docs.mongodb.org/manual/reference/operator/query/in/
If after that your queries are still slow, use explain on the find call's return value (more info here: http://docs.mongodb.org/manual/reference/method/cursor.explain/) and check that the query is, in fact, using an index. If it isn't, find out why.
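Continuing the sketch above, a quick way to check that from pymongo (the exact plan fields differ between MongoDB server versions):
plan = collection.find({"title": {"$in": titles}}).explain()
print(plan)  # look for an index scan (e.g. "IXSCAN") rather than a full collection scan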

Redis as a Queue - Bulk Retrieval

Our Python application serves around 2 million API requests per day. We got a new requirement from the business to generate a report containing the count of unique requests and responses every day.
We would like to use Redis for queuing all the requests & responses.
Another worker instance will retrieve the above data from Redis queue and process it.
The processed results will be persisted to the database.
The simplest option is to use LPUSH and RPOP, but RPOP returns only one value at a time, which hurts performance. Is there any way to do a bulk pop from Redis?
Other suggestions for the scenario would be highly appreciated.
A simple solution would be to use Redis pipelining.
In a single request you are allowed to perform multiple RPOP instructions.
Most Redis drivers support it. In Python with redis-py it looks like this:
import redis

r = redis.Redis()
pipe = r.pipeline()
# The following RPOP commands are buffered
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
pipe.rpop('requests')
# the EXECUTE call sends all buffered commands to the server, returning
# a list of responses, one for each command.
pipe.execute()
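Generalizing the snippet above into a small reusable helper (a sketch using the same r connection; the key name 'requests' and the batch size are just illustrative):
def pop_batch(r, key, batch_size=100):
    """Pop up to batch_size items from a Redis list in a single round trip."""
    pipe = r.pipeline()
    for _ in range(batch_size):
        pipe.rpop(key)
    # execute() returns one result per buffered command; None means the list ran dry.
    return [item for item in pipe.execute() if item is not None]

batch = pop_batch(r, 'requests')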
You can approach this from a different angle. Your requirement is:
requirement ... to generate the report which should contain the count of unique request and response every day.
Rather than storing requests in lists and then post-processing the results, why not use Redis features to solve the actual requirement and avoid the problem of bulk LPUSH/RPOP altogether?
If all we want is to record the unique counts, then you may want to consider using sorted sets.
This may go like this:
Collect the request statistics
# Collect the request statistics in a sorted set.
# The key should include the date (e.g. 'requests:2015-12-05') so we can do "by date" stats.
key = 'requests:date'
r.zincrby(key, 1, request)  # redis-py 3.x argument order: (name, amount, value)
Report request statistics
You can use ZSCAN to iterate over all members in batches, but this is unordered.
You can use ZRANGE to get all members in one go (or in ranges), ordered by score.
Python code:
# ZSCAN: iterate over all members of the set in batches of about 10.
# This will be an unordered listing.
# zscan_iter returns tuples (member, score).
batchSize = 10
for memberTuple in r.zscan_iter(key, match=None, count=batchSize):
    member = memberTuple[0]
    score = memberTuple[1]
    print(str(member) + ' --> ' + str(score))
# ZRANGE: get all members in the set, ordered by score.
# Here maxRank=-1 means "no max".
minRank = 0
maxRank = -1
for memberTuple in r.zrange(key, minRank, maxRank, desc=False, withscores=True):
    member = memberTuple[0]
    score = memberTuple[1]
    print(str(member) + ' --> ' + str(score))
Benefits of this approach
It solves the actual requirement: it reports the count of unique requests by day.
There is no need to post-process anything.
You can do additional queries like "top requests" out of the box :)
Another approach would be to use the HyperLogLog data structure.
It was designed especially for this kind of use case.
It allows counting unique items with a low error margin (0.81%) and very low memory usage.
Using an HLL is really simple:
PFADD myHll "<request1>"
PFADD myHll "<request2>"
PFADD myHll "<request3>"
PFADD myHll "<request4>"
Then to get the count:
PFCOUNT myHll
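The same idea from Python with redis-py (a sketch, assuming the same r connection and request value as in the sorted-set answer above; the per-day key name is my own illustration):
day_key = 'requests:hll:2015-12-05'          # one HyperLogLog per day
r.pfadd(day_key, request)                    # duplicates are ignored automatically
unique_requests_today = r.pfcount(day_key)   # approximate count of unique requests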
The original question was about a Redis list. You can use LRANGE to get all values in a single call; below is a solution:
import redis

r_server = redis.Redis("localhost")
r_server.rpush("requests", "Adam")
r_server.rpush("requests", "Bob")
r_server.rpush("requests", "Carol")
print(r_server.lrange("requests", 0, -1))
print(r_server.llen("requests"))
print(r_server.lindex("requests", 1))
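One caveat worth noting (my addition, not part of the answer above): LRANGE only reads the list, so if the goal is to actually consume the queue, it needs to be paired with LTRIM, for example:
batch = r_server.lrange("requests", 0, 99)    # read up to 100 items
r_server.ltrim("requests", len(batch), -1)    # keep only what was not read, dropping the batch
# Not atomic on its own: wrap both calls in a MULTI/EXEC transaction (pipeline)
# or a Lua script if producers and consumers run concurrently.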

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB and is a large (ha) source of the problem. There is some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for the sortable information and the data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']

    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print(n)
        # Make up some data (a ~40 kB blob of random float64 values)
        data = rng.rand(blob_size).astype(float).tobytes()
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print(tstamps[-1])
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of GB size are never loaded into memory in their entirety. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection additionally keeps a cache for the last 100 statements, as you can see in the function description:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache size, it is unlikely that you will benefit from actively caching statements or pages at application start.
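A minimal sketch combining the two knobs with the question's setup (the numbers are illustrative, not recommendations):
import sqlite3

# Larger statement cache (default 100) and a bigger page cache (default 2000 pages).
conn = sqlite3.connect('/some/file/path.db', cached_statements=200)
conn.execute("PRAGMA cache_size = 4000")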

GeoModel with Google App Engine - queries

I'm trying to use the GeoModel Python module to quickly access geospatial data in my Google App Engine app.
I just have a few general questions about issues I'm running into.
There are two main methods, proximity_fetch and bounding_box_fetch, that you can use to return queries. They actually return a result set, not a filtered query, which means you need to fully prepare a filtered query before passing it in. It also prevents you from iterating lazily over the query, since the results are already fetched, and you don't have the option to pass an offset into the fetch.
Short of modifying the code, can anyone recommend a solution for specifying an offset into the query? My problem is that I need to check each result against a variable to see if I can use it, and otherwise throw it away and test the next one. I may run into cases where I need to do an additional fetch, but starting from an offset.
You can also work directly with the location_geocells of your model.
from geospatial import geomodel, geocell, geomath

# query is a db.GqlQuery
# location is a db.GeoPt
# A resolution of 4 is a box of roughly 150 km
bbox = geocell.compute_box(geocell.compute(geo_point.location, resolution=4))
cell = geocell.best_bbox_search_cells(bbox, geomodel.default_cost_function)
query.filter('location_geocells IN', cell)

# I only want results within 100 km.
FETCHED = 200
DISTANCE = 100

def _func(x):
    x.dist = geomath.distance(geo_point.location, x.location)
    return x.dist

results = sorted(query.fetch(FETCHED), key=_func)
results = [x for x in results if x.dist <= DISTANCE]
There's no practical way to do this, because a call to geoquery devolves into multiple datastore queries, which it merges together into a single result set. If you were able to specify an offset, geoquery would still have to fetch and discard all the first n results before returning the ones you requested.
A better option might be to modify geoquery to support cursors, but each query would have to return a set of cursors, not a single one.
