Efficient way to get data from lotus notes view - python

I am trying to get all data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to CSV, but the problem is that it takes too much time. I have tried two ways of looping through all documents:
import noteslib
db = noteslib.Database('database','file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1000 rows of data took me 70 seconds, but the view has about 85000 rows, so getting all the data would take far too long, especially since a manual File->Export in Lotus Notes takes only about 2 minutes to export everything to CSV.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?

It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/

Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using GetFirstDocument and GetNextDocument. Using the NotesViewNavigator mentioned in the comments will be slightly better, but not significantly.
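For completeness, a minimal sketch of the navigator variant from Python; the method names come from the Domino COM API and are passed through noteslib, so treat this as an untested assumption rather than a verified recipe:
nav = view.CreateViewNav()
entry = nav.GetFirst()
data = []
while entry:
    data.append(entry.ColumnValues)  # column values of the view entry, no document is opened
    entry = nav.GetNext(entry)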
You might squeeze a little extra performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something changes in the back end. But since you only read data and never change it, that will not give you much of a boost.
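In Python that setting can be applied to the view object from the question before the loop starts; a minimal sketch, reusing the same database and view names as above:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
view.AutoUpdate = False  # stop the view from refreshing while it is being read

data = []
doc = view.GetFirstDocument()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)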
My suggestion: identify the REAL bottleneck of your code by commenting out individual sections to find out where it starts to get slow:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.

I suspect the performance issue comes from using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider having a Notes 'agent' do this for you instead (LotusScript or Java, perhaps). Even a basic LotusScript agent can export thousands of documents per minute. A further alternative is the Notes C API (not an easy option, and it requires making API calls from Python).
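If you go the agent route, one option is to trigger the agent from Python over the same COM connection and then read the CSV it produced. A rough sketch, where the agent name and output path are assumptions (the agent itself has to already exist in the database and do the export server-side):
import csv
import noteslib

db = noteslib.Database('database', 'file.nsf')
agent = db.GetAgent('ExportMyViewToCsv')  # hypothetical agent name
agent.Run()

# Read back whatever the agent wrote; the path is an assumption.
with open('export.csv') as f:
    rows = list(csv.reader(f))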

Related

Querying XML with Azure Databricks

I currently have a console app which strips down XML files using XPath.
I wanted to recreate this in Databricks in order to speed up processing time. I'm processing around 150k XML files, and it takes around 4 hours with the console app.
This is a segment of the console app (I have around another 30 XPath conditions, though all similar to the below). As you can see, one XML file can contain multiple ShareClass elements, which means one row of data per ShareClass.
XDocument xDoc = XDocument.Parse(xml);
IEnumerable<XElement> elList = xDoc.XPathSelectElements("/x:Feed/x:AssetOverview/x:ShareClasses/x:ShareClass", nm);
foreach (var el in elList)
{
DataRow dr = dt.NewRow();
dr[0] = el.Attribute("Id")?.Value;
dr[1] = Path.GetFileNameWithoutExtension(fileName);
dr[2] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Fund Management Company']/x:Code", nm)?.Value;
dr[3] = el.XPathSelectElement("x:Profile/x:ServiceProviders/x:ServiceProvider[@Type='Promoter']/x:Code", nm)?.Value;
dr[4] = el.XPathSelectElement("x:Profile/x:Names/x:Name[@Type='Full']/x:Text[@Language='ENG']", nm)?.Value;
I have replaced this with a Python notebook, shown below, but I'm getting very poor performance: it takes around 8 hours to run, double my old console app.
The first step is to read the XML files into a DataFrame as strings:
df = spark.read.text(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/Extracted", wholetext=True)
I then use xpath and explode to do the same actions as the console app. Below is a trimmed section of the code
df2 = df.selectExpr("xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/@Id') id",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Code/text()') FundManagerCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Code/text()') PromoterCode",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Fund Management Company\"]/Name/text()') FundManagerName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/ServiceProviders/ServiceProvider[@Type=\"Promoter\"]/Name/text()') PromoterName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/Profile/Names/Name[@Type=\"Legal\"]/Text[@Language=\"ENG\"]/text()') FullName",
    "xpath(value, 'Feed/AssetOverview/ShareClasses/ShareClass/FeesExpenses/Fees/Report/FeeSet/Fee[@Type=\"Total Expense Ratio\"]/Percent/text()') Fees",
).selectExpr("explode_outer(arrays_zip(id,FundManagerCode,PromoterCode,FullName,FundManagerName,PromoterName,ISIN_Code,SEDOL_Code,APIR_Code,Class_A,Class_B,KD_1,KD_2,KD_3,KD_4,AG_1,KD_5,KD_6,KD_7,MIFI_1,MIFI_2,MIFI_3,MIFI_4,MIFI_5,MIFI_6,MIFI_7,MIFI_8,MIFI_9,Prev_Perf,AG_2,MexId,AG_3,AG_4,AG_5,AG_6,AG_7,MinInvestVal,MinInvestCurr,Income,Fees)) shareclass"
).select('shareClass.*')
I feel like there must be a much simpler and quicker way to process this data, though I don't have enough knowledge to know what it is.
When I first started rewriting the console app, I loaded the data through Spark using com.databricks.spark.xml and then tried to use JSON path to do the querying; however, the Spark implementation of JSON path isn't rich enough to do some of the structural querying you can see above.
If anyone can think of a way to speed up my code, or a completely different way to do it, I'm all ears. The XML structure has lots of nesting, but it doesn't feel like it should be too difficult to flatten those nests into columns. I want to use as much standard functionality as possible without writing my own loops through the structure, as I feel that would be even slower.
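For what it's worth, the com.databricks.spark.xml reader mentioned above can also emit one row per ShareClass directly by pointing rowTag at that element, which avoids the per-file xpath() calls. A rough sketch, assuming the package is attached to the cluster; the rowTag value may need the namespace prefix, and the flattening shown is only indicative:
# One DataFrame row per ShareClass element; nested elements become struct or
# array columns, and attributes get an "_" prefix by default.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "ShareClass")
      .load(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/Extracted"))

flat = df.select(
    df["_Id"].alias("id"),
    # Repeated elements such as ServiceProvider come back as arrays of structs,
    # so the Type filters from the XPath version would be done with
    # explode()/filter() or higher-order functions rather than path predicates.
    df["Profile.ServiceProviders.ServiceProvider"].alias("service_providers"),
)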

How to use iterview function from python couchdb

I have been working with the couchdb module in Python to meet some project needs. I was happily using the view method from couchdb to retrieve result sets from my database until recently.
for row in db.view(mapping_function):
    print row.key
However, lately I have been needing to work with databases a lot bigger than before (~15-20 GB). This is when I ran into an unfortunate issue.
The db.view() method loads all rows in memory before you can do anything with them. This is not an issue with small databases, but it is a big problem with large ones.
That is when I came across the iterview function. It looks promising, but I couldn't find an example usage of it. Can someone share or point me to an example usage of the iterview function in python-couchdb?
Thanks - A
Doing this is almost working for me:
import couchdb.client

server = couchdb.client.Server()
db = server['db_name']
for row in db.iterview('my_view', 10, group=True):
    print row.key + ': ' + row.value
I say it almost works because it does return all of the data and all the rows are printed. However, at the end of the batch, it throws a KeyError exception inside couchdb/client.py (line 884) in iterview
This worked for me. You need to add include_docs=True to the iterview call, and then you will get a doc attribute on each row which can be passed to the database delete method:
import couchdb

server = couchdb.Server("http://127.0.0.1:5984")
db = server['your_view']
for row in db.iterview('your_view/your_view', 10, include_docs=True):
    # print(type(row))
    # print(type(row.doc))
    # print(dir(row))
    # print(row.id)
    # print(row.keys())
    db.delete(row.doc)

Extracting information from unconventional text files? (Python)

I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some Python code defining a sequence of arrays. They look something like this:
#PHASE = 0
x = np.array(1,2,...)
y = np.array(3,4,...)
z = np.array(5,6,...)
#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)
#PHASE = 40
...
And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file, which can then be read by ascii.read() as a Table object for manipulation in a different section of code.
My current method is extremely inefficient, both in terms of resources and the time/energy required to assemble it. It goes something like this: start with a function
def makeTable(a,b,c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output
Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code
fileName_phase = makeTable(a,b,c)
Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.
Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?
If efficiency and code reuse (rather than copy-paste) are the goal, I think classes might provide a good way. My idea: create a class, say FileWithArrays, and use a parser to read the lines and store them inside the FileWithArrays object you create from the class. Once that's done, you can add a method that turns the object into a table.
P.S. A good approach for the parser is to store all the lines in a list and parse them one by one, using list.pop() to shrink the list as you go. Try rewriting/reformatting the question if I misunderstood anything; it's not very easy to read.
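A rough sketch of that idea follows; every name here (FileWithArrays, to_tables, the regex) is a placeholder for illustration, and it assumes Table comes from astropy, as the question's use of ascii.read suggests:
import re
import numpy as np
from astropy.table import Table

class FileWithArrays:
    """Parses one collaborator file into {phase: {variable: array}}."""

    def __init__(self, path):
        self.path = path
        self.phases = {}
        self._parse()

    def _parse(self):
        with open(self.path) as f:
            lines = f.read().splitlines()
        current = None
        while lines:
            line = lines.pop(0).strip()  # list.pop(), as suggested above
            if line.startswith('#PHASE'):
                current = line.split('=')[1].strip()
                self.phases[current] = {}
            elif current is not None and '=' in line:
                name, expr = [s.strip() for s in line.split('=', 1)]
                values = re.search(r'\((.*)\)', expr).group(1).strip('[]')
                self.phases[current][name] = np.fromstring(values, sep=',')

    def to_tables(self):
        """Return one Table per phase, keyed by the phase value."""
        tables = {}
        for phase, cols in self.phases.items():
            t = Table()
            for name, arr in cols.items():
                t[name] = arr
            tables[phase] = t
        return tables
Each resulting Table could then be written out with ascii.write exactly as in the question.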
I will suggest a way that will be scorned by many, but it will get your work done, so apologies to everyone in advance.
The prerequisite for this method is that you absolutely trust the correctness of the input files, which I guess you do (after all, he is your collaborator).
The key point here is that the text in the file is code, which means it can be executed.
So you can do something like this:
import re
import numpy as np # this is for the actual code in the files. You might have to install numpy library for this to work.
file = open("xyz.txt")
content = file.read()
Now that you have all the content, you have to separate it by phase.
For this we will use the re.split function.
phase_data = re.split("#PHASE = .*\n", content)
Now we have the content of each phase in an array.
Now comes the part of executing it.
for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z)  # the x, y and z are defined by the exec
    # do whatever you want with the table
I will reiterate that you have to absolutely trust the contents of the file, since you are executing them as code.
But your work seems like a scripting task, and I believe this will get it done.
PS: The other, "safer" alternative to exec is a sandboxing library which takes the string and executes it without affecting the parent scope.
To avoid the safety issue of using exec as suggested by @Ajay Brahmakshatriya, but keeping his first processing step, you can create your own minimal 'phase parser', something like:
VARS = 'xyz'

def makeTable(phase):
    # 'phase' is the list of lines belonging to one #PHASE block
    assert len(phase) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in phase[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        assert arr[:10] == 'np.array([' and arr[-2:] == '])'
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output
and then, for each phase block (split into its individual lines), call
table = makeTable(phase)
instead of
exec(phase)
table = makeTable(x, y, z)
You could also skip all these assert statements without compromising safety; if the file is corrupted or not formatted as expected, the error that is thrown might just be harder to understand.
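Putting the splitting step and this parser together, a possible driver loop (the glob pattern and output names are assumptions, and it assumes every #PHASE header is followed by its three arrays) might look like:
import glob
import re
from astropy.io import ascii

for path in glob.glob('collaborator_file_*.txt'):  # hypothetical file names
    with open(path) as f:
        content = f.read()
    # Keep the phase values for the output names, and split the body on the markers.
    phase_values = re.findall(r'#PHASE\s*=\s*(\S+)', content)
    blocks = [b for b in re.split(r'#PHASE\s*=.*\n', content) if b.strip()]
    for value, block in zip(phase_values, blocks):
        table = makeTable(block.strip().splitlines())  # makeTable from above, fed the lines of one phase
        ascii.write(table, '%s_phase%s.dat' % (path, value))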

Query multiple values at a time pymongo

Currently I have a mongo document that looks like this:
{'_id': id, 'title': title, 'date': date}
What I'm trying to do is search within this collection by title. The database has about 5k items, which is not much, but my file has 1 million titles to search for.
I have ensured an index on title within the collection, but the performance is still quite slow (about 40 seconds per 1000 titles, which is expected since I'm doing one query per title). Here is my code so far:
Work repository creation:
class WorkRepository(GenericRepository, Repository):
    def __init__(self, url_root):
        super(WorkRepository, self).__init__(url_root, 'works')
        self._db[self.collection].ensure_index('title')
The entry point of the program (it is a REST API):
start = time.clock()
for work in json_works:  # 1000 titles per request
    result = work_repository.find_works_by_title(work['title'])
    if result:
        works[work['id']] = result
end = time.clock()
print end-start
return json_encoder(request, works)
and find_works_by_title code:
def find_works_by_title(self, work_title):
    works = list(self._db[self.collection].find({'title': work_title}))
    return works
I'm new to Mongo and have probably made some mistake somewhere; any recommendations?
You're making one call to the DB for each of your titles. The roundtrip is going to significantly slow the process down (the program and the DB will spend most of their time doing network communications instead of actually working).
Try the following (adapt it to your program's structure, of course):
# Build a list of the 1000 titles you're searching for.
titles = [w["title"] for w in json_works]
# Make exactly one call to the DB, asking for all of the matching documents.
return collection.find({"title": {"$in": titles}})
Further reference on how the $in operator works: http://docs.mongodb.org/manual/reference/operator/query/in/
If after that your queries are still slow, use explain on the find call's return value (more info here: http://docs.mongodb.org/manual/reference/method/cursor.explain/) and check that the query is, in fact, using an index. If it isn't, find out why.
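A hedged sketch of both suggestions in pymongo (the client setup, collection name, and chunk size are assumptions, not taken from the question):
from pymongo import MongoClient

client = MongoClient()
collection = client.mydb.works  # assumed database/collection names

titles = [w['title'] for w in json_works]

# One $in query per chunk of titles instead of one query per title.
CHUNK = 1000
results = []
for i in range(0, len(titles), CHUNK):
    results.extend(collection.find({'title': {'$in': titles[i:i + CHUNK]}}))

# Verify that the title index is actually being used.
print(collection.find({'title': {'$in': titles[:CHUNK]}}).explain())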

What data is cached during a "select" in sqlite3/Python, and can this be done manually from the start?

Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40 kB, and they are a large (ha) source of the problem. I've included some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time

ref_uid_index = """CREATE INDEX ref_uid_idx
                   ON data(ref_uid)"""

def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
    def_schema_split0 = """
        CREATE TABLE main (
            uid INTEGER PRIMARY KEY,
            name TEXT,
            label INTEGER,
            ignore INTEGER default 0,
            fold INTEGER default 0)"""
    def_schema_split1 = """
        CREATE TABLE data (
            uid INTEGER PRIMARY KEY,
            ref_uid INTEGER REFERENCES main(uid),
            data BLOB)"""
    def_insert_split0 = """
        INSERT INTO main (name, label, fold)
        VALUES (?,?,?)"""
    def_insert_split1 = """
        INSERT INTO data (ref_uid, data)
        VALUES (?,?)"""
    blob_size = 5000
    k_folds = 5
    some_names = ['apple', 'banana', 'cherry', 'date']
    dbconn = sqlite3.connect(db_file)
    dbconn.execute(def_schema_split0)
    dbconn.execute(def_schema_split1)
    rng = np.random.RandomState()
    for n in range(num_points):
        if n % 1000 == 0 and VERBOSE:
            print n
        # Make up some data
        data = buffer(rng.rand(blob_size).astype(float))
        fold = rng.randint(k_folds)
        label = rng.randint(num_classes)
        rng.shuffle(some_names)
        # And add it
        dbconn.execute(def_insert_split0, [some_names[0], label, fold])
        ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
        dbconn.execute(def_insert_split1, [ref_uid, data])
    dbconn.execute(ref_uid_index)
    dbconn.commit()
    return dbconn

def timeit_join(dbconn, n_times=10, num_rows=500):
    qmarks = "?," * (num_rows - 1) + "?"
    q_join = """SELECT data.data, main.uid, main.label
                FROM data INNER JOIN main ON main.uid=data.ref_uid
                WHERE main.uid IN (%s)""" % qmarks
    row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
    tstamps = []
    for n in range(n_times):
        now = time.time()
        uids = np.random.randint(low=1, high=row_max, size=num_rows).tolist()
        res = dbconn.execute(q_join, uids).fetchall()
        tstamps += [time.time() - now]
        print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgeable sages?
Database files of GB size are never loaded into memory entirely. They are split into a tree of so-called pages. These pages are cached in memory; the default is 2000 pages.
You can use the following statement to, for example, double the number of cached pages of 1 kB size:
conn.execute("""PRAGMA cache_size = 4000""")
The connection also keeps a cache for the last 100 statements, as you can see in the function signature:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects an integer and defaults to 100.
Apart from setting the cache sizes, it is not likely that you will benefit from actively caching statements or pages at application start.
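A minimal sketch of both knobs in Python (the values are only examples):
import sqlite3

# Keep up to 200 prepared statements instead of the default 100.
conn = sqlite3.connect('/some/file/path.db', cached_statements=200)

# Double the page cache from the default 2000 pages (actual memory use depends on the page size).
conn.execute("PRAGMA cache_size = 4000")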
