I'm trying to use xpath to query a large html with multiple tables and only extract a few tables that contain a specific pattern in one of the cells. I'm running into time related challenges.
I've tried to minimize by issue as much as possible.
code setup: - creates 10 (300x15) tables with random values between 0-100
import pandas as pd
import numpy as np
dataframes = [pd.DataFrame(np.random.randint(0,100, (300, 15)), columns=[f"Col-{i}" for i in range(15)]) for k in range(10)]
html_strings = [df.to_html() for df in dataframes]
combined_html = '\n'.join(html_strings)
source_html = f'<html><body>{combined_html}</body></html>'
code execution: I want to extract all tables that have the value "80" in them (in this case it will be all 10 tables)
from lxml import etree
root = etree.fromstring(source_html.encode())
PAT = '80' # this should result in returning all 10 tables as 80 will definitely be there in all of them (pandas index)
# method 1: query directly using xpath - this takes a long time to run - and this seems to exhibit exponential time behavior
xpath_query = "//table//*[text() = '{PAT}']/ancestor::table"
tables = root.xpath(xpath_query)
# method 2: this runs in under a second. first get all the tables and then run the same xpath expression within the table context
all_tables = root.xpath('//table')
table_xpath_individual = ".//*[text() = '{PAT}']/ancestor::table"
selected_tables = [table for table in all_tables if table.xpath(table_xpath_individual)]
method 1 takes 40-50s to finish
method 2 takes <1s
I'm not sure whether it's the xpath expression in method 1 that's problematic or it's an lxml issue here. I've switched to using method 2 for now - but unsure whether it's a 100% equivalent behavior
I don't know if this is relevant (but I suspect so). You can simplify these XPaths by eliminating the //* step and the trailing /ancestor::table step.
//table[descendant::text() = '{PAT}']
Note that in your problematic XPath, for each table you will find every descendant element whose text is 80 (there might be many within a table) and for each one, return all of that element's table ancestors (again, because in theory there might be more than one if you had a table containing a table, the XPath processor is going to have to laboriously traverse all those ancestor paths). That will return a potentially large number of results which the XPath processor will then have to deduplicate, so that it doesn't return multiple instances of any given table (an XPath 1.0 nodeset is guaranteed not to contain duplicates).
Related
I am trying to get all data from view(Lotus Notes) with lotusscript and Python(noteslib module) and export it to csv, but problem is that this takes too much time. I have tried two ways with loop through all documents:
import noteslib
db = noteslib.Database('database','file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
data.append(doc.ColumnValues)
doc = view.GetNextDocument(doc)
To get about 1000 lines of data it took me 70 seconds, but view has about 85000 lines so get all data will be too much time, because manually when I use File->Export in Lotus Notes it is about 2 minutes to export all data to csv.
And I tried second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.Getfirstentry()
while ent:
row = []
for v in ent.Columnvalues:
row.append(v)
database.append(row)
ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
values = entry.ColumnValues '*** Array of column values
'*** Do stuff here
Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: You already chose the most performant way to navigate a view using "GetFirstDocument" and "GetNextDocument". Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significant.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prohibit the view object to refresh when something in the backend changes. But as you only read data and not change view data that will not give you much of a performance boost.
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:
while doc:
doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
arr = doc.ColumnValues
doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
arr = doc.ColumnValues
data.append(arr)
doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java maybe). Even a basic LotusScript agent can export 000's of docs per minute. A further alternative may be to look at the Notes C-API (not an easy option and requires API calls from Python).
I developed a software with PyQt and sqlite to manage scientific articles. Each article is stored in the sqlite database, and comes from a particular journal.
Sometimes, I need to perform some verifications on the articles of a journal. So I build two lists, one containing the DOI of the articles (a DOI is just a unique id for an article), and one containing booleans, True if the articles are ok, False if the articles are not:
def listDoi(self, journal_abb):
"""Function to get the doi from the database.
Also returns a list of booleans to check if the data are complete"""
list_doi = []
list_ok = []
query = QtSql.QSqlQuery(self.bdd)
query.prepare("SELECT * FROM papers WHERE journal=?")
query.addBindValue(journal_abb)
query.exec_()
while query.next():
record = query.record()
list_doi.append(record.value('doi'))
if record.value('graphical_abstract') != "Empty":
list_ok.append(True)
else:
list_ok.append(False)
return list_doi, list_ok
This function returns the two lists. The lists can contain ~2000 items each. After that, to check if an article is ok, I just check if it is in the two lists.
EDIT: I also need to check if an article is only in list_doi.
So I wonder, because performance matters here: what is faster/better/more economic:
build the two lists, and check if the article is present in the two lists
write the function in another way: checkArticle(doi_article), and the function would perform a SQL query for each article
What about the speed and the space in RAM ? Will the results be different if there are few items, or a lot of them ?
Use time.perf_counter() to determine how long this process takes currently.
time_start = time.perf_counter()
# your code here
print(time.perf_counter() - time_start)
Based on that, if it is going too slow(), you can try each of your options, and time them as well to look for an improvement in performance. As for checking the RAM usage, a simple way is this:
import os
import psutil
process = psutil.Process(os.getpid())
print process.get_memory_info()[0] / float(2 ** 20) # return the memory usage in MB
For a more in-depth memory usage check, look here: https://stackoverflow.com/a/110826/3841261 Always have a way to objectively measure when looking to improve speed/RAM usage/etc.
I would execute one sql query that finds the articles that are OK at once (perhaps in a function called find_articles() or something)
Think of it this way, why do something twice (copy all those rows and work with them) when you could do it once?
You want to basically execute this:
SELECT * from papers where (PAPERID in OTHERTABLE and OTHER RESTRAINT = "WHATEVER")
That's obviously just Pseudocode but I think you can figure it out.
Currently I have a mongo document that looks like this:
{'_id': id, 'title': title, 'date': date}
What I'm trying is to search within this document by title, in the database I have like 5ks items which is not much, but my file has 1 million of titles to search.
I have ensure the title as index within the collection, but still the performance time is quite slow (about 40 seconds per 1000 titles, something obvious as I'm doing a query per title), here is my code so far:
Work repository creation:
class WorkRepository(GenericRepository, Repository):
def __init__(self, url_root):
super(WorkRepository, self).__init__(url_root, 'works')
self._db[self.collection].ensure_index('title')
The entry of the program (is a REST api):
start = time.clock()
for work in json_works: #1000 titles per request
result = work_repository.find_works_by_title(work['title'])
if result:
works[work['id']] = result
end = time.clock()
print end-start
return json_encoder(request, works)
and find_works_by_title code:
def find_works_by_title(self, work_title):
works = list(self._db[self.collection].find({'title': work_title}))
return works
I'm new to mongo and probably I've made some mistake, any recommendation?
You're making one call to the DB for each of your titles. The roundtrip is going to significantly slow the process down (the program and the DB will spend most of their time doing network communications instead of actually working).
Try the following (adapt it to your program's structure, of course):
# Build a list of the 1000 titles you're searching for.
titles = [w["title"] for w in json_works]
# Make exactly one call to the DB, asking for all of the matching documents.
return collection.find({"title": {"$in": titles}})
Further reference on how the $in operator works: http://docs.mongodb.org/manual/reference/operator/query/in/
If after that your queries are still slow, use explain on the find call's return value (more info here: http://docs.mongodb.org/manual/reference/method/cursor.explain/) and check that the query is, in fact, using an index. If it isn't, find out why.
I have a many to many relationship between two tables/objects: Tag and Content. Tag.content is the relationship from a tag to all content which has this tag.
Now I'd like to find out the number of content objects assigned to a tag (for all tags, otherwise I'd simply use len()). The following code almost works:
cnt = db.func.count()
q = db.session.query(Tag, cnt) \
.outerjoin(Tag.content) \
.group_by(Tag) \
.order_by(cnt.desc())
However, it will never return a zero count for obvious reasons - there is at least one row per tag after all due to the LEFT JOIN used. This is a problem though since I'd like to get the correct count for all tags - i.e. 0 if a tag is orphaned.
So I wonder if there's a way to achieve this - obviously without sending n+1 queries to the database. A pure-SQL solution might be ok too, usually it's not too hard to map such a solution to SA somehow.
.filter(Tag.content.any()) removes the results with the incorrect count, but it will do so by removing the rows from the resultset altogether which is not what I want.
Solved it. I needed to use DISTINCT in the COUNTs:
cnt = db.func.count(db.distinct(Content.id))
Suppose you have a sqlite database with several thousand rows -- each of which either contains or references a sizable, unique blob -- and you want to sparsely sample this collection, pulling rows based on rowid or some equivalent primary key. I find that the first time I attempt to fetch several (500) datapoints after connecting (out of 20k rows), the call takes over 10 seconds to return; and, with every successive iteration, the calls get shorter and shorter, until converging to around 100 milliseconds after 50-100 such queries.
Clearly, either sqlite or its python wrapper must be caching... something. If I clear out inactive memory (I'm in OS X, but I think Linux has a comparable if-not-identical "purge" command?), the behavior can be replicated exactly. The question is, what is it caching that an index doesn't address? And furthermore, is it possible to automatically pull whatever information is accelerating these queries into memory from the start? Or is there something else I've missed entirely?
A few notes in case someone doesn't immediately know the answer...
Each blob is around 40kB, and are a large (ha) source of the problem. I've some code below for anyone who wants to play along at home, but I've had better luck keeping separate tables for sortable information and data. This introduces an inner join, but it's generally been better than keeping it all together (although if anyone feels this is wrong, I'm keen to hear it). Without the inner join / data fetch, things start at 4 seconds and drop to 3 ms in a hurry.
I feel like this might be a PRAGMA thing, but I fiddled with some settings suggested by others in the wilderness of the web and didn't really see any benefit.
In-memory databases are not an option. For one, I'm trying to share across threads (which might not actually be a problem for in-mems...? not sure), but more importantly the database files are typically on the order of 17 GB. So, that's out.
That being said, there's no problem caching a reasonable amount of information. After a few dozen calls, inactive memory gets somewhat bloated anyways, but I'd rather do it (1) right and (2) efficiently.
Okay, now some code for anyone who wants to try to replicate things. You should be able to copy and paste it into a stand-alone script (that's basically what I did, save for formatting).
import sqlite3
import numpy as np
import time
ref_uid_index = """CREATE INDEX ref_uid_idx
ON data(ref_uid)"""
def populate_db_split(db_file, num_classes=10, num_points=20000, VERBOSE=False):
def_schema_split0 = """
CREATE TABLE main (
uid INTEGER PRIMARY KEY,
name TEXT,
label INTEGER,
ignore INTEGER default 0,
fold INTEGER default 0)"""
def_schema_split1 = """
CREATE TABLE data (
uid INTEGER PRIMARY KEY,
ref_uid INTEGER REFERENCES main(uid),
data BLOB)"""
def_insert_split0 = """
INSERT INTO main (name, label, fold)
VALUES (?,?,?)"""
def_insert_split1 = """
INSERT INTO data (ref_uid, data)
VALUES (?,?)"""
blob_size= 5000
k_folds = 5
some_names = ['apple', 'banana', 'cherry', 'date']
dbconn = sqlite3.connect(db_file)
dbconn.execute(def_schema_split0)
dbconn.execute(def_schema_split1)
rng = np.random.RandomState()
for n in range(num_points):
if n%1000 == 0 and VERBOSE:
print n
# Make up some data
data = buffer(rng.rand(blob_size).astype(float))
fold = rng.randint(k_folds)
label = rng.randint(num_classes)
rng.shuffle(some_names)
# And add it
dbconn.execute(def_insert_split0,[some_names[0], label, fold])
ref_uid = dbconn.execute("SELECT uid FROM main WHERE rowid=last_insert_rowid()").fetchone()[0]
dbconn.execute(def_insert_split1,[ref_uid,data])
dbconn.execute(ref_uid_index)
dbconn.commit()
return dbconn
def timeit_join(dbconn, n_times=10, num_rows=500):
qmarks = "?,"*(num_rows-1)+"?"
q_join = """SELECT data.data, main.uid, main.label
FROM data INNER JOIN main ON main.uid=data.ref_uid
WHERE main.uid IN (%s)"""%qmarks
row_max = dbconn.execute("SELECT MAX(rowid) from main").fetchone()[0]
tstamps = []
for n in range(n_times):
now = time.time()
uids = np.random.randint(low=1,high=row_max,size=num_rows).tolist()
res = dbconn.execute(q_join, uids).fetchall()
tstamps += [time.time()-now]
print tstamps[-1]
Now, if you want to replicate things, do the following. On my machine, this creates an 800MB database and produces something like below.
>>> db = populate_db_split('/some/file/path.db')
>>> timeit_join(db)
12.0593519211
5.56209111214
3.51154184341
2.20699000359
1.73895692825
1.18351387978
1.27329611778
0.934082984924
0.780968904495
0.834318161011
So... what say you, knowledgable sages?
Database files with GB size are never loaded into the memory entirely. They are split into a tree of socalled pages. These pages are cached in the memory, the default is 2000 pages.
You can use the following statement to e.g. double the number of cached pages of 1kB size.
conn.execute("""PRAGMA cache_size = 4000""")
The connection again has a cache for the last 100 statements, as you can see in the function description:
sqlite3.connect(database[, timeout, detect_types, isolation_level, check_same_thread, factory, cached_statements])
cached_statements expects and integer and defaults to 100.
Except from setting up the cache size, it is not likely that you benefit from actively caching statements or pages at the application start.