Querying relational data in a reasonable amount of time - python

I have a spreadsheet with about 1.7m lines, totalling 1 GB, and need to perform queries on it.
Being most comfortable with Python, my first approach was to hack together a bunch of dictionaries keyed in a way that would facilitate the queries I was trying to make. E.g. if I needed to access everyone with a particular area code and age, I would make an areacode_age two-dimensional dictionary. I ended up needing quite a few of them, which multiplied the memory footprint to ~10 GB, and even though I had enough RAM the process was slow.
I then turned to sqlite3 and imported my data into an in-memory database. It turns out that a query like SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f takes 0.05 seconds. Doing this for my 1.7m rows would take 24 hours. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn't key on date1 and date2, so I was getting every row that matched name and then filtering by date).
Why is this so slow, and how can I make it fast? And what is the Pythonic approach? Possibilities I've been considering:
sqlite3 is too slow and I need something more heavyweight.
I need to change my schema or my queries to be more optimized.
I need a new tool of some kind.
I read somewhere that in sqlite 3, doing repeated calls to cursor.execute is much slower than using cursor.executemany. It turns out executemany isn't compatible with select statements though, so I think this was a red herring.

sqlite3 is too slow, and I need something more heavyweight
First, sqlite3 is fast, sometimes faster than MySQL.
Second, you have to use an index; putting a compound index on (date1, date2, name) will speed things up significantly.
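For illustration, a minimal sketch of doing that with Python's sqlite3 module. The table and column names (foo, date1, date2, name, a, b, c) come from the question's example query; the values for d, e, f are made-up placeholders:
import sqlite3

conn = sqlite3.connect(":memory:")  # or the path to your database file
conn.execute("CREATE TABLE foo (a, b, c, date1, date2, name)")  # stand-in for the real table

# A compound index on the columns in the WHERE clause lets SQLite narrow the
# search instead of scanning all 1.7m rows.  (Putting the equality-tested
# column, name, first is often even better.)
conn.execute("CREATE INDEX dates_and_name ON foo (date1, date2, name)")

# Placeholder values standing in for d, e, f from the question.
d, e, f = "2013-06-01", "2013-01-01", "alice"

# One parameterized query returns every matching row; no need to run it per row.
rows = conn.execute(
    "SELECT a, b, c FROM foo WHERE date1 <= ? AND date2 > ? AND name = ?",
    (d, e, f),
).fetchall()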

It turns out though, that doing a query like "SELECT (a, b, c) FROM
foo WHERE date1<=d AND date2>e AND name=f" takes 0.05 seconds. Doing
this for my 1.7m rows would take 24 hours of compute time. My hacky
approach with dictionaries was about 3 orders of magnitude faster for
this particular task (and, in this example, I couldn't key on date1
and date2 obviously, so I was getting every row that matched name and
then filtering by date).
Did you actually try this and observe that it was taking 24 hours? Processing time is not necessarily directly proportional to data size.
And are you suggesting that you might need to run SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f 1.7 million times? You only need to run it once, and it will return the entire subset of rows matching your query.
1.7 million rows is not small, but certainly not an issue for a database entirely in memory on your local computer. (No slow disk access; no slow network access.)
Proof is in the pudding. This is pretty fast for me (most of the time is spent in generating ~ 10 million random floats.)
import sqlite3, random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (a FLOAT, b FLOAT, c FLOAT, d FLOAT, e FLOAT, f FLOAT)")
for _ in xrange(1700000):
    data = [random.random() for _ in xrange(6)]
    conn.execute("INSERT INTO numbers VALUES (?,?,?,?,?,?)", data)
conn.commit()

print "done generating random numbers"

results = conn.execute("SELECT * FROM numbers WHERE a > 0.5 AND b < 0.5")
accumulator = 0
for row in results:
    accumulator += row[0]

print ("Sum of column `a` where a > 0.5 and b < 0.5 is %f" % accumulator)
Edit: Okay, so you really do need to run this 1.7 million times.
In that case, what you probably want is an index. To quote Wikipedia:Database Index:
A database index is a data structure that improves the speed of data
retrieval operations on a database table at the cost of slower writes
and increased storage space. Indexes can be created using one or more
columns of a database table, providing the basis for both rapid random
lookups and efficient access of ordered records.
You would do something like CREATE INDEX dates_and_name ON foo(date1,date2,name) and then (I believe) execute the rest of your SELECT statements as usual. Try this and see if it speeds things up.
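To check whether the index is actually being used, SQLite's EXPLAIN QUERY PLAN is handy. A small sketch, assuming the foo table from the question, the dates_and_name index from the statement above, and placeholder values d, e, f; the database filename is also a placeholder:
import sqlite3

conn = sqlite3.connect("mydata.db")   # placeholder path to the database holding foo

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT a, b, c FROM foo WHERE date1 <= ? AND date2 > ? AND name = ?",
    (d, e, f),   # the placeholder values from the question's query
).fetchall()

# If the index can be used, the plan mentions something like
# 'SEARCH ... USING INDEX dates_and_name' rather than 'SCAN TABLE foo'.
print(plan)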

Since you are already talking SQL, the easiest approach will be:
Put all your data into a MySQL table. It should perform well for 1.7 million rows.
Add the indexes you need, check the settings, and make sure it will run fast on a big table.
Access it from Python (a rough sketch follows below).
...
Profit!
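A rough sketch of those steps with the MySQLdb driver; the connection details are placeholders, and the table/column names are taken from the question's example query:
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="mydb")
cur = conn.cursor()

# After loading the spreadsheet rows into foo (e.g. LOAD DATA INFILE or a
# bulk executemany INSERT), index the columns the queries filter on.
cur.execute("CREATE INDEX dates_and_name ON foo (date1, date2, name)")

# One parameterized query returns every matching row.
cur.execute(
    "SELECT a, b, c FROM foo WHERE date1 <= %s AND date2 > %s AND name = %s",
    (d, e, f),   # placeholder values from the question's query
)
rows = cur.fetchall()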

Related

Django queryset: how to change the returned data structure

This problem is related to a gaming arcade parlor where people come in and play games. Each time a person plays, a new entry is created in the database.
My model is like this:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)
My view is like this:
today = datetime.now().date()
# Query the gaming_machine objects created today with a score of 192 or 100, and
# count them separately per (machine_no, score) combination.
gaming_machine.objects.filter(Q(points=100) | Q(points=192), created__startswith=today).values_list('machine_no', 'points').annotate(Count('machine_no'))
# this returns a queryset of tuples -> (machine_no, points, count)
<QuerySet [(330, 192, 2), (330, 100, 4), (331, 192, 7), (331, 192, 8)]>
Can I change the returned queryset format to something like this:
{(330, 192): 2, (330, 100): 4, (331, 192): 7, (331, 192): 8} # a dictionary with a tuple (machine_no, score) as the key and the count of such machine_nos as the value
I am aware that I can change the format of this queryset on the Python side using something like a dictionary comprehension, but I can't do that, as it takes around 1.4 seconds because Django querysets are lazy.
Django's lazy queries...
but i can't do that as it takes around 1.4 seconds of time to do that because django querysets are lazy.
The laziness of Django's querysets actually has (close to) no impact on performance. They are lazy in the sense that they postpone querying the database until you need the result (for example, when you start iterating over it). But then they fetch all the rows at once. So there is no overhead of fetching the next row each time: all rows are fetched, and then Python iterates over them quite fast.
The laziness is thus not on a row-by-row basis: it does not advance the cursor each time you want to fetch the next row. The communication with the database is thus (quite) limited.
... and why it does not matter (performance-wise)
Unless the number of rows is huge (50,000 or more), the conversion to a dictionary should also happen rather fast, so I suspect the overhead comes from the query itself. Django already has to "deserialize" the response into tuples, so while the comprehension adds some extra work, it is usually small compared to the work that is done anyway. Typically you push work into the query when doing so reduces the amount of data transferred to Python.
For example, by performing the count in the database, it returns a single integer per group instead of several rows; by filtering, we reduce the number of rows as well (since typically not all rows match the criteria). Furthermore, the database usually has fast lookup mechanisms that speed up WHEREs, GROUP BYs, ORDER BYs, etc. But merely reshaping the result stream into a different object is not something the database would do meaningfully faster than Python.
So the dictionary comprehension should do:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        Q(points=100) | Q(points=192), created__startswith=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}
Speeding up queries
Since the problem is probably located at the database, you probably want to consider some possibilities for speedup.
Building indexes
Typically the best way to boost query performance is to construct an index on columns that you filter on frequently and that have a large number of distinct values.
In that case the database constructs a data structure that stores, for every value of that column, a list of rows that match that value. As a result, instead of reading through all the rows and selecting the relevant ones, the database can go straight to that data structure and find the matching rows in far less time.
Note that this typically only helps if the column contains a large number of distinct values: if, for example, the column only contains two values (1% of the rows are 0 and 99% are 1) and we filter on the very common value, the index will not produce much of a speedup, since the set we still need to process has approximately the same size.
So, depending on how distinct the values are, we can add indexes to the points and created fields:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField(db_index=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
Improve the query
Secondly, we can also aim to improve the query itself, although this might be more database-dependent (if we have two queries q1 and q2, it is possible that q1 runs faster than q2 on a MySQL database while q2 runs faster than q1 on a PostgreSQL database). So this is quite tricky: there are of course some things that typically work in general, but it is hard to give guarantees.
For example, sometimes x IN (100, 192) works faster than x = 100 OR x = 192 (see here). Furthermore, you use __startswith here, which might perform well - depending on how the database stores timestamps - but it can result in a computationally expensive query if the datetime has to be converted first. In any case, it is more declarative to use created__date, since it makes clear that you want the date part of created to equal today, so a more efficient query is probably:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        points__in=[100, 192], created__date=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}

Profile Millions of Text Files In Parallel Using An Sqlite Counter?

A mountain of text files (of types A, B and C) is sitting on my chest, slowly, coldly refusing me desperately needed air. Over the years each type spec has had enhancements such that yesterday's typeA file has many more properties than last year's typeA. To build a parser that can handle the decade-long evolution of these file types it makes sense to inspect all 14 million of them iteratively, calmly, but before dying beneath their crushing weight.
I built a running counter such that every time I see a property (familiar or not) I add 1 to its tally. The sqlite tally board has one row per property and one count column per file type.
In the special event that I see an unfamiliar property, I add a new row for it to the tally (so on a typeA file, a new row whose typeA count starts at 1).
I've got this system down! But it's slow: roughly 3M files per 36 hours in one process. Originally I was using this trick to pass sqlite a list of properties needing an increment.
placeholder = '?'  # For SQLite. See DBAPI paramstyle.
placeholders = ', '.join(placeholder for dummy_var in properties)
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE property IN (%s)""" % (type_name, type_name, placeholders)
cursor.execute(sql, properties)
I learned that's a bad idea because
sqlite string search is much slower than indexed search
several hundred properties (some 160 characters long) make for really long SQL queries
using %s instead of ? is bad security practice... (not a concern ATM)
A "fix" was to maintain a script side property-rowid hash of the tally used in this loop:
Read file for new_properties
Read tally_board for rowid, property
Generate script side client_hash from 2's read
Write rows to tally_board for every new_property not in property (nothing incremented yet). Update client_hash with new properties
Lookup rowid for every row in new_properties using the client_hash
Write increment to every rowid (now a proxy for property) to tally_board
Step 6 looks like:
sql = """UPDATE tally_board
         SET %s = %s + 1
         WHERE rowid IN %s""" % (type_name, type_name, tuple(target_rows))
cur.execute(sql)
The problems with this are:
It's still slow!
It manifests a race condition in parallel processing that introduces duplicates in the property column whenever threadA starts step 2 right before threadB completes step 6.
A solution to the race condition is to give steps 2-6 an exclusive lock on the db, though it doesn't look like reads can take such a lock.
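For what it's worth, a write lock can be taken explicitly from Python's sqlite3 module; a sketch, with the filename as a placeholder and autocommit enabled so BEGIN/COMMIT are issued by hand. BEGIN IMMEDIATE blocks other writers until the COMMIT, which serializes steps 2-6 across workers:
import sqlite3

conn = sqlite3.connect("tally.db", isolation_level=None)  # autocommit: we manage transactions ourselves
cur = conn.cursor()

cur.execute("BEGIN IMMEDIATE")   # take the write lock up front
try:
    # ... steps 2-6: read tally_board, insert new properties, apply increments ...
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")
    raise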
Another attempt uses a genuine UPSERT to increment preexisting property rows AND insert (and increment) new property rows in one fell swoop.
There may be luck in something like this but I'm unsure how to rewrite it to increment the tally.
How about a change of table schema? Instead of a column per type, have a type column. Then you have unique rows identified by property and type, like this:
|rowid|prop    |type |count|
=============================
|1    |prop_foo|typeA|215  |
|2    |prop_foo|typeB|456  |
This means you can enter a transaction for each and every property of each and every file separately and let sqlite worry about races. So for each property you encounter, immediately issue a complete transaction that computes the next total and upserts the record identified by the property name and file type.
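A minimal sketch of that per-property upsert with Python's sqlite3, assuming the reshaped table is named tally_board with columns prop, type and count (names chosen for illustration), and noting that the ON CONFLICT ... DO UPDATE syntax needs SQLite 3.24 or newer:
import sqlite3

conn = sqlite3.connect("tally.db")
conn.execute("""CREATE TABLE IF NOT EXISTS tally_board (
                    prop  TEXT,
                    type  TEXT,
                    count INTEGER NOT NULL DEFAULT 0,
                    UNIQUE (prop, type))""")

def bump(prop, file_type):
    # One self-contained transaction per property: insert the row if it is
    # new, otherwise increment the existing counter.
    with conn:
        conn.execute("""INSERT INTO tally_board (prop, type, count)
                        VALUES (?, ?, 1)
                        ON CONFLICT (prop, type)
                        DO UPDATE SET count = count + 1""", (prop, file_type))

bump("prop_foo", "typeA")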
The following sped things up immensely:
Wrote less often to SQLite. Holding most of my intermediate results in memory then updating the DB with them every 50k files resulted in about a third of the execution time (35 hours to 11.5 hours)
Moving data onto my PC (for some reason my USB3.0 port was transferring data well below USB2.0 rates). This resulted in about a fifth of the execution time (11.5 hours to 2.5 hours).

INSERT IGNORE vs IN list()

I have about 20,000 operations I have to do. I need to make sure the 'name' that I have is in the database. Which of the following patterns would be more efficient and why?
(1) in list()
cursor.execute('select * from names')
existing_names = [item[0] for item in cursor.fetchall()]  # len = 2,000
for item in items:
    if item.name not in existing_names:
        cursor.execute('INSERT INTO names VALUES (%s)', (item.name,))
(2) INSERT IGNORE
for item in items:
    cursor.execute('INSERT IGNORE INTO names VALUES (%s)', (item.name,))
The obvious answer here is: test, don't guess.
But I'm pretty sure I can guess, because you've got an algorithmic complexity problem here.
Checking in against a list requires scanning the whole list and comparing every entry. If you do that for 20000 items vs. 2000 list entries, that's 40000000 comparisons. Unless you're skipping almost all 20000 of the SQL statements by doing so, it's almost certainly a pessimization.
However, with one slight change, it might be a useful optimization:
Checking in against a set is near-instant. If you do that for 20000 items vs. 2000 set entries, that's 20000 hashes and lookups. That could easily be worth saving even just a few thousand SQL queries. If you're on Python 2.7 or later, that's just a matter of existing_names = { … } instead of [ … ].
In case you're wondering, inside the database (assuming you have an index on the column), it's using a tree structure, so each look up takes logarithmic time. Even for a binary tree (which is over-estimating the real cost), that's under 11 comparisons for each lookup, which isn't quite as good as 1, but it's a lot better than 2000. (Plus, of course, that search is going to be optimized, because it's one of the core things that databases have to do well.)
And finally, at least with some database libraries, you can get a much bigger speedup by batching the inserts—maybe using executemany, or maybe preparing and loading bulk SQL—so you may be optimizing the wrong place anyway.
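A sketch combining the two suggestions (a set for the membership test, executemany for batching); it assumes the names table has a single column called name, which is a guess based on the question's code:
cursor.execute('SELECT name FROM names')
existing_names = {row[0] for row in cursor.fetchall()}   # set: near-constant-time lookups

# Keep only the genuinely new names, then insert them in one batch.
new_rows = [(item.name,) for item in items if item.name not in existing_names]
cursor.executemany('INSERT INTO names VALUES (%s)', new_rows)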
I would use method 2. However, if you do not have a unique index on names, the second method would definitely not ensure that your names are unique.
If you need more info on creating a unique index, you can find it here.
Your first method would appear to be less efficient than the second, because you first have to fetch the list of existing names and then test each item against it in the loop.
In the second method, maintaining a unique index may add some overhead, but it will probably still be more efficient than doing the check outside the DB. Additionally, with the second method you skip the extra round trip to fetch the existing names.
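For completeness, a unique index could be added from Python with something like this (MySQL syntax; the column name name is assumed from the question, and uniq_name is an arbitrary index name):
# A unique index makes duplicate names impossible, so INSERT IGNORE can
# silently skip rows that already exist.
cursor.execute('ALTER TABLE names ADD UNIQUE INDEX uniq_name (name)')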

Use SQL or python to check set membership?

How would the following compare in performance to see whether the ID I have is in a set?
# python list
list_of_ids = [1, 2, 3, ...]
if id in list_of_ids:
    # ok

# python set
set_of_ids = set([1, 2, 3, ...])
if id in set_of_ids:
    # ok

# python dict
dict_of_ids = {1: ..., 2: ..., 3: ...}
if id in dict_of_ids:
    # ok

# SQL
cursor.execute('SELECT * FROM mytable WHERE id = %s', id)
if cursor.fetchone():
    # ok

// in C
# [not written]
How would these compare?
Algorithmically speaking, the first approach takes linear time and space, O(n).
The second and third approaches use a hash table and run in roughly constant time per lookup, faster than O(log(n)).
The SQL approach uses a B-tree; if there is an index on that field, its time complexity is O(log(n)).
If you use C, you save some constant overhead because you skip much of the interpreter's bookkeeping.
Conclusion:
The first approach costs O(n) time and memory; better not to use it.
The second and third are fast enough, but if the data is too large to keep in memory they become a problem.
The SQL approach adds the cost of communicating with the database, so it has extra overhead, but if the data is large I think the SQL way is the more reasonable one.
The C approach is extremely fast, if you REALLY need it (and the algorithm it uses must of course be efficient), but it would make the code ugly anyway.
Hope it helps.
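If you want concrete numbers for the pure-Python options, a tiny timeit comparison along these lines works; the sizes and values are arbitrary placeholders:
import timeit

setup = """
ids = range(100000)
list_of_ids = list(ids)
set_of_ids = set(ids)
probe = 99999          # worst case for the list: the last element
"""

# Linear scan vs. hash lookup; the set should win by orders of magnitude.
print(timeit.timeit('probe in list_of_ids', setup=setup, number=100))
print(timeit.timeit('probe in set_of_ids', setup=setup, number=100))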

Python MySQLdb SScursor slow in comparison to exporting and importing from CSV-file. Speedup possible?

As part of building a Data Warehouse, I have to query a source database table for about 75M rows.
What I want to do with the 75M rows is some processing and then adding the result into another database. Now, this is quite a lot of data, and I've had success with mainly two approaches:
1) Exporting the query to a CSV file using the "SELECT ... INTO" capabilities of MySQL and using the fileinput module of python to read it, and
2) connecting to the MySQL database using MySQLdb's SScursor (the default cursor loads the whole result set into memory, killing the Python script) and fetching the results in chunks of about 10k rows (which is the chunk size I've found to be the fastest).
The first approach is a SQL query executed "by hand" (takes about 6 minutes) followed by a python script reading the csv-file and processing it. The reason I use fileinput to read the file is that fileinput doesn't load the whole file into the memory from the beginning, and works well with larger files. Just traversing the file (reading every line in the file and calling pass) takes about 80 seconds, that is 1M rows/s.
The second approach is a python script executing the same query (also takes about 6 minutes, or slightly longer) and then a while-loop fetching chunks of rows for as long as there is any left in the SScursor. Here, just reading the lines (fetching one chunk after another and not doing anything else) takes about 15 minutes, or approximately 85k rows/s.
The two numbers (rows/s) above are perhaps not really comparable, but when benchmarking the two approaches in my application, the first one takes about 20 minutes (of which about five is MySQL dumping into a CSV file), and the second one takes about 35 minutes (of which about five minutes is the query being executed). This means that dumping and reading to/from a CSV file is about twice as fast as using an SScursor directly.
This would be no problem if it did not restrict the portability of my system: a "SELECT ... INTO" statement requires MySQL to have writing privileges, and I suspect that it is not as safe as using cursors. On the other hand, 15 minutes (and growing, as the source database grows) is not really something I can spare on every build.
So, am I missing something? Is there any known reason for SScursor to be so much slower than dumping/reading to/from a CSV file, such that fileinput is C-optimized where SScursor is not? Any ideas on how to proceed with this problem? Anything to test? I would believe that SScursor could be as fast as the first approach, but after reading all I can find about the matter, I'm stumped.
Now, to the code:
Not that I think the query is of any problem (it's as fast as I can ask for and takes similar time in both approaches), but here it is for the sake of completeness:
SELECT LT.SomeID, LT.weekID, W.monday, GREATEST(LT.attr1, LT.attr2)
FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
ORDER BY LT.someID ASC, LT.weekID ASC;
The primary code in the first approach is something like this
import fileinput

INPUT_PATH = 'path/to/csv/dump/dump.csv'
event_list = []
ID = -1

for line in fileinput.input([INPUT_PATH]):
    split_line = line.split(';')
    if split_line[0] == ID:
        event_list.append(split_line[1:])
    else:
        process_function(ID, event_list)
        event_list = [split_line[1:]]
        ID = split_line[0]

process_function(ID, event_list)
The primary code in the second approach is:
import MySQLdb

# ...opening connection, defining SScursor called ssc...

CHUNK_SIZE = 100000

query_stmt = """SELECT LT.SomeID, LT.weekID, W.monday,
                       GREATEST(LT.attr1, LT.attr2)
                FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
                ORDER BY LT.someID ASC, LT.weekID ASC"""
ssc.execute(query_stmt)

event_list = []
ID = -1

data_chunk = ssc.fetchmany(CHUNK_SIZE)
while data_chunk:
    for row in data_chunk:
        if row[0] == ID:
            event_list.append([row[1], row[2], row[3]])
        else:
            process_function(ID, event_list)
            event_list = [[row[1], row[2], row[3]]]
            ID = row[0]
    data_chunk = ssc.fetchmany(CHUNK_SIZE)

process_function(ID, event_list)
At last, I'm on Ubuntu 13.04 with MySQL server 5.5.31. I use Python 2.7.4 with MySQLdb 1.2.3. Thank you for staying with me this long!
After using cProfile I found a lot of time being spent implicitly constructing Decimal objects, since that was the numeric type returned from the SQL query into my Python script. In the first approach, the Decimal value was written to the CSV file as an integer and then read as such by the Python script. The CSV file I/O "flattened" the data, making the script faster. The two scripts are now about the same speed (the second approach is still just a tad slower).
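For anyone wanting to reproduce that kind of diagnosis, a minimal cProfile sketch; run_second_approach is a placeholder name standing in for the SScursor loop above:
import cProfile
import pstats

# Profile the whole fetch-and-process run and list the biggest time sinks;
# entries involving Decimal construction point at the conversion overhead.
cProfile.run('run_second_approach()', 'sscursor.prof')
pstats.Stats('sscursor.prof').sort_stats('cumulative').print_stats(20)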
I also did some conversion of the date in the MySQL database to integer type. My query is now:
SELECT LT.SomeID,
       LT.weekID,
       CAST(DATE_FORMAT(W.monday, '%Y%m%d') AS UNSIGNED),
       CAST(GREATEST(LT.attr1, LT.attr2) AS UNSIGNED)
FROM LargeTable LT JOIN Week W ON LT.weekID = W.ID
ORDER BY LT.someID ASC, LT.weekID ASC;
This almost eliminates the difference in processing time between the two approaches.
The lesson here is that when doing large queries, post-processing of data types DOES MATTER! Rewriting the query to reduce function calls on the Python side can improve the overall processing speed significantly.
