Use SQL or Python to check set membership?

How would the following approaches compare in performance when checking whether the ID I have is in a set?
# python list
list_of_ids = [1, 2, 3, ...]
if id in list_of_ids:
    # ok

# python set
set_of_ids = set([1, 2, 3, ...])
if id in set_of_ids:
    # ok

# python dict
dict_of_ids = {1: None, 2: None, 3: None, ...}
if id in dict_of_ids:
    # ok

# SQL
cursor.execute('SELECT * FROM mytable WHERE id = %s', (id,))
if cursor.fetchone():
    # ok

# in C
# [not written]
How would these compare?

Algorithmically speaking, the first approach takes linear time and space, O(n).
The second and third approaches use a hash table, so membership checks run in O(1) on average, which is faster than O(log n).
The SQL approach uses a B-tree: if there is an index on that field, its time complexity is O(log n).
If you use C, you save some constant-factor time because you skip the interpreter overhead, but the asymptotic complexity stays the same.
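To see the difference in practice, here is a minimal timing sketch (the container size and looked-up value are made up for illustration):
import timeit

ids = list(range(100000))
list_of_ids = ids
set_of_ids = set(ids)
dict_of_ids = dict.fromkeys(ids)

lookup = 99999  # worst case for the list: the last element

# time 1,000 membership checks against each container
for name, container in (('list', list_of_ids),
                        ('set', set_of_ids),
                        ('dict', dict_of_ids)):
    t = timeit.timeit(lambda: lookup in container, number=1000)
    print(name, t)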
Conclusion:
The first approach costs O(n) time and memory, so it is better not to use it.
The second and third are fast enough, but if the data is too large to fit in memory they become impractical.
The SQL approach adds network round-trip cost on top of the lookup, but if the data is large it is the more reasonable choice.
The C approach is extremely fast if you REALLY need it (provided the algorithm it uses is efficient), but it will make the code uglier.
Hope it helps.


Django queryset: how to change the returned data structure

This problem relates to a gaming arcade parlor where people go in and play a game. Each time a person plays, a new entry is created in the database.
My model is like this:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)
My view is like this:
today = datetime.now().date()
# query the db for the gaming_machine objects created today whose score is 192 or 100,
# and count those objects separately per (machine_no, score) combination
gaming_machine.objects.filter(Q(points=100) | Q(points=192), created__startswith=today).values_list('machine_no', 'points').annotate(Count('machine_no'))
# this returns a queryset of tuples -> (machine_no, points, count)
<QuerySet [(330, 192, 2), (330, 100, 4), (331, 192, 7), (331, 192, 8)]>
Can I change the returned queryset format to something like this?
{(330, 192): 2, (330, 100): 4, (331, 192): 7, (331, 192): 8} # a dictionary whose keys are (machine_no, score) tuples and whose values are the counts of such machine_nos
I am aware that I can change the format of this queryset on the Python side using something like a dictionary comprehension, but I can't do that, as it takes around 1.4 seconds of time to do that, because Django querysets are lazy.
Django's lazy queries...
but I can't do that, as it takes around 1.4 seconds of time to do that, because Django querysets are lazy.
The laziness of Django's querysets actually has close to no impact on performance here. They are lazy in the sense that they postpone querying the database until you need the result (for example, when you start iterating over it). But at that point they fetch all the rows at once, so there is no overhead of fetching row after row: everything is retrieved in one go, and then Python iterates over it quite fast.
The laziness is thus not on a row-by-row basis: the queryset does not advance a cursor each time you want the next row. Communication with the database is therefore quite limited.
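To illustrate, a small sketch (no query is issued until the queryset is actually evaluated):
qs = gaming_machine.objects.filter(points=100)  # lazy: no database access yet
rows = list(qs)                                 # one query; all matching rows fetched at once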
... and why it does not matter (performance-wise)
Unless the number of rows is huge (50,000 or more), the conversion to a dictionary should also happen rather fast. So I suspect the overhead is mostly in the query itself. Django does have to "deserialize" the elements (turn the response into tuples), so there is some extra overhead, but it is usually small compared to the work that is already done without the dictionary comprehension. Typically one pushes work into the query when doing so reduces the amount of data transferred to Python.
For example, by performing the count in the database, the database returns one integer per group instead of several rows, and by filtering we reduce the number of rows as well (since typically not all rows match a given criterion). Furthermore, the database has fast lookup mechanisms that speed up WHEREs, GROUP BYs, ORDER BYs, etc. But post-processing the result stream into a different kind of object is not something the database would do noticeably faster; it takes about the same amount of time either way.
So the dictionary comprehension should do:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        Q(points=100) | Q(points=192), created__startswith=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}
Speeding up queries
Since the bottleneck is probably in the database, you may want to consider some options for speeding it up.
Building indexes
Typically the best way to boost query performance is to build an index on columns that you filter on frequently and that have a large number of distinct values.
In that case the database constructs a data structure that stores, for every value of that column, the list of rows that match it. As a result, instead of reading through all the rows and selecting the relevant ones, the database can go straight to that data structure and find the matching rows in reasonable time.
Note that this typically only helps if the column contains a large number of distinct values: if, for example, the column contains only two values (0 in 1% of the rows and 1 in the other 99%) and we filter on the very common value, this will not produce much speedup, since the set of rows we still need to process has approximately the same size.
So depending on how distinct the values are, we can add indexes to the points and created fields:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField(db_index=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
Improve the query
Secondly, we can also aim to improve the query itself, although this is more database dependent (if we have two queries q1 and q2, it is possible that q1 runs faster than q2 on a MySQL database while q2 runs faster than q1 on a PostgreSQL database). So this is quite tricky: there are of course some things that typically help in general, but it is hard to give guarantees.
For example, sometimes x IN (100, 192) works faster than x = 100 OR x = 192 (see here). Furthermore, you use __startswith here, which may perform well depending on how the database stores timestamps, but it can result in a computationally expensive query if the database first has to convert the datetime. In any case, it is more declarative to use created__date, since it makes clear that you want the date part of created to equal today, so a more efficient query is probably:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        points__in=[100, 192], created__date=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}

INSERT IGNORE vs IN list()

I have about 20,000 operations I have to do. I need to make sure the 'name' that I have is in the database. Which of the following patterns would be more efficient and why?
(1) in list()
cursor.execute('select * from names')
existing_names = [item[0] for item in cursor.fetchall()]  # len = 2,000
for item in items:
    if item.name not in existing_names:
        cursor.execute('INSERT INTO names VALUES (%s)', (item.name,))
(2) INSERT IGNORE
for item in items:
    cursor.execute('INSERT IGNORE INTO names VALUES (%s)', (item.name,))
The obvious answer here is: test, don't guess.
But I'm pretty sure I can guess, because you've got an algorithmic complexity problem here.
Checking in against a list requires scanning the whole list and comparing every entry. If you do that for 20000 items vs. 2000 list entries, that's 40000000 comparisons. Unless you're skipping almost all 20000 of the SQL statements by doing so, it's almost certainly a pessimization.
However, with one slight change, it might be a useful optimization:
Checking in against a set is near-instant. If you do that for 20000 items vs. 2000 set entries, that's 20000 hashes and lookups. That could easily be worth saving even just a few thousand SQL queries. If you're on Python 2.7 or later, that's just a matter of existing_names = { … } instead of [ … ].
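Concretely, that change to the question's first snippet would look something like this (a sketch; the table layout is assumed from the question):
cursor.execute('select * from names')
existing_names = {item[0] for item in cursor.fetchall()}  # a set comprehension instead of a list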
In case you're wondering: inside the database (assuming you have an index on the column), it's using a tree structure, so each lookup takes logarithmic time. Even for a binary tree (which over-estimates the real cost), that's under 11 comparisons per lookup, which isn't quite as good as 1, but it's a lot better than 2000. (Plus, of course, that search is going to be optimized, because it's one of the core things that databases have to do well.)
And finally, at least with some database libraries, you can get a much bigger speedup by batching the inserts—maybe using executemany, or maybe preparing and loading bulk SQL—so you may be optimizing the wrong place anyway.
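A hedged sketch of the batched variant, assuming a DB-API driver such as MySQLdb or pymysql and the same items and existing_names as above:
# insert only the missing names, in one batched call instead of one statement each
missing = [(item.name,) for item in items if item.name not in existing_names]
if missing:
    cursor.executemany('INSERT INTO names VALUES (%s)', missing)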
I would use method 2. However, if you do not have a unique index on names, your second method would definitely not ensure that your names are unique.
If you need more info on creating a unique index, you can find it here.
Your first method would appear to be less efficient than the second, because you first have to fetch the list of existing names and then test each item against it in the loop.
In the second method, maintaining a unique index may add some overhead, but it will probably still be more efficient than doing the processing outside the DB. Additionally, with the second method you only hit the DB for the inserts, instead of fetching every existing name first.
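For reference, a unique index for this case might be created like so (a sketch; MySQL syntax, with the single column assumed to be called name):
cursor.execute('ALTER TABLE names ADD UNIQUE INDEX uniq_name (name)')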

Querying relational data in a reasonable amount of time

I have a spreadsheet with about 1.7m lines, totalling 1 GB, and need to perform queries on it.
Being most comfortable with Python, my first approach was to hack together a bunch of dictionaries keyed in a way that would facilitate the queries I was trying to make. E.g. if I needed to access everyone with a particular area code and age, I would make an areacode_age two-dimensional dictionary. I ended up needing quite a few of them, which multiplied the memory footprint to ~10 GB, and even though I had enough RAM the process was slow.
I imported sqlite3 and loaded my data into an in-memory database. It turns out a query like SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f takes 0.05 seconds. Doing this once for each of my 1.7m rows would take 24 hours. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn't key on date1 and date2, so I was getting every row that matched name and then filtering by date).
Why is this so slow, and how can I make it fast? And what is the Pythonic approach? Possibilities I've been considering:
sqlite3 is too slow and I need something more heavyweight.
I need to change my schema or my queries to be more optimized.
I need a new tool of some kind.
I read somewhere that in sqlite3, repeated calls to cursor.execute are much slower than using cursor.executemany. It turns out executemany doesn't work with SELECT statements though, so I think this was a red herring.
sqlite3 is too slow, and I need something more heavyweight
First, sqlite3 is fast, sometimes faster than MySQL.
Second, you have to use an index: putting a compound index on (date1, date2, name) will speed things up significantly, as sketched below.
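A minimal sketch of adding that index (the table and column names are taken from the question's example query):
conn.execute('CREATE INDEX idx_foo_dates_name ON foo (date1, date2, name)')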
It turns out though, that doing a query like "SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f" takes 0.05 seconds. Doing this for my 1.7m rows would take 24 hours of compute time. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn't key on date1 and date2 obviously, so I was getting every row that matched name and then filtering by date).
Did you actually try this and observe that it was taking 24 hours? Processing time is not necessarily directly proportional to data size.
And are you suggesting that you might need to run SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f 1.7 million times? You only need to run it once, and it will return the entire subset of rows matching your query.
1.7 million rows is not small, but certainly not an issue for a database entirely in memory on your local computer. (No slow disk access; no slow network access.)
The proof is in the pudding. This is pretty fast for me (most of the time is spent generating ~10 million random floats):
import sqlite3, random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (a FLOAT, b FLOAT, c FLOAT, d FLOAT, e FLOAT, f FLOAT)")
for _ in xrange(1700000):
    data = [random.random() for _ in xrange(6)]
    conn.execute("INSERT INTO numbers VALUES (?,?,?,?,?,?)", data)
conn.commit()
print "done generating random numbers"

results = conn.execute("SELECT * FROM numbers WHERE a > 0.5 AND b < 0.5")
accumulator = 0
for row in results:
    accumulator += row[0]
print ("Sum of column `a` where a > 0.5 and b < 0.5 is %f" % accumulator)
Edit: Okay, so you really do need to run this 1.7 million times.
In that case, what you probably want is an index. To quote Wikipedia:Database Index:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
You would do something like CREATE INDEX dates_and_name ON foo(date1,date2,name) and then (I believe) execute the rest of your SELECT statements as usual. Try this and see if it speeds things up.
Since you are already talking SQL, the easiest approach will be:
Put all your data into a MySQL table. It should perform well for 1.7 million rows.
Add the indexes you need, check the settings, and make sure it will run fast on a big table.
Access it from Python (see the sketch after this list).
...
Profit!
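A minimal sketch of the Python access step (the pymysql driver, the credentials, and the variables d, e, f from the question's query are assumptions for illustration):
import pymysql  # any DB-API driver (MySQLdb, mysql-connector, ...) works similarly

conn = pymysql.connect(host='localhost', user='me', password='secret', db='mydb')
cur = conn.cursor()
cur.execute(
    "SELECT a, b, c FROM foo WHERE date1 <= %s AND date2 > %s AND name = %s",
    (d, e, f),
)
rows = cur.fetchall()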

Efficient way of phrasing multiple tuple pair WHERE conditions in SQL statement

I want to perform an SQL query that is logically equivalent to the following:
DELETE FROM pond_pairs
WHERE
((pond1 = 12) AND (pond2 = 233)) OR
((pond1 = 12) AND (pond2 = 234)) OR
((pond1 = 12) AND (pond2 = 8)) OR
((pond1 = 13) AND (pond2 = 6547)) OR
((pond1 = 13879) AND (pond2 = 6))
I will have hundreds of thousands of pond1-pond2 pairs. I have an index on (pond1, pond2).
My limited SQL knowledge came up with several approaches:
Run the whole query as is.
Batch the query up into smaller queries with n WHERE conditions each.
Save the pond1-pond2 pairs into a new table, and use a subquery in the WHERE clause to identify the rows to delete.
Convert the Python logic that identifies rows to delete into a stored procedure. Note that I am unfamiliar with programming stored procedures, so this would probably involve a steep learning curve.
I am using Postgres, if that is relevant.
For a large number of pond1-pond2 pairs to be deleted in a single DELETE, I would create a temporary table and join on it.
-- Create the temp table:
CREATE TEMP TABLE foo AS SELECT * FROM (VALUES (1, 2), (1, 3)) AS sub (pond1, pond2);

-- Delete
DELETE FROM bar
USING foo              -- the joined table
WHERE bar.pond1 = foo.pond1
  AND bar.pond2 = foo.pond2;
I would do 3 (with a JOIN rather than a subquery) and measure the time of the DELETE query alone (without creating the table and inserting). This is a good starting point, because JOINing is a very common and well-optimized operation, so it will be hard to beat that time. Then you can compare that time to your current approach.
You can also try the following approach:
Sort the pairs in the same order as the index.
Delete using method 2 from your description (probably in a single transaction), as sketched below.
Sorting before deleting will give better index-read performance, because there is a greater chance for the disk cache to be effective.
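A sketch of that idea, assuming pairs is a Python list of (pond1, pond2) tuples and conn is a DB-API connection such as psycopg2:
pairs.sort()  # same order as the (pond1, pond2) index

cur = conn.cursor()
batch_size = 1000
for i in range(0, len(pairs), batch_size):
    cur.executemany(
        'DELETE FROM pond_pairs WHERE pond1 = %s AND pond2 = %s',
        pairs[i:i + batch_size],
    )
conn.commit()  # one transaction for the whole run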
With hundreds of thousands of pairs, you cannot do 1 (run the query as is), because the SQL statement would be too long.
3 is good if you already have the pairs in a table. If not, you would need to insert them first. If you do not need them later, you might just as well run the same number of DELETE statements instead of INSERT statements.
How about a prepared statement in a loop, maybe batched (if Python supports that)? See the sketch after this list.
begin transaction
prepare statement "DELETE FROM pond_pairs WHERE ((pond1 = ?) AND (pond2 = ?))"
loop over your data (in Python), and run the statement with one pair (or add it to a batch)
commit
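A rough sketch of that loop using psycopg2, with the whole loop in one transaction (the driver, connection string, and the pairs variable are assumptions for illustration):
import psycopg2

conn = psycopg2.connect('dbname=mydb')
with conn:                          # commits once at the end, rolls back on error
    with conn.cursor() as cur:
        for pond1, pond2 in pairs:  # pairs: the pond1-pond2 pairs from the question
            cur.execute(
                'DELETE FROM pond_pairs WHERE pond1 = %s AND pond2 = %s',
                (pond1, pond2),
            )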
Where are the pairs coming from? If you can write a SELECT statement to identify them, you can just move that condition into the WHERE clause of your delete.
DELETE FROM pond_pairs WHERE (pond1, pond2) IN (SELECT pond1, pond2 FROM ...... )

Batch select with SQLAlchemy

I have a large set of values V, some of which are likely to exist in a table T. I would like to insert into the table those which are not yet inserted. So far I have the code:
for value in values:
    s = self.conn.execute(mytable.__table__.select(mytable.value == value)).first()
    if not s:
        to_insert.append(value)
I feel like this is running slower than it should. I have a few related questions:
Is there a way to construct a select statement such that you provide a list (in this case, 'values') to which sqlalchemy responds with records which match that list?
Is this code overly expensive in constructing select objects? Is there a way to construct a single select statement, then parameterize at execution time?
For the first question, something like this, if I understand you correctly:
mytable.__table__.select(mytable.value.in_(values))
For the second question, querying one row at a time is indeed overly expensive, although you might not have a choice in the matter. As far as I know there is no tuple-select support in SQLAlchemy, so if there are multiple variables (think polymorphic keys) then SQLAlchemy can't help you.
Either way, if you select all matching rows and insert the difference you should be done :)
Something like this should work:
results = self.conn.execute(mytable.__table__.select(mytable.value.in_(values)))
available_values = set(row.value for row in results)
to_insert = set(values) - available_values
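To complete the pattern, the remaining values could then be inserted in one batch (a sketch; it assumes the column is named value, as in the question):
if to_insert:
    self.conn.execute(
        mytable.__table__.insert(),
        [{'value': v} for v in to_insert],
    )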
