INSERT IGNORE vs IN list() - python

I have about 20,000 operations I have to do. I need to make sure the 'name' that I have is in the database. Which of the following patterns would be more efficient and why?
(1) in list()
cursor.execute('select * from names')
existing_names = [item[0] for item in cursor.fetchall()]  # len = 2,000
for item in items:
    if item.name not in existing_names:
        cursor.execute('INSERT INTO names VALUES (%s)', (item.name,))
(2) INSERT IGNORE
for item in items:
    cursor.execute('INSERT IGNORE INTO names VALUES (%s)', (item.name,))

The obvious answer here is: test, don't guess.
But I'm pretty sure I can guess, because you've got an algorithmic complexity problem here.
Checking in against a list requires scanning the whole list and comparing every entry. Doing that for 20,000 items against 2,000 list entries is up to 40,000,000 comparisons. Unless it lets you skip almost all 20,000 of the SQL statements, it's almost certainly a pessimization.
However, with one slight change, it might be a useful optimization:
Checking in against a set is near-instant. Doing that for 20,000 items against 2,000 set entries is just 20,000 hashes and lookups, which could easily be worth it even if it only saves a few thousand SQL queries. If you're on Python 2.7 or later, that's just a matter of writing existing_names = { … } instead of [ … ] (a set comprehension instead of a list comprehension).
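For instance, a rough sketch of the set-based variant, reusing the names from the question (untested):

cursor.execute('select * from names')
existing_names = {item[0] for item in cursor.fetchall()}  # a set, not a list

for item in items:
    if item.name not in existing_names:
        cursor.execute('INSERT INTO names VALUES (%s)', (item.name,))
        existing_names.add(item.name)  # keep the in-memory set in sync with the table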
In case you're wondering, inside the database (assuming you have an index on the column), it's using a tree structure, so each lookup takes logarithmic time. Even for a binary tree (which over-estimates the real cost), that's under 11 comparisons per lookup, which isn't quite as good as 1, but it's a lot better than 2,000. (Plus, of course, that search is going to be well optimized, because it's one of the core things databases have to do well.)
And finally, at least with some database libraries, you can get a much bigger speedup by batching the inserts (for example with executemany, or by preparing and loading bulk SQL), so you may be optimizing the wrong place anyway.
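As a hedged sketch, batching with the DB-API executemany might look like this (the actual gain depends on the driver):

new_rows = [(item.name,) for item in items if item.name not in existing_names]
cursor.executemany('INSERT IGNORE INTO names VALUES (%s)', new_rows)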

I would use method 2. However, if you do not have a unique index on the name column, your second method will definitely not ensure that your names are unique (INSERT IGNORE only skips rows that would violate such a constraint).
If you need more info on creating a unique index, you can find it here.
Your first method appears less efficient than the second because you first have to fetch the list of existing names and then test each new name against that list inside the loop.
In the second method, maintaining a unique index adds some overhead, but it will probably still be more efficient than doing the processing outside the DB. Additionally, with the second method you only hit the DB once per name, with no up-front SELECT.
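For reference, a minimal sketch of adding such a constraint in MySQL (idx_name is an arbitrary index name; adjust the table and column names to your schema):

cursor.execute('ALTER TABLE names ADD UNIQUE INDEX idx_name (name)')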

Related

Django queryset : how to change the returned datastructure

This problem is related to a gaming arcade parlor where people go in the parlor and play a game. As a person plays, there is a new entry created in the database.
My model is like this:
from django.db import models

class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)
My view is like this:
from datetime import datetime
from django.db.models import Count, Q

today = datetime.now().date()
# query the DB for today's gaming_machine objects where the score is 192 or 100,
# counted separately per (machine_no, score) combination
gaming_machine.objects.filter(Q(points=100) | Q(points=192), created__startswith=today).values_list('machine_no', 'points').annotate(Count('machine_no'))
# this returns a queryset of tuples -> (machine_no, points, count)
<QuerySet [(330, 192, 2), (330, 100, 4), (331, 192, 7), (331, 192, 8)]>
Can I change the returned queryset format to something like this:
{(330, 192): 2, (330, 100): 4, (331, 192): 7, (331, 192): 8} # that is, a dictionary whose keys are (machine_no, score) tuples and whose values are the counts of such machine_nos
I am aware that I can change the format of this queryset on the Python side using something like a dictionary comprehension, but I can't do that, as it takes around 1.4 seconds, because Django querysets are lazy.
Django's lazy queries...
but I can't do that, as it takes around 1.4 seconds, because Django querysets are lazy.
The laziness of Django's querysets actually has (close to) no impact on performance here. They are lazy in the sense that they postpone querying the database until you need the result (for example, when you start iterating over it), but at that point they fetch all the rows at once. So there is no per-row overhead of repeatedly fetching the next row: all rows are fetched in one go, and Python then iterates over them quite fast.
The laziness is thus not on a row-by-row basis: Django does not advance a cursor each time you want the next row. The communication with the database is therefore quite limited.
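A small illustration of that laziness, using the filter from the question (illustrative only):

qs = gaming_machine.objects.filter(Q(points=100) | Q(points=192),
                                   created__startswith=today)  # no query sent yet
rows = list(qs)  # the single SQL query runs here and fetches all matching rows at once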
... and why it does not matter (performance-wise)
Unless the number of rows is huge (50,000 or more), the conversion to a dictionary should also happen rather fast, so I suspect the overhead is in the query itself. Django also has to "deserialize" the response into tuples anyway, so the extra cost of the dictionary comprehension is usually small compared to the work that is already being done without it. Typically you push work into the query when doing so reduces the amount of data transferred to Python.
For example, by performing the count in the database, the database returns one integer per group instead of several rows, and by filtering we reduce the number of rows as well (since typically not all rows match the criteria). Furthermore, the database has fast lookup mechanisms that speed up WHEREs, GROUP BYs, ORDER BYs, etc. But merely reshaping the result stream into a different Python object is not something the database would do in significantly less time.
So the dictionary comprehension should do:
{
d[:2]: d[3]
for d in gaming_machine.objects.filter(
Q(points=100) | Q(points=192),created__startswith=today
).values_list(
'machine_no','points'
).annotate(
Count('machine_no')
)
}
Speeding up queries
Since the problem is probably located at the database, you probably want to consider some possibilities for speedup.
Building indexes
Typically the best way to boost query performance is to construct an index on columns that you filter on frequently and that have a large number of distinct values.
The database then builds a data structure that stores, for every value of such a column, the list of rows that match that value. As a result, instead of reading through all the rows and selecting the relevant ones, the database can go straight to that data structure and find out, in reasonable time, which rows have that value.
Note that this typically only helps if the column contains a large number of distinct values: if, for example, the column contains only two values (1% of the rows have 0 and 99% have 1) and we filter on the very common value, the index will not produce much speedup, since the set of rows we still need to process has approximately the same size.
So, depending on how distinct the values are, we can add indexes to the score and created fields:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField(db_index=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
Improve the query
Secondly, we can also aim to improve the query itself, although this is more database dependent (given two queries q1 and q2, q1 may be faster on a MySQL database while q2 is faster on a PostgreSQL database). So this is tricky: some things typically help in general, but it is hard to give guarantees.
For example, sometimes x IN (100, 192) works faster than x = 100 OR x = 192 (see here). Furthermore, you use __startswith here, which might perform well depending on how the database stores timestamps, but it can result in a computationally expensive query if the datetime has to be converted first. In any case, created__date is more declarative, since it makes clear that you want the date part of created to equal today, so a more efficient query is probably:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        points__in=[100, 192], created__date=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}

Use SQL or python to check set membership?

How would the following approaches compare in performance for checking whether an ID I have is in a set?
# python list
list_of_ids = [1, 2, 3, ...]
if id in list_of_ids:
    # ok

# python set
set_of_ids = set([1, 2, 3, ...])
if id in set_of_ids:
    # ok

# python dict
dict_of_ids = {1: None, 2: None, 3: None, ...}
if id in dict_of_ids:
    # ok

# SQL
cursor.execute('SELECT * FROM mytable WHERE id = %s', (id,))
if cursor.fetchone():
    # ok

# in C
# [not written]
How would these compare?
Algorithmically speaking, the first approach takes linear time per lookup, and the list takes linear space: O(n).
The second and third approaches use a hash table, so a lookup takes O(1) time on average, which is even better than O(log(n)).
The SQL approach uses a B-tree: if there is an index on that field, its time complexity is O(log(n)).
If you use C, you save some constant-factor time because C skips a lot of the interpreter overhead.
Conclusion:
The first approach costs O(n) time per lookup and O(n) memory; better not to use it.
The second and third are fast enough, but if the data set is very large, holding it all in memory becomes a problem.
The SQL approach adds network/communication overhead per query, so it has other costs, but if the data is large, I think the SQL way is more reasonable.
The C approach is extremely fast if you REALLY need it (the algorithm it uses must be efficient, of course), but it will make the code uglier.
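If you want to measure the Python side yourself, here is a quick sketch with timeit (the absolute numbers are machine-dependent; only the ratio matters):

import timeit

setup = 'ids = list(range(100000)); id_set = set(ids)'
print(timeit.timeit('99999 in ids', setup=setup, number=1000))     # O(n) scan of the list
print(timeit.timeit('99999 in id_set', setup=setup, number=1000))  # O(1) average hash lookup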
Hope it helps.

Querying relational data in a reasonable amount of time

I have a spreadsheet with about 1.7m lines, totalling 1 GB, and need to perform queries on it.
Being most comfortable with Python, my first approach was to hack together a bunch of dictionaries keyed in a way that would facilitate the queries I was trying to make. E.g. if I needed to access everyone with a particular area code and age I would make an areacode_age 2-dimensional dictionary. I ended up needing quite a few which multiplied memory footprint to ~10GB, and even though I had enough RAM the process was slow.
I imported sqlite3 and imported my data into an in-memory database. Turns out doing a query like SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f takes 0.05 seconds. Doing this for my 1.7m rows would take 24 hours. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn't key on date1 and date2, so I was getting every row that matched name and then filtering by date).
Why is this so slow, and how can I make it fast? And what is the Pythonic approach? Possibilities I've been considering:
sqlite3 is too slow and I need something more heavyweight.
I need to change my schema or my queries to be more optimized.
I need a new tool of some kind.
I read somewhere that in sqlite 3, doing repeated calls to cursor.execute is much slower than using cursor.executemany. It turns out executemany isn't compatible with select statements though, so I think this was a red herring.
sqlite3 is too slow, and I need something more heavyweight
First, sqlite3 is fast, sometimes faster than MySQL.
Second, you have to use an index; putting a compound index on (date1, date2, name) will speed things up significantly.
It turns out though, that doing a query like "SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f" takes 0.05 seconds. Doing this for my 1.7m rows would take 24 hours of compute time. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn't key on date1 and date2 obviously, so I was getting every row that matched name and then filtering by date).
Did you actually try this and observe that it was taking 24 hours? Processing time is not necessarily directly proportional to data size.
And are you suggesting that you might need to run SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f 1.7 million times? You only need to run it once, and it will return the entire subset of rows matching your query.
1.7 million rows is not small, but certainly not an issue for a database entirely in memory on your local computer. (No slow disk access; no slow network access.)
Proof is in the pudding. This is pretty fast for me (most of the time is spent in generating ~ 10 million random floats.)
import sqlite3, random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (a FLOAT, b FLOAT, c FLOAT, d FLOAT, e FLOAT, f FLOAT)")
for _ in range(1700000):
    data = [random.random() for _ in range(6)]
    conn.execute("INSERT INTO numbers VALUES (?,?,?,?,?,?)", data)
conn.commit()
print("done generating random numbers")

results = conn.execute("SELECT * FROM numbers WHERE a > 0.5 AND b < 0.5")
accumulator = 0
for row in results:
    accumulator += row[0]
print("Sum of column `a` where a > 0.5 and b < 0.5 is %f" % accumulator)
Edit: Okay, so you really do need to run this 1.7 million times.
In that case, what you probably want is an index. To quote Wikipedia:Database Index:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
You would do something like CREATE INDEX dates_and_name ON foo(date1,date2,name) and then (I believe) execute the rest of your SELECT statements as usual. Try this and see if it speeds things up.
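For instance, a sketch of adding that index through sqlite3 and re-running the query (assuming conn is your sqlite3 connection, and d, e and f stand for the date and name values from your query):

conn.execute('CREATE INDEX IF NOT EXISTS dates_and_name ON foo (date1, date2, name)')
rows = conn.execute(
    'SELECT a, b, c FROM foo WHERE date1 <= ? AND date2 > ? AND name = ?',
    (d, e, f),
).fetchall()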
Since you are already talking SQL, the easiest approach will be:
Put all your data into a MySQL table. It should perform well for 1.7 million rows.
Add the indexes you need, check the settings, and make sure the queries run fast on a big table.
Access it from Python (a hypothetical sketch follows below).
...
Profit!
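A hypothetical sketch of that last step, assuming the pymysql driver and the table/column names from the question (d, e and f are placeholders for your query values):

import pymysql

conn = pymysql.connect(host='localhost', user='me', password='secret', db='mydb')
with conn.cursor() as cur:
    cur.execute(
        'SELECT a, b, c FROM foo WHERE date1 <= %s AND date2 > %s AND name = %s',
        (d, e, f),
    )
    rows = cur.fetchall()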

Which one is more efficient?

I have a Python program for deleting duplicates from a list of names.
But I'm in a dilemma, looking for the more efficient of the two approaches.
I have uploaded a list of names into a column of a table in a SQLite DB.
Is it better to compare the names and delete the duplicates inside the DB, or to load them into Python, delete the duplicates there, and push them back into the DB?
I'm confused, and here is a piece of code to do it in SQLite:
dup_killer (member_id, date) SELECT * FROM talks GROUP BY member_id,
If you use the names as a key in the database, the database will make sure they are not duplicated. So there would be no reason to ship the list to Python and de-dup there.
If you haven't inserted the names into the database yet, you might as well de-dup them in Python first. It is probably faster to do it in Python using the built-in features than to incur the overhead of repeated attempts to insert into the database.
(By the way: you can really speed up the insertion of many names if you wrap all the inserts in a single transaction. Start a transaction, insert all the names, and finish the transaction. The database does some work to make sure that the database is consistent, and it's much more efficient to do that work once for a whole list of names, rather than doing it once per name.)
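A minimal sqlite3 sketch of that pattern, using made-up sample data and a hypothetical names table:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE names (name TEXT PRIMARY KEY)')

list_of_all_names = ['alice', 'bob', 'alice']  # sample data
with conn:  # one transaction for all the inserts; commits on success, rolls back on error
    conn.executemany('INSERT OR IGNORE INTO names (name) VALUES (?)',
                     [(n,) for n in list_of_all_names])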
If you have the list in Python, you can de-dup it very quickly using built-in features. The two common features that are useful for de-duping are the set and the dict.
I have given you three examples. The simplest case is where you have a list that just contains names, and you want to get a list with just unique names; you can just put the list into a set. The second case is that your list contains records and you need to extract the name part to build the set. The third case shows how to build a dict that maps a name onto a record, then inserts the record into a database; like a set, a dict will only allow unique values to be used as keys. When the dict is built, it will keep the last value from the list with the same name.
# list already contains names
unique_names = set(list_of_all_names)
unique_list = list(unique_names)  # unique_list now contains only unique names

# extract the name field from each record and make a set
unique_names = set(x.name for x in list_of_all_records)
unique_list = list(unique_names)  # unique_list now contains only unique names

# make a dict mapping each name to a complete record
d = dict((x.name, x) for x in list_of_records)
# insert each complete record into the database, using the name as key
for name in d:
    insert_into_database(d[name])

Batch select with SQLAlchemy

I have a large set of values V, some of which are likely to exist in a table T. I would like to insert into the table those which are not yet inserted. So far I have the code:
for value in values:
    s = self.conn.execute(mytable.__table__.select(mytable.value == value)).first()
    if not s:
        to_insert.append(value)
I feel like this is running slower than it should. I have a few related questions:
Is there a way to construct a select statement such that you provide a list (in this case, values) and SQLAlchemy returns the records that match that list?
Is this code overly expensive in constructing select objects? Is there a way to construct a single select statement, then parameterize at execution time?
For the first question, something like this if I understand your question correctly
mytable.__table__.select(mytable.value.in_(values))
For the second question, querying this one row at a time is indeed overly expensive, although you might not have a choice in the matter. As far as I know there is no tuple select support in SQLAlchemy, so if there are multiple variables (think polymorphic keys) then SQLAlchemy can't help you.
Either way, if you select all matching rows and insert the difference you should be done :)
Something like this should work:
results = self.conn.execute(mytable.__table__.select(mytable.value.in_(values)))
available_values = set(row.value for row in results)
to_insert = set(values) - available_values
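As a hedged follow-up (assuming mytable has a single value column, as in the question), the missing rows can then be inserted in one executemany-style call:

if to_insert:
    self.conn.execute(
        mytable.__table__.insert(),
        [{'value': v} for v in to_insert],
    )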
