Batch select with SQLAlchemy - python

I have a large set of values V, some of which are likely to exist in a table T. I would like to insert into the table those which are not yet inserted. So far I have the code:
for value in values:
    s = self.conn.execute(mytable.__table__.select(mytable.value == value)).first()
    if not s:
        to_insert.append(value)
I feel like this is running slower than it should. I have a few related questions:
Is there a way to construct a select statement such that you provide a list (in this case, values) and SQLAlchemy responds with the records that match that list?
Is this code overly expensive in constructing select objects? Is there a way to construct a single select statement, then parameterize at execution time?

For the first question, something like this should work, if I understand your question correctly:
mytable.__table__.select(mytable.value.in_(values))
For the second question, querying this one row at a time is indeed overly expensive, although you might not have a choice in the matter. As far as I know there is no tuple select support in SQLAlchemy, so if there are multiple variables (think polymorphic keys) then SQLAlchemy can't help you.
Either way, if you select all matching rows and insert the difference you should be done :)
Something like this should work:
results = self.conn.execute(mytable.__table__.select(mytable.value.in_(values)))
available_values = set(row.value for row in results)
to_insert = set(values) - available_values
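Putting the two parts together, a minimal sketch of the whole pattern (select what already exists, diff in Python, bulk-insert the rest). Here conn stands for the question's self.conn, the select follows the same legacy table.select(whereclause) style used above, and the chunking is an assumption to keep the IN list at a manageable size:
CHUNK = 500  # arbitrary chunk size for the IN list

existing = set()
for i in range(0, len(values), CHUNK):
    chunk = values[i:i + CHUNK]
    rows = conn.execute(mytable.__table__.select(mytable.value.in_(chunk)))
    existing.update(row.value for row in rows)

to_insert = [v for v in values if v not in existing]
if to_insert:
    # executemany-style bulk insert of the missing values
    conn.execute(mytable.__table__.insert(), [{"value": v} for v in to_insert])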

Difference between different ways of updating a single row

I'm looking for the "standard" way of modifying a single row from a given table. (I want to modify several values within the row.)
I've found many ways and I struggle understanding the impact of this choice. This answer is an example of what I'm not looking for but also an example of the many ways of doing the same thing.
SQLAlchemy seems to be a well-thought-out tool, so I doubt the several ways exist purely as a design accident; there must be a benefit/cost to every solution. I'm looking for this information but I can't find it anywhere.
For the sake of the example, consider that I want to update the profile of a given user with ID user_id. I can, for example, write the following statements, which have (in appearance) the same effect:
profile = {'display_name': 'test', 'age': 34}

# 1)
user_profile = query.get(user_id)
for key, value in profile.items():
    setattr(user_profile, key, value)
database.session.add(user_profile)

# 2)
user_profile = query.filter_by(id=user_id)
user_profile.update({**profile})

# 3)
with database.engine.connect() as conn:
    stmt = (
        sqlalchemy.update(models.UserProfile)
        .values(**profile)
        .where(models.UserProfile.id == user_id)
    )
    conn.execute(stmt)
In general, what is the performance impact of the different ways of updating a single row using SQLAlchemy?

Python: Using an SQL query and a nested for loop to read and reinsert data into a database?

A bit of clarity: I've only used Python for two weeks as of now. I have an assignment at work to produce a script that will read a database, perform some calculations, then update the database with the calculated values.
I know how to perform an SQL query using sqlite3, and I know how to update a table.
Currently I'm using something like:
c.execute("SELECT x,y,z FROM {tn} "\
.format(tn=Table1, x=column1, y=column2, z= column3))
To select the information.
I then perform the calculations, and can update the table using a single update statement per variable. So essentially it's quite inefficient.
I know that there is code to easily insert values into a table, i.e:
for t in [(1, 'Tim', 8, 'Mr.Wood'),
          (2, 'Aaron', 8, 'Mr.Henderson'),
          (2, 'Jane', 10, 'Mr.Wood'),
          ]:
    c.execute('INSERT INTO Table_name VALUES (?,?,?,?)', t)
But what I'm having trouble finding is whether there is a way to combine that kind of for loop with an UPDATE statement, i.e. read the values from a database (the columns aren't necessarily next to each other), then perform an UPDATE query with a similar structure to the for loop above?
EDIT:
I am using dictionaries as a replacement for a switch/case statement.
I have the cases defined as functions and the switch function with the 'cases' defined.
For each case I will have different information to take from the database, and different calculations to perform. I can do this, and at the moment, I am using two-four UPDATE statements per function to update the database with the new calculated values. My question is: Is it possible to update a table with more than one value at one time using only one statement?
like this:
c.execute("UPDATE table SET 'value' WHERE columnName = {cn}
SET 'value' WHERE ColumnName2 = {cn2}
...
")
I'm not sure I fully understand, but I'll try to help you.
After you read the data from the DB, for each line of data, you perform a calculation and INSERT the data to the DB. Correct? If so, you should create a function, like so:
def update_db_based_on(some_info):
    calculated_data = some_calculation(some_info)
    insert_to_db(calculated_data)
Then, you could go like so:
for line in data_from_db:
    update_db_based_on(line)
Is that what you meant?
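If the goal from the edit is to push many calculated values back with a single call rather than one UPDATE per value, sqlite3's executemany can run one parameterized UPDATE for a whole batch. A minimal sketch, with hypothetical table and column names (results, id, col_a, col_b, calculated) and a placeholder calculation:
import sqlite3

con = sqlite3.connect('mydata.db')  # hypothetical database file
c = con.cursor()

# Read the rows, compute the new value, collect (new_value, key) pairs.
updates = []
for row_id, x, y in c.execute("SELECT id, col_a, col_b FROM results"):
    updates.append((x + y, row_id))  # placeholder calculation

# One parameterized UPDATE, executed for every pair in a single call.
c.executemany("UPDATE results SET calculated = ? WHERE id = ?", updates)
con.commit()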

What's the most efficient way to get this information from the database?

Related to this question:
Wildcards in column name for MySQL
Basically, there are going to be a variable number of columns with the name "word" in them.
What I want to know is, would it be faster to do a separate database call for each row (via getting the column information from the information schema), with a generated Python query per row, or would it be faster to simply SELECT *, and only use the columns I needed? Is it possible to say SELECT * NOT XYZ? As far as I can tell, no, there is no way to specifically exclude columns.
There aren't going to be many different rows at first - only three. But there's the potential for infinite rows in this. It's basically dependent on how many different types of search queries we want to put on our website. Our whole scalability is based around expanding the number of rows.
If all you are doing is limiting the number of columns returned there is no need to do a dynamic query. The hard work for the database is in selecting the rows matching your WHERE clause; it makes little difference to send you 5 columns out of 10, or all 10.
Just use a "SELECT * FROM ..." and use Python to pick out the columns from the result set. You'll use just one query to the database, so MySQL only has to work once, then filter out your columns:
cursor.execute('SELECT * FROM ...')
cols = [i for i, col in enumerate(cursor.description) if col[0].startswith('word')]
for row in cursor:
    columns = [row[c] for c in cols]
You may have to use for row in cursor.fetchall() instead, depending on your MySQL Python module.

Efficient way of phrasing multiple tuple pair WHERE conditions in SQL statement

I want to perform an SQL query that is logically equivalent to the following:
DELETE FROM pond_pairs
WHERE
((pond1 = 12) AND (pond2 = 233)) OR
((pond1 = 12) AND (pond2 = 234)) OR
((pond1 = 12) AND (pond2 = 8)) OR
((pond1 = 13) AND (pond2 = 6547)) OR
((pond1 = 13879) AND (pond2 = 6))
I will have hundreds of thousands of pond1-pond2 pairs. I have an index on (pond1, pond2).
My limited SQL knowledge came up with several approaches:
1. Run the whole query as is.
2. Batch the query up into smaller queries with n WHERE conditions each.
3. Save the pond1-pond2 pairs into a new table, and use a subquery in the WHERE clause to identify the rows to delete.
4. Convert the Python logic which identifies rows to delete into a stored procedure. Note that I am unfamiliar with programming stored procedures, so this would probably involve a steep learning curve.
I am using postgres if that is relevant.
For a large number of pond1-pond2 pairs to be deleted in a single DELETE, I would create a temporary table and join on it.
-- Create the temp table:
CREATE TEMP TABLE foo AS
    SELECT * FROM (VALUES (1, 2), (1, 3)) AS sub (pond1, pond2);

-- Delete, joining against the temp table:
DELETE FROM bar
USING foo  -- the joined table
WHERE bar.pond1 = foo.pond1
  AND bar.pond2 = foo.pond2;
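A minimal Python sketch of the same approach using psycopg2 (the DSN, the pairs list, and the temp table name are placeholders; pond_pairs, pond1 and pond2 come from the question):
import psycopg2
from psycopg2.extras import execute_values

pairs = [(12, 233), (12, 234), (13, 6547)]  # placeholder data

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE pairs_to_delete (pond1 integer, pond2 integer)")
    execute_values(cur, "INSERT INTO pairs_to_delete (pond1, pond2) VALUES %s", pairs)
    cur.execute("""
        DELETE FROM pond_pairs
        USING pairs_to_delete
        WHERE pond_pairs.pond1 = pairs_to_delete.pond1
          AND pond_pairs.pond2 = pairs_to_delete.pond2
    """)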
I would go with 3 (with a JOIN rather than a subquery) and measure the time of the DELETE query itself (without creating the table and inserting). This is a good starting point, because JOINing is a very common and well-optimized operation, so it will be hard to beat that time. Then you can compare that time to your current approach.
Also, you can try the following approach:
Sort the pairs in the same order as the index.
Delete using method 2 from your description (probably in a single transaction).
Sorting before deleting gives better index read performance, because there's a greater chance for the hard-drive cache to be effective.
With hundreds of thousands of pairs, you cannot do 1 (run the query as is), because the SQL statement would be too long.
3 is good if you have the pairs already in a table. If not, you would need to insert them first. If you do not need them later, you might just as well run the same amount of DELETE statements instead of INSERT statements.
How about a prepared statement in a loop, maybe batched (if Python supports that)?
begin transaction
prepare statement "DELETE FROM pond_pairs WHERE ((pond1 = ?) AND (pond2 = ?))"
loop over your data (in Python), and run the statement with one pair (or add to batch)
commit
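In psycopg2 this loop could look roughly like the sketch below; note that psycopg2 uses %s placeholders rather than the ? shown above, and executemany re-runs the one statement per pair inside a single transaction (the DSN and data are placeholders):
import psycopg2

pairs = [(12, 233), (12, 234), (13, 6547)]  # placeholder data

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.executemany(
        "DELETE FROM pond_pairs WHERE pond1 = %s AND pond2 = %s",
        pairs,
    )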
Where are the pairs coming from? If you can write a SELECT statement to identify them, you can just move that condition into the WHERE clause of your delete:
DELETE FROM pond_pairs WHERE (pond1, pond2) IN (SELECT pond1, pond2 FROM ...... )

How to insert several thousand columns into sqlite3?

Similar to my last question, but I ran into a problem. Let's say I have a simple dictionary like the one below, but big. When I try inserting a big dictionary using the method below, I get an operational error on c.execute(schema) for too many columns. What should be my alternative method for populating the database's columns? Using the ALTER TABLE command and adding each one individually?
import sqlite3

con = sqlite3.connect('simple.db')
c = con.cursor()

dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5},
}

# 1. Find the unique column names.
columns = set()
for _, cols in dic.items():
    for key, _ in cols.items():
        columns.add(key)

# 2. Create the schema.
col_defs = [
    # Start with the column for our key name
    '"row_name" VARCHAR(2) NOT NULL PRIMARY KEY'
]
for column in columns:
    col_defs.append('"%s" REAL NULL' % column)
schema = "CREATE TABLE simple (%s);" % ",".join(col_defs)
c.execute(schema)
# 3. Loop through each row
for row_name, cols in dic.items():
    # Compile the data we have for this row.
    col_names = cols.keys()
    col_values = [str(val) for val in cols.values()]
    # Insert it.
    sql = 'INSERT INTO simple ("row_name", "%s") VALUES ("%s", "%s");' % (
        '","'.join(col_names),
        row_name,
        '","'.join(col_values)
    )
    c.execute(sql)
If I understand you right, you're not trying to insert thousands of rows, but thousands of columns. SQLite has a limit on the number of columns per table (by default 2000), though this can be adjusted if you recompile SQLite. Never having done this, I do not know if you then need to tweak the Python interface, but I'd suspect not.
You probably want to rethink your design. Any non-data warehouse / OLAP application is highly unlikely to need or be terribly efficient with thousands of columns (rows, yes) and SQLite is not a good solution for a data warehouse / OLAP type situation. You may get a bit further with something like an entity-attribute-value setup (not a normal recommendation for genuine relational databases, but a valid application data model and much more likely to accommodate your needs without pushing the limits of SQLite too far).
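For illustration, a minimal sketch of what an entity-attribute-value layout for the question's dictionary could look like in sqlite3 (the table name simple_eav and its column names are hypothetical):
import sqlite3

con = sqlite3.connect('simple.db')
c = con.cursor()

# One narrow table instead of thousands of columns:
# one row per (row_name, attribute, value) triple.
c.execute("""
    CREATE TABLE IF NOT EXISTS simple_eav (
        row_name  TEXT NOT NULL,
        attribute TEXT NOT NULL,
        value     REAL,
        PRIMARY KEY (row_name, attribute)
    )
""")

dic = {
    'x1': {'y1': 1.0, 'y2': 0.0},
    'x2': {'y1': 0.0, 'y2': 2.0, 'joe bla': 1.5},
    'x3': {'y2': 2.0, 'y3 45 etc': 1.5},
}

rows = [(row_name, attr, val)
        for row_name, cols in dic.items()
        for attr, val in cols.items()]
c.executemany("INSERT INTO simple_eav VALUES (?, ?, ?)", rows)
con.commit()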
If you really are adding a massive number of rows and are running into problems, maybe your single transaction is getting too large.
Do a COMMIT (commit()) after a given number of lines (or even after each insert as a test) if that is acceptable.
Thousands of rows should be easily doable with SQLite. Getting to millions and above, at some point there might be a need for something more. It depends on a lot of things, of course.
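A small self-contained sketch of the periodic-commit idea (the table, data, and batch size are arbitrary placeholders):
import sqlite3

con = sqlite3.connect('simple.db')
c = con.cursor()
c.execute("CREATE TABLE IF NOT EXISTS numbers (n INTEGER)")

BATCH = 1000  # arbitrary batch size

for i, n in enumerate(range(100000), start=1):
    c.execute("INSERT INTO numbers VALUES (?)", (n,))
    if i % BATCH == 0:
        con.commit()  # flush the transaction every BATCH inserts
con.commit()          # commit whatever is left over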
