How to specify a column's default value as a placeholder? - python

I've got a table with multiple columns, several of which are optional. I'm reading records from an external source, in which each record may specify values for the optional columns or not. For each record, I'd like to insert a row into the database with the given values plus the column defaults for any column that's not specified.
If all the columns are specified, I obviously just use a basic INSERT statement:
db_cursor.execute("insert into table (col1, col2, col3, col4, col5) " +
"values (%s, %s, %s, %s, %s)",
(value_1, value_2, value_3, value_4, value_5))
However, if some values are unspecified, there doesn't seem to be an easy way to use the defaults for only those values. You can use the DEFAULT keyword in SQL (or, equivalently, leave those columns out of the insert statement entirely), as e.g.
db_cursor.execute("insert into table (col1, col2, col3, col4, col5) " +
"values (%s, %s, %s, DEFAULT, %s)",
(value_1, value_2, value_3, value_5))
But you can't pass 'DEFAULT' as a placeholder value; it'll just become that string.
So far I can only think of three approaches to this problem:
Construct the SQL query string itself at run-time based on the input data, rather than using parameterization. This is a very strong anti-pattern due to the usual SQL injection reasons. (This application isn't actually security-critical, but I don't want such anti-patterns in my code.)
Write a different query string for each possible combination of specified and unspecified parameters. With four optional columns, that's 2^4 = 16 different query strings for what is logically the same insert. This is obviously unworkable.
Make the application aware of the default values and have it send them explicitly whenever a column is unspecified. This breaks SPOT (single point of truth) for the defaults, with all the attendant maintenance and interoperability headaches (multiple applications read the database). Of the approaches I can think of, this is probably the least bad, but I'd still prefer not to have to do it.
Is there an easier way to manage dynamically sending defaults?

The way I usually deal with this is to have a placeholder in place of the column list and string format() the list of columns. This is safe, as the list of columns is controlled by the dev, and isn't untrusted user input.
stmt_without_col_names = 'INSERT INTO table ({}) VALUES ({})'
input_values = [1, None, 1, None, None]
columns = ('col1', 'col2', 'col3', 'col4', 'col5')
columns_to_keep = {k: v for k, v in zip(columns, input_values) if v is not None}
# note: relies on the dict keeping insertion order, which is guaranteed
# in Python 3.7+ - on older versions use an OrderedDict or another
# data structure if you're worried
column_str = ', '.join(columns_to_keep)
format_str = ','.join(['%s'] * len(columns_to_keep))
stmt = stmt_without_col_names.format(column_str, format_str)
# stmt looks like "INSERT INTO table (col1, col3) VALUES (%s,%s)"
cursor.execute(stmt, list(columns_to_keep.values()))

Related

INSERT SQL records from one database to a second (where 2nd has an additional column)

I'd like to insert selected records from Table A --> Table B (in this example, different "databases" == different tables, so as not to worry about ATTACH), where Table A has fewer columns than Table B. The additional B_Table column (col3) should also be populated.
I've tried this sequence in raw-SQL (through SQLAlch.):
1.) INSERTing A_Table into Table B using an engine.connect().execute(text)
text("INSERT INTO B_Table (col1, col2) SELECT col1, col2 FROM A_Table")
2.) UPDATEing B_Table w/ col3 info with an engine.connect()ion (all newly inserted records are populated/updated w/ the same identifier, NewInfo)
text("UPDATE B_Table SET col3 = NewInfo WHERE B_Table.ID >= %s" % (starting_ID#_of_INSERT'd_records))
More efficient alternative?
But this is incredibly inefficient. It takes 4x longer to UPDATE a single column than to INSERT. This seems like it should be a fraction of the INSERT time. I'd like to reduce the total time to ~just the insertion time.
What's a better way to copy data from one table to another w/out INSERTing followed by an UPDATE? I was considering a:
1.) SQLAlchemy session.query(A_Table), but wasn't sure how to then edit that object (for col3) and then insert that updated object w/out loading all the A_Table queried info into RAM (which I understand raw-SQL's INSERT does not do).
You can use 'NewInfo' as a string literal in the SELECT statement:
INSERT INTO B_Table (col1, col2, col3)
SELECT col1, col2, 'NewInfo'
FROM A_Table;
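If you're executing it through SQLAlchemy as in the original attempt, a rough sketch (reusing the table and column names from the question, and assuming engine is the existing SQLAlchemy engine) could look like:
from sqlalchemy import text

insert_stmt = text(
    "INSERT INTO B_Table (col1, col2, col3) "
    "SELECT col1, col2, 'NewInfo' FROM A_Table")

# engine.begin() opens a transaction and commits it on successful exit
with engine.begin() as conn:
    conn.execute(insert_stmt)
This keeps everything in a single INSERT ... SELECT, so there is no follow-up UPDATE pass.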

Iteratively INSERTing from a Dataframe

My question may be out of pure ignorance. Given an arbitrary dataframe of, say, 5 rows, I want to insert that dataframe into a DB (in my case it's PostgreSQL). The general code to do that is along the lines of:
postgres_insert_query = """INSERT INTO table (ID, MODEL, PRICE) VALUES (%s, %s, %s)"""
record_to_insert = (1, 'A', 100)
cursor.execute(postgres_insert_query, record_to_insert)
Is it a common practice that when inserting more than one row of data, you iterate over your rows and do that?
It appears that every article or example I see is about inserting a single row to a DB.
In Python you could simply loop over the rows of your dataframe and do your inserts one at a time.
for record in dataframe.itertuples(index=False):
    sql = '''INSERT INTO table (col1, col2, col3)
             VALUES ('{}', '{}', '{}')
          '''.format(record[1], record[0], record[2])
    dbo.execute(sql)
This is highly simplistic. You may want to use something like sqlalchemy and make sure you use prepared statements. Never overlook security.
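With a psycopg2-style driver you can also keep the query parameterised and hand the whole frame over in one call; a rough sketch, assuming the dataframe columns are already named col1, col2, col3 and that cursor/connection come from your open database connection:
insert_sql = "INSERT INTO table (col1, col2, col3) VALUES (%s, %s, %s)"

# one parameter tuple per dataframe row; the driver handles quoting
rows = list(dataframe[['col1', 'col2', 'col3']].itertuples(index=False, name=None))
cursor.executemany(insert_sql, rows)
connection.commit()
For large frames you'd more likely reach for pandas' to_sql() with an SQLAlchemy engine, but the above is the minimal loop/batch version.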

How to delete large quantity of records from Oracle Table that has no primary key

The situation: I'm loading an entire SQL table into my program. For convenience I'm using pandas to maintain the row data. I am then creating a dataframe of rows I would like to have removed from the SQL table. Unfortunately (and I can't change this) the table does not have any primary key other than the built-in Oracle ROWID (which isn't a real table column, it's a pseudocolumn), but I can make ROWID part of my dataframe if I need to.
The table has hundreds of thousands of rows, and I'll probably be deleting a few thousand records with each run of the program.
Question:
Using cx_Oracle, what is the best method of deleting multiple rows/records that don't have a primary key? I don't think creating a loop to submit thousands of delete statements is very efficient or pythonic. I am also concerned about building a single SQL delete statement keyed off ROWID that contains an IN clause with thousands of items:
Where ROWID IN ('eg1','eg2',........, 'eg2345')
Is this concern valid? Any Suggestions?
Using ROWID
Since you can use ROWID, that is the ideal way to do it, and depending on the Oracle version the overall query length limit may be large enough for a query with that many elements. The real issue is the number of elements in an IN expression list, which is limited to 1000.
So you'll either have to break the list of ROWIDs into batches of 1000 at a time or delete a single row at a time, with or without executemany(). A batching sketch follows the length check below.
>>> len(delrows) # rowids to delete
5000
>>> q = 'DELETE FROM sometable WHERE ROWID IN (' + ', '.join(f"'{row}'" for row in delrows) + ')'
>>> len(q) # length of the query
55037
>>> # let's try with just the first 1000 id's and no extra spaces
... q = 'DELETE FROM sometable WHERE ROWID IN (' + ','.join(f"'{row}'" for row in delrows[:1000]) + ')'
>>> len(q)
10038
You're probably within query-length limits, and can even save some chars with a minimal ',' item separator.
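A rough sketch of the batching approach, assuming cursor and connection are the open cx_Oracle cursor/connection and delrows is the list of ROWID strings as above (bind variables could be generated per chunk instead if you prefer parameterised queries):
CHUNK = 1000  # Oracle's IN expression list limit

for start in range(0, len(delrows), CHUNK):
    chunk = delrows[start:start + CHUNK]
    q = ('DELETE FROM sometable WHERE ROWID IN ('
         + ','.join(f"'{rid}'" for rid in chunk) + ')')
    cursor.execute(q)
connection.commit()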
Without ROWID
Without a primary key or ROWID, the only way to identify each row is to specify all of its columns in the WHERE clause; to delete many rows in one statement, those per-row conditions need to be OR'd together:
DELETE FROM sometable
WHERE ( col1 = 'val1'
AND col2 = 'val2'
AND col3 = 'val3' ) -- row 1
OR ( col1 = 'other2'
AND col2 = 'value2'
AND col3 = 'val3' ) -- row 2
OR ( ... ) -- etc
As you can see it's not the nicest query to construct but allows you to do it without ROWIDs.
In both cases you probably won't be using parameterised queries as-is, since the IN list in the first approach and the OR grouping in the second are variable in length. (Yes, you could parameterise them by generating the whole extended SQL with thousands of bind placeholders; I'm not sure what the limit is on that.) The executemany() way is definitely easier to write, but for speed the single large queries (either of the above two) will probably outperform executemany() with thousands of items.
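If you did want the OR-grouped form parameterised, one way to generate it is sketched below; rows is assumed to be a list of (col1, col2, col3) tuples pulled from the dataframe:
conditions = []
binds = {}
for i, (c1, c2, c3) in enumerate(rows):
    # one uniquely named bind variable per column per row
    conditions.append(f"(col1 = :c1_{i} AND col2 = :c2_{i} AND col3 = :c3_{i})")
    binds.update({f"c1_{i}": c1, f"c2_{i}": c2, f"c3_{i}": c3})

sql = "DELETE FROM sometable WHERE " + " OR ".join(conditions)
cursor.execute(sql, binds)  # cx_Oracle accepts a dict of named binds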
You can use cursor.executemany() to delete multiple rows at once. Something like the following should work:
dataToDelete = [['eg1'], ['eg2'], ...., ['eg2345']]
cursor.executemany("delete from sometable where rowid = :1", dataToDelete)

python sqlite3 select multiple rows with duplicates

I want to draw random samples from a large database, and I want those samples to be paired, which means that I either care about the order of results from a (series of) SELECT statement(s) or have to reorder afterwards. Additionally, there may be duplicate rows as well. This is fine, but I want an efficient way to pull these samples straight from the db. I understand that SELECT statements cannot be used with cursor.executemany, but really that is what I would like.
There is a similar question here
where the OP seems to be asking for a multi-select, but is happy with the current top answer, which suggests using IN in the WHERE clause. This is not really what I am looking for. I'd prefer something more like ken.ganong's solution, but I wonder about its efficiency.
More precisely, I do something like the following:
import sqlite3
import numpy as np
# create the database and inject some values
values = [
    (1, "Hannibal Smith", "Command"),
    (2, "The Faceman", "Charm"),
    (3, "Murdock", "Pilot"),
    (4, "B.A. Baracas", "Muscle")]
con = sqlite3.connect('/tmp/test.db')
cur = con.cursor()
cur.execute(
    'CREATE TABLE a_team (tid INTEGER PRIMARY KEY, name TEXT, role TEXT)')
con.commit()
cur.executemany('INSERT INTO a_team VALUES(?, ?, ?)', values)
con.commit()
# now let's say that I have these pairs of values I want to select roles for
tid_pairs = np.array([(1, 2), (1, 3), (2, 1), (4, 3), (3, 4), (4, 3)])
# what I currently do is run multiple selects, insert into a running
# list and then numpy.array and reshape the result
out_roles = []
select_query = "SELECT role FROM a_team WHERE tid = ?"
for tid in tid_pairs.flatten():
    cur.execute(select_query, (tid,))
    out_roles.append(cur.fetchall()[0][0])

role_pairs = np.array(out_roles).reshape(tid_pairs.shape)
To me it seems like there must be a more efficient way of passing a SELECT statement to the db which requests multiple rows, each with their own constraints, but as I say executemany cannot be used with a SELECT statement. The alternative is to use an IN constraint in the WHERE clause and then reconstruct the duplicates within Python, as in the sketch below.
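Something like the following is roughly what I mean by that alternative (a sketch only, reusing the variables from above):
# one SELECT with IN, then rebuild the ordering/duplicates in Python
unique_tids = tuple(set(tid_pairs.flatten().tolist()))
placeholders = ','.join('?' * len(unique_tids))
cur.execute(
    "SELECT tid, role FROM a_team WHERE tid IN ({})".format(placeholders),
    unique_tids)
role_by_tid = dict(cur.fetchall())
role_pairs = np.array(
    [role_by_tid[tid] for tid in tid_pairs.flatten()]).reshape(tid_pairs.shape)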
There are a few extra constraints, for instance, I may have non-existing rows in the db and I may want to handle that by dropping an output pair, or replacing with a default value, but these things are a side issue.
Thanks in advance.

Can I split INSERT statement into several ones without repeat inserting rows?

I have such an INSERT statement:
mtemp = "station, calendar, type, name, date, time"
query = "INSERT INTO table (%s) VALUES ( '%s', '%s', '%s', %s, '%s', '%s' );"
query = query % (mtemp, mstation, mcalendar, mtype, mname, mdate, mtime)
curs.execute(query, )
conn.commit()
The problem is that I cannot get the variables mcalendar, mdate and mtime into this statement; they are not constant values, and I would have to access each of them within a for loop. The values of mstation, mtype and mname, however, are fixed. I tried to split the INSERT statement into several ones: one for each of the three variables inside a for loop, and one for the three fixed values in a single for loop. The for loop is basically there to decide when to insert rows: I have a list rows1 and a list rows2, where rows1 is a full list of records while rows2 lacks some of them. I check whether each rows2 record exists in rows1; if it does, I execute the INSERT statement, otherwise I do nothing.
I ran the codes and found two problems:
It's inserting way more rows than it is supposed to. It should insert no more than 240 rows, since there are only 240 time occurrences in each day for each sensor (I wonder if it is because I wrote too many for loops, so that it keeps inserting rows). Right now it's producing more than 400 new rows.
The new rows being inserted into the table only have values in the fixed-value columns. The three columns I insert with the separate for loops have no values at all.
Hope someone give me some tip here. Thanks in advance! I can put more codes here if needed. I’m not even sure if I’m in the right track.
I'm not sure I understand exactly your scenario, but is this the sort of thing you need?
Pseudo code:
mstation = "foo"
mtype = "bar"
mname = "baz"
mtemp = "station, calendar, type, name, date, time"
queryTemplate = "INSERT INTO table (%s) VALUES ( '%s', '%s', '%s', %s, '%s', '%s' );"
for mcalendar in calendars:
    for mdate in dates:
        for mtime in times:
            query = queryTemplate % (mtemp, mstation, mcalendar, mtype, mname, mdate, mtime)
            curs.execute(query)
An INSERT statement with a single VALUES list always corresponds to one new row in a table. (Unless of course there is an error during the insert.) You can INSERT a row and then UPDATE it later to add or change information, but there is no such thing as splitting up an INSERT.
If you have a query which needs to be executed multiple times with changing data, the best option is a prepared statement. A prepared statement "compiles" an SQL query but leaves placeholders that can be set each time it is executed. This improves performance because the statement doesn't need to be parsed on every execution. You didn't specify what library you're using to connect to Postgres, so I don't know what the syntax would be, but it's something to look into.
If you can't or don't want to use prepared statements, you'll just have to create the query string once for each insert. Don't substitute the values in before the loop; wait until you know them all before creating the query.
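With psycopg2, for example, a parameterised version of the loop might look roughly like this (the connection/cursor setup and the calendars, dates and times lists are assumed to exist already):
insert_sql = ("INSERT INTO table (station, calendar, type, name, date, time) "
              "VALUES (%s, %s, %s, %s, %s, %s)")

for mcalendar in calendars:
    for mdate in dates:
        for mtime in times:
            # the driver handles quoting, so no manual '%s' quoting in the SQL
            curs.execute(insert_sql,
                         (mstation, mcalendar, mtype, mname, mdate, mtime))
conn.commit()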
The following multi-row VALUES syntax works in SQL Server 2008 (and in PostgreSQL) but not in SQL Server 2005.
CREATE TABLE Temp (id int, name varchar(10));
INSERT INTO Temp (id, name) VALUES (1, 'Anil'), (2, 'Ankur'), (3, 'Arjun');
SELECT * FROM Temp;
id | name
------------
1 | Anil
2 | Ankur
3 | Arjun
