Iteratively INSERTing from a Dataframe - python

My question may be out of pure ignorance. Given an arbitrary dataframe of say 5 rows. I want to insert that dataframe into a DB (in my case it's postgresSQL). General code to do that is along the lines of:
postgres_insert_query = """ INSERT INTO table (ID, MODEL, PRICE) VALUES (%s,%s,%s)""" record_to_insert = (1, 'A', 100) cursor.execute(postgres_insert_query, record_to_insert)
Is it a common practice that when inserting more than one row of data, you iterate over your rows and do that?
It appears that every article or example I see is about inserting a single row to a DB.

In python you could simply loop over your data frame and then do your inserts.
for record in dataframe:
sql = '''INSERT INTO table (col1, col2, col3)
VALUES ('{}', '{}', '{}')
'''.format(record[1], record[0], record[2])
dbo.execute(sql)
This is highly simplistic. You may want to use something like sqlalchemy and make surre you use prepared statements. Never overlook security.

Related

INSERT SQL records from one database to a second (where 2nd has an additional column)

I'd like to insert select records from Table A --> Table B (in this example case, different "databases" == different tables to not worry about ATTACH), where Table A has less columns than Table B. The additional B_Table column (col3) should also be populated.
I've tried this sequence in raw-SQL (through SQLAlch.):
1.) INSERTing A_Table into Table B using an engine.connect().execute(text)
text("INSERT INTO B_Table (col1, col2) SELECT col1, col2 FROM A_Table")
2.) UPDATEing B_Table w/ col3 info with an engine.connect()ion (all newly inserted records are populated/updated w/ the same identifier, NewInfo)
text("UPDATE B_Table SET col3 = NewInfo WHERE B_Table.ID >= %s" % (starting_ID#_of_INSERT'd_records))
More efficient alternative?
But this is incredibly inefficient. It takes 4x longer to UPDATE a single column than to INSERT. This seems like it should be a fraction of the INSERT time. I'd like to reduce the total time to ~just the insertion time.
What's a better way to copy data from one table to another w/out INSERTing followed by an UPDATE? I was considering a:
1.) SQLAlchemy session.query(A_Table), but wasn't sure how to then edit that object (for col3) and then insert that updated object w/out loading all the A_Table queried info into RAM (which I understand raw-SQL's INSERT does not do).
You can use 'NewInfo' as a string literal in the SELECT statement:
INSERT INTO B_Table (col1, col2, col3)
SELECT col1, col2, 'NewInfo'
FROM A_Table;

Why is this sql statement super slow?

I am writing large amounts of data to a sqlite database. I am using a temporary dataframe to find unique values.
This sql code takes forever in conn.execute(sql)
if upload_to_db == True:
print(f'########################################WRITING TO TEMP TABLE: {symbol} #######################################################################')
master_df.to_sql(name='tempTable', con=engine, if_exists='replace')
with engine.begin() as cn:
sql = """INSERT INTO instrumentsHistory (datetime, instrumentSymbol, observation, observationColName)
SELECT t.datetime, t.instrumentSymbol, t.observation, t.observationColName
FROM tempTable t
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)"""
print(f'##############################################WRITING TO FINAL TABLE: {symbol} #################################################################')
cn.execute(sql)
running this takes forever to write to the database. Can someone help me understand how to speed it up?
Edit 1:
How many rows roughly? -About 15,000 at a time. Basically it is pulling data into a pandas dataframe and making some transformations and then writing it to a sqlite database. there are probably 600 different instruments and each having like 15,000 rows so 9M rows ultimately. Give or take a million....
Depending on your SQL database, you could try using something like INSERT INTO IGNORE (MySQL), or MERGE (e.g. on Oracle), which would do the insert only if it would not violate a primary key or unique constraint. This would assume that such a constraint would exist on the 4 columns which you are checking.
In the absence of merge, you could try adding the following index to the instrumentsHistory table:
CREATE INDEX idx ON instrumentsHistory (datetime, instrumentSymbol, observation,
observationColName);
This index would allow for rapid lookup of each incoming record, coming from the tempTable, and so might speed up the insert process.
This subquery
WHERE NOT EXISTS
(SELECT 1 FROM instrumentsHistory f
WHERE t.datetime = f.datetime
AND t.instrumentSymbol = f.instrumentSymbol
AND t.observation = f.observation
AND t.observationColName = f.observationColName)
has to check every row in the table - and match four columns - until a match is found. In the worst case, there is no match and a full table scan must be completed. Therefore, the performance of the query will deteriorate as the table grows in size.
The solution, as mentioned in Tim's answer, is to create an index over the four columns to that the db can quickly determine whether a match exists.

How to delete large quantity of records from Oracle Table that has no primary key

The situation: I'm loading an entire SQL table into my program. For convenience I'm using pandas to maintain the row data. I am then creating a dataframe of rows I would like to have removed from the SQL table. Unfortunately (and I can't change this) the table does not have any primary keys other than the built-in Oracle ROWID (which isn't a real table column its a pseudocolumn), but I can make ROWID part of my dataframe if I need to.
The table has hundreds of thousands of rows, and I'll probably be deleting a few thousand records with each run of the program.
Question:
Using Cx_Oracle what is the best method of deleting multiple rows/records that don't have a primary key? I don't think creating a loop to submit thousands of delete statements is very efficient or pythonic. Although I am concerned about building a singular SQL delete statement keyed off of ROWID and that contains a clause with thousands of items:
Where ROWID IN ('eg1','eg2',........, 'eg2345')
Is this concern valid? Any Suggestions?
Using ROWID
Since you can use ROWID, that would be the ideal way to do it. And depending on the Oracle version, the query length limit may be large enough for a query with that many elements in the IN clause. The issue is the number of elements in the IN expression list - limited to 1000.
So you'll either have to break up the list of RowIDs into sets of 1000 at a time or delete just a single row at a time; with or without executemany().
>>> len(delrows) # rowids to delete
5000
>>> q = 'DELETE FROM sometable WHERE ROWID IN (' + ', '.join(f"'{row}'" for row in delrows) + ')'
>>> len(q) # length of the query
55037
>>> # let's try with just the first 1000 id's and no extra spaces
... q = 'DELETE FROM sometable WHERE ROWID IN (' + ','.join(f"'{row}'" for row in delrows[:1000]) + ')'
>>> len(q)
10038
You're probably within query-length limits, and can even save some chars with a minimal ',' item separator.
Without ROWID
Without the Primary Key or ROWID, the only way to identify each row is to specify all the columns in the WHERE clause and to do many rows at a time, they'll need to be OR'd together:
DELETE FROM sometable
WHERE ( col1 = 'val1'
AND col2 = 'val2'
AND col3 = 'val3' ) -- row 1
OR ( col1 = 'other2'
AND col2 = 'value2'
AND col3 = 'val3' ) -- row 2
OR ( ... ) -- etc
As you can see it's not the nicest query to construct but allows you to do it without ROWIDs.
And in both cases, you probably don't need to be using parameterised queries since the IN list in 1 or OR grouping in 2 is variable. (Yes, you could create it parameterised after constructing the whole extended SQL with thousands of parameters. Not sure what the limit is on that.) The executemany() way is definitely easier to write & do but for speed, the single large queries (either of the above two) will probably outperform executemany with thousands of items.
You can use cursor.executemany() to delete multiple rows at once. Something like the following should work:
dataToDelete = [['eg1'], ['eg2'], ...., ['eg2345']]
cursor.executemany("delete from sometable where rowid = :1", dataToDelete)

python sqlite3 select multiple rows with duplicates

I am wanting to perform random samples from a large database, and I am wanting those samples to be paired, which means that I either I care about the order of results from a (series of) select statement(s) or reorder afterwards. Additionally, there may be duplicate rows as well. This is fine, but I want an efficient way to make these samples straight from the db. I understand that SELECT statements cannot be used with cursor.executemany but really that is what I would like.
There is a similar question here
where the OP seems to be asking for a multi-select, but it happy with the current top answer which suggests using IN in the where clause. This is not what I am looking for really. I'd prefer something more like ken.ganong's solution, but wonder about the efficiency of this.
More precisely, I do something like the following:
import sqlite3
import numpy as np
# create the database and inject some values
values = [
(1, "Hannibal Smith", "Command"),
(2, "The Faceman", "Charm"),
(3, "Murdock", "Pilot"),
(4, "B.A. Baracas", "Muscle")]
con = sqlite3.connect('/tmp/test.db')
cur = con.cursor()
cur.execute(
'CREATE TABLE a_team (tid INTEGER PRIMARY KEY, name TEXT, role TEXT)')
con.commit()
cur.executemany('INSERT INTO a_team VALUES(?, ?, ?)', values)
con.commit()
# now let's say that I have these pairs of values I want to select role's for
tid_pairs = np.array([(1,2), (1,3), (2,1), (4,3), (3,4), (4,3)])
# what I currently do is run multiple selects, insert into a running
# list and then numpy.array and reshape the result
out_roles = []
select_query = "SELECT role FROM a_team WHERE tid = ?"
for tid in tid_pairs.flatten():
cur.execute(select_query, (tid,))
out_roles.append(cur.fetchall()[0][0])
#
role_pairs = np.array(out_roles).reshape(tid_pairs.shape)
To me it seems like there must be a more efficient way of passing a SELECT statement to the db which requests multiple rows each with their own constrants, but as I say executemany cannot be used with a SELECT statement. The alternative is to use an IN constraint in the WHERE clause then make the duplicates within python.
There are a few extra constraints, for instance, I may have non-existing rows in the db and I may want to handle that by dropping an output pair, or replacing with a default value, but these things are a side issue.
Thanks in advance.

Bulk insert with returning IDs performance

I've got a table which I want to insert around 1000 items each query, and get their PK after creation, for later use as FK for other tables.
I've tried inserting them using returning syntax in postgresql.
but it takes around 10 sec to insert
INSERT INTO table_name (col1, col2, col3) VALUES (a1,a2,a3)....(a(n-2),a(n-1),a(n)) returning id;
By removing RETURNING I get much better performance ~50ms.
I think that if I can get an atomic operation to get the first id and insert the rows at the same time I could get better performance by removing the RETURNING.
but don't understand if that is possible.
Generate id using the nextval
http://www.postgresql.org/docs/9.1/static/sql-createsequence.html
CREATE TEMP TABLE temp_val AS(
VALUES(nextval('table_name_id_seq'),a1,a2,a3),
(nextval('table_name_id_seq'),a1,a2,a3)
);
INSERT INTO table_name (id, col1, col2, col3)(
SELECT column1,column2,column3,column4
FROM temp_val
);
SELECT column1 FROM temp_val;

Categories