Executing an SQL update statement from a pandas dataframe - python

Context: I am using MSSQL, pandas, and pyodbc.
Steps:
Obtain dataframe from query using pyodbc (no problemo)
Process columns to generate the content of a new (but already existing) column
Fill an auxiliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t WHERE t.ID = df.ID)
Now how do I execute the SQL code in the auxiliary column without looping through each row?
Sample data
The first two columns are obtained by querying dbo.table, the third column exists but is empty in the database. The fourth column only exists in the dataframe to prepare the SQL statement that would correspond to updating dbo.table.
ID  raw                   processed    strSQL
--  --------------------  -----------  ------------------------------------------------------------------------
1   lorum.ipsum#test.com  lorum ipsum  UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2   rumlo.sumip#test.com  rumlo sumip  UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3   ...                   ...          ...
I would like to execute the SQL script in each row in an efficient manner.

After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.
For an existing table named MillionRows
ID TextField
-- ---------
1 foo
2 bar
3 baz
…
and example data of the form
num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
took about 180 seconds (3 minutes).
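For reference, here is a minimal sketch of the setup being timed; the connection string is a placeholder and only the table from the example above is assumed:
import time
import pyodbc

# placeholder connection string; adjust driver/server/database for your environment
cnxn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;"
)
cnxn.autocommit = False
crsr = cnxn.cursor()
crsr.fast_executemany = True

t0 = time.perf_counter()
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
cnxn.commit()
print(f"executemany took {time.perf_counter() - t0:.0f} seconds")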
However, by creating a user-defined table type
CREATE TYPE dbo.TextField_ID AS TABLE
(
TextField nvarchar(255) NULL,
ID int NOT NULL,
PRIMARY KEY (ID)
)
and a stored procedure
CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
SET NOCOUNT ON;
UPDATE mr SET TextField = t.TextField
FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END
when I used
crsr.execute("{CALL mr_update (?)}", (rows,))
it did the same update in approximately 80 seconds (less than half the time).
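Tying this back to the original question: instead of building a strSQL column, the dataframe columns can be turned into parameter tuples and passed either to executemany() or to a TVP-based procedure like the one above. A sketch, assuming a pyodbc connection cnxn, a dataframe df with the ID and processed columns shown earlier, and a hypothetical table type/procedure for dbo.table:
# (processed, ID) tuples straight from the dataframe
params = list(df[["processed", "ID"]].itertuples(index=False, name=None))

crsr = cnxn.cursor()

# option 1: parameterised executemany
crsr.fast_executemany = True
crsr.executemany("UPDATE [dbo].[table] SET processed = ? WHERE ID = ?", params)

# option 2: single call to a TVP-based stored procedure (hypothetical dbo.table_update)
# crsr.execute("{CALL table_update (?)}", (params,))

cnxn.commit()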

Related

Python cx_Oracle executemany() how many rows inserted into DB

I use cx_oracle executemany function to insert data into a table in Oracle.
After committing, I would like to check how many records were actually appended to the table.
Can this be done, and how?
Thanks
If you are using a cursor with the executemany method, then use the cursor's rowcount attribute to retrieve the number of rows affected by executemany.
There are many nuances associated with the Cursor object's executemany method for SELECT and DML statements. Take a look at the cx_Oracle documentation for details at https://cx-oracle.readthedocs.io/en/latest/user_guide/batch_statement.html#batchstmnt
It may be helpful if you could post a code snippet of what is being attempted to elicit a more accurate response.
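A minimal sketch of the rowcount approach, assuming an open cx_Oracle connection con and a list of bind tuples rows (the test table mirrors the one used in the next answer):
cur = con.cursor()
cur.executemany("INSERT INTO test (id, name) VALUES (:1, :2)", rows)
print("rows affected:", cur.rowcount)  # total rows affected by the executemany call
con.commit()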
I don't use Python, but - as your question is related to Oracle - if you can "move" the insert process into a stored procedure that you then call from Python, you could utilize the sql%rowcount attribute, which returns the number of rows affected by the most recent SQL statement run from within PL/SQL.
Here's an example:
SQL> set serveroutput on
SQL> begin
2 insert into test (id, name)
3 select 1, 'Little' from dual union all
4 select 2, 'Foot' from dual union all
5 select 3, 'Amir' from dual;
6 dbms_output.put_line('Inserted ' || sql%rowcount || ' row(s)');
7 end;
8 /
Inserted 3 row(s)
^
|
value returned by SQL%ROWCOUNT
PL/SQL procedure successfully completed.
SQL>
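From Python, such a procedure could expose that count through an OUT parameter; a hypothetical sketch (the procedure name insert_test_rows and the connection details are assumptions, not part of the answer above):
import cx_Oracle

# placeholder credentials / DSN
con = cx_Oracle.connect(user="scott", password="tiger", dsn="localhost/orclpdb1")
cur = con.cursor()

inserted = cur.var(int)                       # bind variable to receive the OUT parameter
cur.callproc("insert_test_rows", [inserted])  # hypothetical procedure that sets it from SQL%ROWCOUNT
print("Inserted", inserted.getvalue(), "row(s)")
con.commit()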
The executemany method has a parameter named "parameters", which is a list of sequences/dictionaries. The size of this list determines how many times the statement is executed (and each execution returns the number of rows it affected). So you can get this information, but it will be a list of integers (one for each execution).
Let me show an example:
SQL> Create table abc_1
(id number)
;
Table created
Python:
dsn = cx_Oracle.makedsn("xxx.xxx.xxx.xxx", 1521, service_name="xxxx")
con = cx_Oracle.connect(user="xxxx", password="xxxx", dsn=dsn)
tab_cursor = con.cursor()
tab_query = '''
insert into abc_1 (id) select :x from dual where :y >= level connect by level<=10
'''
foo = {"x": 1, "y": 5} # it will insert 5 rows
foo2 = {"x": 1, "y": 6} # it will insert 6 rows
foo3 = {"x": 1, "y": 10} # it will insert 10 rows
tab_cursor.executemany(tab_query, parameters=[foo, foo2, foo3], arraydmlrowcounts=True)
print("Rows inserted:", tab_cursor.getarraydmlrowcounts())
Output shows how many rows were inserted for each execution:
Rows inserted: [5, 6, 10]

psycopg2 Syntax errors at or near "' '"

I have a dataframe named Data2 and I wish to put its values inside a PostgreSQL table. For reasons, I cannot use to_sql as some of the values in Data2 are numpy arrays.
This is Data2's schema:
cursor.execute(
    """
    DROP TABLE IF EXISTS Data2;
    CREATE TABLE Data2 (
        time timestamp without time zone,
        u bytea,
        v bytea,
        w bytea,
        spd bytea,
        dir bytea,
        temp bytea
    );
    """
)
My code segment:
for col in Data2_mcw.columns:
    for row in Data2_mcw.index:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        cursor.execute(
            """
            INSERT INTO Data2_mcw(%s)
            VALUES (%s)
            """,
            (col.replace('\"',''), value)
        )
Error generated:
psycopg2.errors.SyntaxError: syntax error at or near "'time'"
LINE 2: INSERT INTO Data2_mcw('time')
How do I rectify this error?
Any help would be much appreciated!
There are two problems I see with this code.
The first problem is that you cannot use bind parameters for column names, only for values. The first of the two %s placeholders in your SQL string is invalid. You will have to use string concatenation to set column names, something like the following (assuming you are using Python 3.6+):
cursor.execute(
    f"""
    INSERT INTO Data2_mcw({col})
    VALUES (%s)
    """,
    (value,))
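If you prefer not to interpolate the column name yourself, psycopg2's sql module can compose identifiers safely; a short sketch under the same assumptions as the snippet above:
from psycopg2 import sql

query = sql.SQL("INSERT INTO Data2_mcw ({col}) VALUES (%s)").format(
    col=sql.Identifier(col)
)
cursor.execute(query, (value,))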
The second problem is that a SQL INSERT statement inserts an entire row. It does not insert a single value into an already-existing row, as you seem to be expecting it to.
Suppose your dataframe Data2_mcw looks like this:
a b c
0 1 2 7
1 3 4 9
Clearly, this dataframe has six values in it. If you were to run your code on this dataframe, then it would insert six rows into your database table, one for each value, and the data in your table would look like the following:
a    b    c
1
3
     2
     4
          7
          9
I'm guessing you don't want this: you'd rather your database table contained the following two rows instead:
a b c
1 2 7
3 4 9
Instead of inserting one value at a time, you will have to insert one entire row at a time. This means you have to swap your two loops around, build the SQL string up once beforehand, and collect together all the values for a row before passing it to the database. Something like the following should hopefully work (please note that I don't have a Postgres database to test this against):
column_names = ",".join(Data2_mcw.columns)
placeholders = ",".join(["%s"] * len(Data2_mcw.columns))
sql = f"INSERT INTO Data2_mcw({column_names}) VALUES ({placeholders})"

for row in Data2_mcw.index:
    values = []
    for col in Data2_mcw.columns:
        value = Data2_mcw[col].loc[row]
        if type(value).__module__ == np.__name__:
            value = pickle.dumps(value)
        values.append(value)
    cursor.execute(sql, values)
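For larger frames, the per-row execute calls can be batched; a sketch using psycopg2.extras.execute_values, reusing the column_names variable built above:
from psycopg2.extras import execute_values

rows = []
for row in Data2_mcw.index:
    rows.append(tuple(
        pickle.dumps(v) if type(v).__module__ == np.__name__ else v
        for v in (Data2_mcw[col].loc[row] for col in Data2_mcw.columns)
    ))

# execute_values expands the single %s into the list of row tuples
execute_values(cursor, f"INSERT INTO Data2_mcw({column_names}) VALUES %s", rows)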

Update multiple rows of SQL table from Python script

I have a massive table (over 100B records) to which I added an empty column. I parse strings from another (string) field if the required string is available, extract an integer from it, and want to update the new column with that integer for all rows that contain the string.
At the moment, after the data has been parsed and saved locally in a dataframe, I iterate over it to update the Redshift table with the clean data. This takes approx 1 sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()

clean_df = raw_data.apply(clean_field_to_parse)

for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
    cur.execute(update_query)
where update_query is a function to generate the update query:
def update_query(id, int1, int2):
    query = """
    update tab_tab
    set
        clean_int_1 = {}::int,
        clean_int_2 = {}::int,
        updated_date = GETDATE()
    where id = {}
    ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
id . field_to_parse . clean_int_1 . clean_int_2
1 . {'int_1':'2+1'}. 3 . np.nan
2 . {'int_2':'7-0'}. np.nan . 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the data frame. You can even add an index on the join field, id, and drop the very large staging table at the end.
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://myuser:mypwd!@myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(sql)

    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(sql)

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(sql)

engine.dispose()

UPDATE a field with a value from another table

I need to update a field with a value from another table in MySQL, using Python Connector (not that important though). I need to select a value from one table based on a matching criteria and insert the extracted column back into the previous table based on the same matching criteria.
I have the following, which doesn't work, of course.
for match_field in list:
    cursor_importer.execute(UPDATE table1 SET table1_field =
                            (SELECT field_new FROM table2 WHERE match_field = %s)
                            WHERE match_field = %s LIMIT 1,
                            (match_field, match_field))
You can use UPDATE with JOINS.
Below is an example in MySQL:
UPDATE table1 a JOIN table2 b ON a.match_field = b.match_field
SET a.table1_field = b.field_new
WHERE a.match_field = 'filter criteria'
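From Python, that single set-based statement replaces the per-value loop entirely; a sketch using mysql.connector (connection details are placeholders; table and column names follow the question):
import mysql.connector

cnx = mysql.connector.connect(user="myuser", password="mypwd", database="mydb")  # placeholder credentials
cursor = cnx.cursor()
cursor.execute(
    """
    UPDATE table1 a
    JOIN table2 b ON a.match_field = b.match_field
    SET a.table1_field = b.field_new
    """
)
cnx.commit()
cursor.close()
cnx.close()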

Merge tables from two different databases - sqlite3/Python

I have two different SQLite databases XXX and YYY.
XXX contains table A and YYY contains B respectively.
A and B have same structure(columns).
How do I append the rows of B to A using Python's sqlite3 API?
After appending, A should contain the rows of A plus the rows of B.
You first get a connection to the database using sqlite3.connect, then create a cursor so you can execute sql. Once you have a cursor, you can execute arbitrary sql commands.
Example:
import sqlite3

# Get connections to the databases
db_a = sqlite3.connect('database_a.db')
db_b = sqlite3.connect('database_b.db')

# Get the contents of a table
b_cursor = db_b.cursor()
b_cursor.execute('SELECT * FROM mytable')
output = b_cursor.fetchall()   # Returns the results as a list.

# Insert those contents into another table.
a_cursor = db_a.cursor()
for row in output:
    a_cursor.execute('INSERT INTO myothertable VALUES (?, ?, ...etc..., ?, ?)', row)

# Cleanup
db_a.commit()
a_cursor.close()
b_cursor.close()
Caveat: I haven't actually tested this, so it might have a few bugs in it, but the basic idea is sound, I think.
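An alternative that avoids pulling the rows through Python at all is SQLite's ATTACH DATABASE; a sketch, assuming the same file and table names as above and matching column layouts:
import sqlite3

db_a = sqlite3.connect('database_a.db')

# attach the second database file to the same connection
db_a.execute("ATTACH DATABASE 'database_b.db' AS other")

# copy every row in one set-based statement
db_a.execute("INSERT INTO myothertable SELECT * FROM other.mytable")
db_a.commit()

db_a.execute("DETACH DATABASE other")
db_a.close()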
This is a generalized function and should be customized to your particular environment. To do this, you may replace the "dynamically determine SQL expression requirements" section with static SQL parameters (rather than PRAGMA table_info). This should improve performance.
import sqlite3


def merge_tables(cursor_new: sqlite3.Cursor, cursor_old: sqlite3.Cursor, table_name: str, del_old_table: bool = False) -> None:
    '''
    This function merges the content of a specific table from an old cursor into a new cursor.

    :param cursor_new: [sqlite3.Cursor] the primary cursor
    :param cursor_old: [sqlite3.Cursor] the secondary cursor
    :param table_name: [str] the name of the table
    :return: None
    '''
    # dynamically determine SQL expression requirements
    column_names = cursor_new.execute(f"PRAGMA table_info({table_name})").fetchall()
    column_names = tuple([x[1] for x in column_names][1:])  # remove the primary keyword
    values_placeholders = ', '.join(['?' for x in column_names])  # format appropriately

    # SQL select columns from table
    data = cursor_old.execute(f"SELECT {', '.join(column_names)} FROM {table_name}").fetchall()

    # insert the data into the primary cursor
    cursor_new.executemany(f"INSERT INTO {table_name} {column_names} VALUES ({values_placeholders})", data)
    if (cursor_new.connection.commit() == None):
        # With Ephemeral RAM connections & testing, deleting the table may be ill-advised
        if del_old_table:
            cursor_old.execute(f"DELETE FROM {table_name}")  # cursor_old.execute(f'DROP TABLE {table_name}')
            cursor_old.connection.commit()
        print(f"Table {table_name} merged from {cursor_old.connection} to {cursor_new.connection}")  # Consider logging.info()
    return None
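A hypothetical usage sketch for the function above (file and table names are placeholders):
con_new = sqlite3.connect('primary.db')
con_old = sqlite3.connect('secondary.db')

merge_tables(con_new.cursor(), con_old.cursor(), 'mytable')

con_old.close()
con_new.close()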
