Update SQL Database based on matched ID in Dataframe - python

I have the dataframe below with the respective values and would like to update my SQL Server database wherever an ID matches one in my dataframe.
df dataframe:

ID   VALUE
123  9
456  11

SQL Database Server, table1:

ID   VALUE
456  62
623  41
123  3
563  67
After updating, I want my SQL Database Server to look like this, where you'll notice that IDs 123 & 456 have been given new values based on my dataframe:
ID   VALUE
456  11
623  41
123  9
563  67
Does anyone know how I could do this in my query when executing?
query = DELETE/UPDATE table table1 where ID = ID IN DATAFRAME
conn.execute(query)

You can create a parameter list (df_list) to go with a DML statement, arranging the columns in the order they appear in the statement. In this case the two arguments (value and id) need to be in the reverse of the dataframe's column order, like so:
cur = con.cursor()
sql = "UPDATE [table1] SET [value] = ? WHERE [id] = ?"

# Move the last column (VALUE) in front of the ID column so each row
# lines up with the (?, ?) placeholders above.
cols = df.columns.tolist()
df_list = df[cols[-1:] + cols[:-1]].values.tolist()

cur.executemany(sql, df_list)
cur.close()
con.commit()
con.close()
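For the two-row sample frame above, the reordered parameter list comes out as value/ID pairs matching the placeholder order:

df_list  # -> [[9, 123], [11, 456]], i.e. (VALUE, ID) pairs for "SET [value] = ? WHERE [id] = ?"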

You can simply write a correlated subquery as follows:
update table1 t1
set t1.value = (select df.value from df where df.id = t1.id)
where exists (select 1 from df where df.id = t1.id);
Or use an inner join in the update as follows:
UPDATE t
SET t.value = d.value -- , other column updates go here
FROM table1 AS t
INNER JOIN df AS d ON t.id = d.id;
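Note that both statements above assume df is visible to the database as a table. A minimal sketch of how that could be staged from pandas, assuming a SQLAlchemy engine and a throwaway staging table named df_staging (both the connection string and the table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical connection string; adjust driver/server/database to your setup.
engine = create_engine("mssql+pyodbc://user:password@my_dsn")

with engine.begin() as conn:
    # Push the dataframe into a throwaway staging table.
    df.to_sql("df_staging", conn, if_exists="replace", index=False)

    # Join the staging table against table1 and copy the new values across.
    conn.execute(text(
        "UPDATE t SET t.value = s.VALUE "
        "FROM table1 AS t INNER JOIN df_staging AS s ON t.id = s.ID"
    ))

    # Clean up the staging table afterwards.
    conn.execute(text("DROP TABLE df_staging"))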

Related

Selecting rows from sql if they are also in a dataframe

I have an MS SQL Server with a lot of rows (around 4 million) covering all the customers and their information.
I can also get a list of phone numbers of all visitors of my website in a given timeframe as a CSV file, which I then convert to a dataframe in Python. What I want to do is select two columns from my server (one is the phone number and the other is a property of that person), but only for the people who are in both my dataframe and my server.
What I currently do is select all customers from SQL Server and then merge them with my dataframe. But obviously this is not very fast. Is there any way to do this faster?
query2 = """
SELECT encrypt_phone, col2
FROM DatabaseTable
"""
cursor.execute(query2)
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
df1.merge(df2, how='inner', indicator=True)
If your DataFrame does not have many rows, I would do it the simple way, as here:
V = df["colx"].unique()
Q = 'SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'.format(','.join(['?']*len(V)))
cursor.execute(Q, tuple(V))
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
NB : colx and coly are the columns that refer to the customers (id, or name, ..) in the pandas DataFrame and in the SQL table, respectively.
Otherwise, you may need to store df1 as a table in your DB and then perform a sub-query :
df1.to_sql('DataFrameTable', conn, index=False) #this will store df1 in the DB
Q = "SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN (SELECT colx FROM DataFrameTable)"
df2 = pd.read_sql_query(Q, conn)
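If df1 has too many rows for a single IN (...) list but you would rather not create a table, a possible middle ground is to run the query in chunks and concatenate the pieces. A sketch, assuming the same cursor and the colx/coly placeholders from above:

import pandas as pd

chunk_size = 1000                      # stay well below the driver's parameter limit
values = df1["colx"].unique().tolist()
frames = []
for start in range(0, len(values), chunk_size):
    chunk = values[start:start + chunk_size]
    placeholders = ",".join(["?"] * len(chunk))
    q = f"SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({placeholders})"
    cursor.execute(q, tuple(chunk))
    rows = cursor.fetchall()
    frames.append(pd.DataFrame.from_records(rows, columns=[c[0] for c in cursor.description]))
df2 = pd.concat(frames, ignore_index=True)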

DuckDB - efficiently insert pandas dataframe to table with sequence

CREATE TABLE temp (
id UINTEGER,
name VARCHAR,
age UINTEGER
);
CREATE SEQUENCE serial START 1;
Insertion with the sequence works just fine:
INSERT INTO temp VALUES(nextval('serial'), 'John', 13)
How can I use the sequence with a pandas dataframe?
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df)
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
con.execute("INSERT INTO temp SELECT * FROM df")
RuntimeError: Binder Error: table temp has 3 columns but 2 values were supplied
I don't want to iterate item by item. The goal is to efficiently insert thousands of items from Python into the DB. I'm OK with changing pandas for something else.
Can't you have nextval('serial') as part of your select query when reading the df?
e.g.,
con.execute("INSERT INTO temp SELECT nextval('serial'), Name, Age FROM df")

Executing an SQL update statement from a pandas dataframe

Context: I am using MSSQL, pandas, and pyodbc.
Steps:
Obtain dataframe from query using pyodbc (no problemo)
Process columns to generate the content of a new (but already existing) column
Fill an auxiliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t WHERE t.ID = df.ID)
Now how do I execute the sql code in the auxilliary column, without looping through each row?
Sample data
The first two columns are obtained by querying dbo.table; the third column exists but is empty in the database. The fourth column only exists in the dataframe, to prepare the SQL statement that would correspond to updating dbo.table.
ID  raw                   processed    strSQL
1   lorum.ipsum@test.com  lorum ipsum  UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2   rumlo.sumip@test.com  rumlo sumip  UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3   ...                   ...          ...
I would like to execute the SQL script in each row in an efficient manner.
After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.
For an existing table named MillionRows
ID TextField
-- ---------
1 foo
2 bar
3 baz
…
and example data of the form
num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
took about 180 seconds (3 minutes).
However, by creating a user-defined table type
CREATE TYPE dbo.TextField_ID AS TABLE
(
    TextField nvarchar(255) NULL,
    ID int NOT NULL,
    PRIMARY KEY (ID)
)
and a stored procedure
CREATE PROCEDURE [dbo].[mr_update]
    @tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE MillionRows SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END
when I used
crsr.execute("{CALL mr_update (?)}", (rows,))
it did the same update in approximately 80 seconds (less than half the time).
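For reference, the Python side of the TVP call is just an ordinary pyodbc execute with the whole list of tuples passed as the single table-valued argument. A sketch, with a hypothetical connection string:

import pyodbc

# Hypothetical driver/server/database; adjust for your environment.
cnxn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;",
    autocommit=False,
)
crsr = cnxn.cursor()

num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]

# The list of (TextField, ID) tuples is bound to the single TVP parameter.
crsr.execute("{CALL mr_update (?)}", (rows,))
cnxn.commit()

crsr.close()
cnxn.close()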

compare each column of 2 tables and write matching rows in a 3rd table using loop in python

Compare each column of 2 tables and write matching rows into a 3rd table, writing non-matching rows as "Not Mapped" in the 3rd table, using a loop in Python:
Compare the first column of Table A to the first column of Table B; if they match, compare the 2nd column of Table A with the 2nd column of Table B; if they also match, compare the 3rd column of Table A with the 3rd column of Table B; if they also match, write the matching rows into a new Table C, otherwise write "Not Mapped" into Table C.
I am not able to write proper code for this since I am new to python. Any help would be appreciated.
I have 2 tables:
Table A
employeeID  employee  managerID  DirectorID  Date
12          A         100        234         2017-01-01
13          B         101        235         2017-01-01
14          C         104        234         2017-01-02
15          D         101        236         2017-01-01
and Table B as:
Table B:
employeeID managerID DirectorID Director
12 100 234 X
12 101 235 Y
12 101 236 Z
13 102 236 W
14 104 234 V
17 105 239 U
and my Table C contains the following columns:
employeeid, managerid, directorid, director, Date
and this table C should have output as:
Table C:
employeeid managerid directorid director date
12 100 234 X 2017-01-01
14 104 234 V 2017-01-02
This is the code I am trying:
cursor.execute(""" select * from employee """)
results = cursor.fetchall()
for result in results:
employeeID = result[0]
managerID = result[2]
DirectorID = result[3]
Date = result[4]
cursor.execute(""" select * from manager """)
dataall = cursor.fetchall()
for data in dataall:
employee = data[0]
manager = data[1]
Director = data[2]
Director_tableb = data[3]
i = 0
j = 0
while i < len(dataall) and j < len(results):
for result in results:
if employeeID == employee:
for data in dataall:
if (employeeID == employee) and (managerID == manager) and (DirectorID == Resource):
cursor.execute(""" Insert into Table_C (%s, %s, %s, %s, %s) """, (employeeID, managerID, DirectorID, Director_tableb, Date))
cursor_db.commit()
i =+ 1
j =+ 1
Now, that's some content to work with.
As is often the case with my answers, I will try to give you a working solution rather than an optimized one.
First, I'll store the results fetched from the first table in a few lists.
cursor.execute(""" select * from employee """)
results = cursor.fetchall()
employeeIDs = []
managerIDs = []
directorIDs = []
dates = []
for result in results:
employeeIDs.append(result[0])
managerIDs.append(result[2])
directorIDs.append(result[3])
dates.append(result[4])
Now, I'll get the content from the 2nd table and fill the 3rd table at the same time.
cursor.execute(""" select * from manager """)
dataall = cursor.fetchall()
for data in dataall:
employee = data[0]
manager = data[1]
director = data[2]
director_tableb = data[3]
i=0
while (i<len(employeeIDs)):
if (employeeIDs[i] == employee) & (managerIDs[i] == manager) & (directorIDs[i] == director):
cursor.execute(""" Insert into Table_C (%s, %s, %s, %s, %s) """, (employeeIDs[i], managerIDs[i], directorIDs[i], director_tableb, dates[i]))
cursor_db.commit()
i=len(employeeIDs)
i =+ 1
What I do: I loop over the elements of the 2nd table and, for each element, check whether it is also present in the first table.
If the element is present in the first table (i.e. the condition of the if is true), then I end the current inner loop and add the desired values to Table C.
If the element is not found, we move on to the next row from the 2nd table.
I wrote this answer assuming the code you wrote with the cursors was correct, but if anything does not work as intended, tell me, so I can correct my answer.
If any clarification about what I did or why I did it is required, feel free to ask.
If you have some ways to improve my code, do not hesitate to tell me so.
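For what it's worth, one way the lookup could be made faster without changing the overall approach is to key the first table's rows by the (employeeID, managerID, DirectorID) triple, so each row of the second table is checked in constant time. A sketch, assuming the same cursors, column positions, and paramstyle as above:

cursor.execute(""" select * from employee """)
results = cursor.fetchall()

# Map each (employeeID, managerID, DirectorID) triple to its Date.
dates_by_key = {(r[0], r[2], r[3]): r[4] for r in results}

cursor.execute(""" select * from manager """)
for employee, manager, director, director_tableb in cursor.fetchall():
    key = (employee, manager, director)
    if key in dates_by_key:
        cursor.execute(
            """ INSERT INTO Table_C VALUES (%s, %s, %s, %s, %s) """,
            (employee, manager, director, director_tableb, dates_by_key[key]),
        )
cursor_db.commit()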

how to collapse/compress/reduce string columns in pandas

Essentially, what I am trying to do is join Table_A to Table_B using a key to do a lookup in Table_B to pull column records for names present in Table_A.
Table_B can be thought of as the master name table that stores various attributes about a name. Table_A represents incoming data with information about a name.
There are two columns that represent a name: a column named 'raw_name' and a column named 'real_name'. The 'raw_name' has a code string prefixed to the real_name.
i.e.
raw_name = CE993_VincentHanna
real_name = VincentHanna
Key = real_name, which exists in Table_A and Table_B
Please see the mySQL tables and query here: http://sqlfiddle.com/#!9/65e13/1
For all real_names in Table_A that DO-NOT exist in Table_B I want to store raw_name/real_name pairs into an object so I can send an alert to the data-entry staff for manual insertion.
For all real_names in Table_A that DO exist in Table_B, which means we know about this name and can add the new raw_name associated with this real_name into our master Table_B
In MySQL, this is easy to do, as you can see in my sqlfiddle example. I join on real_name and collapse the result by grouping on a.real_name, since I don't care if there are multiple records in Table_B for the same real_name.
All I want is to pull the attributes (stats1, stats2, stats3) so I can assign them to the newly discovered raw_name.
In the mySQL query result I can then separate the NULL records to be sent for manual data-entry and automatically insert the remaining records into Table_B.
Now, I am trying to do the same in pandas but am stuck at the point of the groupby on real_name.
e = {'raw_name': pd.Series(['AW103_Waingro', 'CE993_VincentHanna', 'EES43_NeilMcCauley', 'SME16_ChrisShiherlis',
'MEC14_MichaelCheritto', 'OTP23_RogerVanZant', 'MDU232_AlanMarciano']),
'real_name': pd.Series(['Waingro', 'VincentHanna', 'NeilMcCauley', 'ChrisShiherlis', 'MichaelCheritto',
'RogerVanZant', 'AlanMarciano'])}
f = {'raw_name': pd.Series(['SME893_VincentHanna', 'TVA405_VincentHanna', 'MET783_NeilMcCauley',
'CE321_NeilMcCauley', 'CIN453_NeilMcCauley', 'NIPS16_ChrisShiherlis',
'ALTW12_MichaelCheritto', 'NSP42_MichaelCheritto', 'CONS23_RogerVanZant',
'WAUE34_RogerVanZant']),
'real_name': pd.Series(['VincentHanna', 'VincentHanna', 'NeilMcCauley', 'NeilMcCauley', 'NeilMcCauley',
'ChrisShiherlis', 'MichaelCheritto', 'MichaelCheritto', 'RogerVanZant',
'RogerVanZant']),
'stats1': pd.Series(['meh1', 'meh1', 'yo1', 'yo1', 'yo1', 'hello1', 'bye1', 'bye1', 'namaste1',
'namaste1']),
'stats2': pd.Series(['meh2', 'meh2', 'yo2', 'yo2', 'yo2', 'hello2', 'bye2', 'bye2', 'namaste2',
'namaste2']),
'stats3': pd.Series(['meh3', 'meh3', 'yo3', 'yo3', 'yo3', 'hello3', 'bye3', 'bye3', 'namaste3',
'namaste3'])}
df_e = pd.DataFrame(e)
df_f = pd.DataFrame(f)
df_new = pd.merge(df_e, df_f, how='left', on='real_name', suffixes=['_left', '_right'])
df_new_grouped = df_new.groupby(df_new['raw_name_left'])
Now how do I compress/collapse the groups in df_new_grouped on real_name like I did in MySQL?
Once I have an object with the collapsed results I can slice the dataframe to report real_names we don't have a record of (NULL values) and those that we already know and can store the newly discovered raw_name.
You can drop duplicates based on the column raw_name_left and also remove the raw_name_right column using drop:
In [99]: df_new.drop_duplicates('raw_name_left').drop('raw_name_right', 1)
Out[99]:
            raw_name_left        real_name    stats1    stats2    stats3
0           AW103_Waingro          Waingro       NaN       NaN       NaN
1      CE993_VincentHanna     VincentHanna      meh1      meh2      meh3
3      EES43_NeilMcCauley     NeilMcCauley       yo1       yo2       yo3
6    SME16_ChrisShiherlis   ChrisShiherlis    hello1    hello2    hello3
7   MEC14_MichaelCheritto  MichaelCheritto      bye1      bye2      bye3
9      OTP23_RogerVanZant     RogerVanZant  namaste1  namaste2  namaste3
11    MDU232_AlanMarciano     AlanMarciano       NaN       NaN       NaN
Just to be thorough, this can also be done using groupby, which I found on Wes McKinney's blog, although drop_duplicates is cleaner and more efficient.
http://wesmckinney.com/blog/filtering-out-duplicate-dataframe-rows/
index = [gp_keys[0] for gp_keys in df_new_grouped.groups.values()]
unique_df = df_new.reindex(index)
unique_df
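From there, splitting the collapsed result into the names that need manual data entry (no match in Table_B, so the stats are NaN) and the ones that can be inserted automatically is straightforward. A sketch, reusing df_new from above:

collapsed = df_new.drop_duplicates('raw_name_left').drop(columns='raw_name_right')

needs_manual_entry = collapsed[collapsed['stats1'].isnull()]   # e.g. Waingro, AlanMarciano
ready_to_insert    = collapsed[collapsed['stats1'].notnull()]  # known real_names with their stats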
