Pandas to_sql() to update unique values in DB? - python

How can I use df.to_sql(if_exists='append') to append ONLY the unique values between the dataframe and the database? In other words, I would like to evaluate the duplicates between the DF and the DB and drop those duplicates before writing to the database.
Is there a parameter for this?
I understand that if_exists='append' and if_exists='replace' apply to the entire table, not to the unique entries.
I am using:
sqlalchemy
pandas dataframe with the following datatypes:
index: datetime.datetime <-- Primary Key
float
float
float
float
integer
string <-- Primary Key
string <-- Primary Key
I'm stuck on this so your help is much appreciated. -Thanks

In pandas, there is no convenient argument in to_sql to append only the non-duplicates to a final table. Consider using a staging temp table that pandas always replaces, then run a final append query that migrates the temp table records into the final table, keeping only unique PKs via the NOT EXISTS clause.
import sqlalchemy

engine = sqlalchemy.create_engine(...)

# stage the dataframe in a temp table that is fully replaced on every run
df.to_sql(name='myTempTable', con=engine, if_exists='replace')

with engine.begin() as cn:
    sql = """INSERT INTO myFinalTable (Col1, Col2, Col3, ...)
             SELECT t.Col1, t.Col2, t.Col3, ...
             FROM myTempTable t
             WHERE NOT EXISTS
                 (SELECT 1 FROM myFinalTable f
                  WHERE t.MatchColumn1 = f.MatchColumn1
                    AND t.MatchColumn2 = f.MatchColumn2)"""
    cn.execute(sqlalchemy.text(sql))  # text() is required for raw SQL strings on modern SQLAlchemy
This is an ANSI SQL solution, not restricted to vendor-specific methods like UPSERT, and so is portable across practically all SQL-compliant relational databases.
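If portability is not a concern, many databases also offer a native upsert that collapses the two steps into one statement. A minimal sketch for SQLite (the connection string and the Col1/Col2/Col3 column names here are placeholders, not from the original answer):
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///mydb.sqlite")  # placeholder connection string

# stage the dataframe exactly as above
df.to_sql(name='myTempTable', con=engine, if_exists='replace')

with engine.begin() as cn:
    # SQLite-specific: rows whose primary key already exists in myFinalTable are skipped;
    # myFinalTable must declare a PRIMARY KEY or UNIQUE constraint for this to de-duplicate
    cn.execute(sqlalchemy.text("""
        INSERT OR IGNORE INTO myFinalTable (Col1, Col2, Col3)
        SELECT t.Col1, t.Col2, t.Col3
        FROM myTempTable t
    """))
PostgreSQL's INSERT ... ON CONFLICT DO NOTHING and MySQL's INSERT IGNORE play the same role.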

Related

Difficulty Inserting Json Dictionary To MSSQL

I am new to Python, but have a seemingly very simple exercise that I am struggling to figure out. This is a two-part issue.
First: I have a list of Json objects that I am getting from an API. I want to enter each list item as a row in a Dataframe to preserve that row's json object/dict for storing in a database, to allow for editing and reposting. In addition, I want to convert the list into a standard Dataframe (easy enough). In essence, it will be a standard dataframe with the row's raw json contained as an additional column. I've managed to accomplish this by joining a Series to a Dataframe, but I am relying on the index to join. My first question is whether the join can be based not on the index, but on the 'id' value in the dataframe matched to the 'id' contained in each element of the json object/dict in the list? The rationale is that in doing so, I'd eliminate ordering concerns and I'd be 100% sure the json object is associated with the correct dataframe row.
Second: As I mentioned above, I've simply joined the series to the dataframe using the index, which works in this case because they both use the same 'data' list. The output looks good when I print, but when I go to insert it into MSSQL, it does not like the json dictionary ('Json_Series' column). The insert works fine when I just eliminate that field. I could see how an inline fstring for the insert concatenation might work if you did some type of cast or convert on the dict, but I will be doing this for many APIs, so I am trying to avoid writing custom insert language (i.e. I would like to rely on to_sql or an equivalent method/class/library). I have also tried changing the column with .astype('str') prior to insert, but that doesn't work, as has been documented elsewhere.
The error I get is:
sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('Invalid parameter type. param-index=0 param-type=dict', 'HY105')
[SQL: INSERT INTO dbo.[TestInsert_JsonObject_Python] ([Json_Series], id, value) VALUES (?, ?, ?), (?, ?, ?)]
[parameters: ({'id': 1, 'value': 2}, 1, 2, {'id': 3, 'value': 4}, 3, 4)]
Clearly, the problem is the first parameter, which is being passed as a dictionary. Removing that column resolves the issue.
Here is what I've tried. This shows the Series (df), the dataframe (df2) and the combined join ('inner_merged'). The join is vulnerable in the future to ordering if I were to join different lists. I need the join to reference the internal 'id' in the json dict when joining against the dataframe's 'id' column. The combined join ('inner_merged') is, however, in the form that I'd like to be able to just insert into SQL.
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine import URL
data = [{"id": 1, "value": 2}, {"id": 3, "value": 4}]
df = pd.Series(data, name='Json_Series')
df2 = pd.DataFrame(data)
inner_merged = pd.merge(df, df2, left_index=True, right_index=True)
print(df)
print(df2)
# inner_merged['Json_Series'] = inner_merged['Json_Series'].astype('str')
print(inner_merged)
Here is the MSSQL table creation:
CREATE TABLE [dbo].[TestInsert_JsonObject_Python](
[Json_Series] [varchar](500) NULL,
[id] [int] NULL,
[value] int NULL
)
GO
CREATE TABLE [dbo].[TestInsert_NoJsonObject_Python](
[id] [int] NULL,
[value] int NULL
)
GO
Here is the insert - the first one without the json object and the second with:
server = 'enteryourserver'
database = 'enteryourdatabase'
username = 'enteryourusername'
password = 'enteryourpassword'
driver = '{ODBC Driver 18 for SQL Server}'
table1 = 'TestInsert_NoJsonObject_Python'
table2 = 'TestInsert_JsonObject_Python'
schema = 'dbo'
connection_string = f"DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}"
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
engine = create_engine(connection_url)
df2.to_sql(table1, con=engine, schema=schema, if_exists='replace', index=False)
inner_merged.to_sql(table2, con=engine, schema=schema, if_exists='replace', index=False)
As I mentioned, I am totally new to this so any suggestions are much appreciated.
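One possible sketch, not a definitive answer, that addresses both parts: join on the 'id' inside each json dict rather than on the index, and serialize the dict to a JSON string so pyodbc can bind it as a plain varchar parameter. It reuses engine, schema and table2 as defined in the code above:
import json
import pandas as pd

data = [{"id": 1, "value": 2}, {"id": 3, "value": 4}]

# keep the raw dict alongside the 'id' pulled out of it, so the merge
# is keyed on the json content itself rather than on row order
json_df = pd.DataFrame({'Json_Series': data})
json_df['id'] = json_df['Json_Series'].apply(lambda d: d['id'])

df2 = pd.DataFrame(data)
inner_merged = pd.merge(json_df, df2, on='id')  # join on 'id', not on the index

# pyodbc cannot bind a dict; store it as a JSON string instead
inner_merged['Json_Series'] = inner_merged['Json_Series'].apply(json.dumps)

inner_merged.to_sql(table2, con=engine, schema=schema, if_exists='replace', index=False)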

What is an efficient way to run SQL update for all rows in a pandas dataframe?

I'm currently looping through the dataframe updating the SQL table for each primary key row, but this is taking a very long time.
Is there a quicker way to implement the following logic:
from sqlalchemy import text

with engine.begin() as conn:
    for i in range(len(df)):
        conn.execute(
            text('UPDATE SQL_TABLE SET Column1 = :col1 WHERE primary_key = :pk'),
            {'col1': df['Column1'].iloc[i], 'pk': df['primary_key'].iloc[i]},
        )
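No answer is recorded here, but the staging pattern from the answer above adapts naturally: push the whole dataframe in one to_sql call, then issue a single set-based UPDATE instead of one statement per row. A hedged sketch, assuming the SQL_TABLE, Column1 and primary_key names from the question and an arbitrary temp table name of myTempTable:
import sqlalchemy

# assumes `engine` and `df` from the question
df.to_sql('myTempTable', con=engine, if_exists='replace', index=False)

with engine.begin() as conn:
    # ANSI-style correlated UPDATE; one round trip instead of len(df) statements
    conn.execute(sqlalchemy.text("""
        UPDATE SQL_TABLE
        SET Column1 = (SELECT t.Column1
                       FROM myTempTable t
                       WHERE t.primary_key = SQL_TABLE.primary_key)
        WHERE EXISTS (SELECT 1
                      FROM myTempTable t
                      WHERE t.primary_key = SQL_TABLE.primary_key)
    """))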

Copying a distinct sqlite table grouped by single column results in IntegrityError: Column is not unique (Python)

I have a relatively small sqlite3 database (~2.6GB) with 820k rows and 26 columns (single table). Within this database there is a table named old_table. At the time I created this table I had no primary key, and therefore adding new rows was prone to introducing duplicates.
In terms of efficiency, I created the same database again, but this time with the column Ref set as the primary key: 'Ref VARCHAR(50) PRIMARY KEY,'. According to many resources, we should be able to select only the unique rows based on a single column with the query SELECT * FROM old_table GROUP BY Ref. I want to keep the unique values, so I insert them into a new table with INSERT INTO new_table. Afterwards I would like to drop the old table with DROP TABLE old_table. Finally, new_table should be renamed to old_table with ALTER TABLE new_table RENAME TO old_table.
Why does my SQL state that column Ref is not unique?
# Transferring the old database to the new one, with Ref as unique primary key,
# and deleting old_table
conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/database.db")
c = conn.cursor()
c.executescript("""
    INSERT INTO new_table SELECT * FROM old_table GROUP BY Ref;
    DROP TABLE old_table;
    ALTER TABLE new_table RENAME TO old_table;
""")
conn.close()
---------------------------------------------------------------------------
IntegrityError: column Ref is not unique
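A hedged sketch of an alternative that sidesteps the GROUP BY entirely: let the new primary key do the de-duplication by telling SQLite to skip rows whose Ref already exists. Table names are taken from the question; the connection is opened with plain sqlite3 here instead of the connect_to_db helper:
import sqlite3

conn = sqlite3.connect("/mnt/wwn-0x5002538e00000000-part1/DATABASE/database.db")
conn.executescript("""
    INSERT OR IGNORE INTO new_table SELECT * FROM old_table;
    DROP TABLE old_table;
    ALTER TABLE new_table RENAME TO old_table;
""")
conn.commit()
conn.close()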

Update SQL database with dataframe content

I have a pandas dataframe containing two columns: ID and MY_DATA. I have an SQL database that contains a column named ID and some other data. I want to match the ID column of the SQL database to the ID column of the dataframe and update the database with the new MY_DATA column.
So far I used the following:
import sqlite3
import pandas as pd

df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')
cur = con.cursor()
for row in cur.execute('SELECT ID FROM main;').fetchall():
    for i in range(len(df)):
        if row[0] == df.ID.iloc[i]:
            update_sqldb(df, i)
However, I think this way of having two nested for-loops is probably ugly and not very pythonic. I thought that maybe I should use the map() function, but is this the right direction to go?
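One way to drop the nested loops, sketched under the assumption that the target column in the main table is also called MY_DATA: let sqlite3's executemany push all the (value, id) pairs through a single parameterized UPDATE.
import sqlite3
import pandas as pd

df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')
cur = con.cursor()

# .tolist() converts numpy scalars to native Python types that sqlite3 can bind
params = list(zip(df['MY_DATA'].tolist(), df['ID'].tolist()))

# one statement, executed by the driver once per dataframe row
cur.executemany('UPDATE main SET MY_DATA = ? WHERE ID = ?', params)
con.commit()
con.close()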

column names as variables while creating table

I am new to Python sqlite, and I have a problem with a create table query.
I need to create a table, but I have the column names of the table as a list.
columnslist = ["column1", "column2", "column3"]
Now, I have to create a table MyTable with the above columns. But the problem is, I won't know beforehand how many columns there are in columnslist.
Is it possible to create a table with the number and names of columns given in columnslist, and what is the syntax?
You can first convert your list to a tuple and use str.format:
import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
# tuple(columnslist) renders as ('column1', 'column2', 'column3'), which SQLite
# accepts as a column list (the columns get no declared type)
c.execute('''CREATE TABLE table_name {}'''.format(tuple(columnslist)))
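A slightly more defensive variant (a sketch, not part of the original answer): build the column list with str.join, so that each column gets an explicit type and a one-element list does not produce a trailing comma.
import sqlite3

columnslist = ["column1", "column2", "column3"]

conn = sqlite3.connect('example.db')
c = conn.cursor()

# quote each name and give it a type; works for any number of columns
cols = ", ".join('"{}" TEXT'.format(col) for col in columnslist)
c.execute('CREATE TABLE IF NOT EXISTS MyTable ({})'.format(cols))
conn.commit()
conn.close()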
