I know that the "insert or update if key exists" option for .to_sql() hasn't been implemented yet, so I'm looking for an alternative.
The first thing that comes to mind is to use the append option:
data.to_sql(
    "Dim_Objects",
    con=connection,
    if_exists="append",
    index=False
)
and remove duplicates in the database separately, after the data has been inserted:
DELETE FROM "Dim_Objects" a
USING "Dim_Objects" b
WHERE a."Code" = b."Code"
AND a."TimeStampUpdate" < b."TimeStampUpdate"
In this case, if there's a duplicate, I only keep the latest entry.
This approach seems to work but I hoped I could achieve the same using pandas directly.
Any ideas?
Can you try this? Write the incoming data to a staging table, then update the target table from it:
data.to_sql('staging_table', con=connection, if_exists='replace', index=False)
sql = """
UPDATE "Dim_Objects" AS d
SET col1 = s.col1
FROM staging_table AS s
WHERE d."Code" = s."Code"
"""
with engine.begin() as conn:
    conn.execute(sql)
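If you are on SQLAlchemy 1.4+/2.0, note that conn.execute() no longer accepts a plain SQL string; a minimal adjustment is to wrap it in text():
from sqlalchemy import text

# engine.begin() opens a transaction and commits it when the block exits cleanly
with engine.begin() as conn:
    conn.execute(text(sql))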
I am using pandas' to_sql method to insert data into a mysql table. The mysql table already exists and I'd like to avoid inserting duplicate rows.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
Is there a way to do this in python?
# mysql connection
import pandas as pd
import pymysql
from sqlalchemy import create_engine
user = 'user1'
pwd = 'xxxx'
host = 'aa1.us-west-1.rds.amazonaws.com'
port = 3306
database = 'main'
engine = create_engine("mysql+pymysql://{}:{}@{}:{}/{}".format(user, pwd, host, port, database))
con = engine.connect()
df.to_sql(name="dfx", con=con, if_exists = 'append')
con.close()
Are there any work-arounds, if there isn't a straight forward way to do this?
It sounds like you want to do an "upsert" (insert, or update if the row already exists). Pangres is a useful package that will let you do an upsert straight from a pandas DataFrame. If you don't want to update a row when it already exists, that is also an option: set if_row_exists to 'ignore'.
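A minimal sketch with a recent pangres version, reusing the connection details and table name from the question; pangres matches rows on the DataFrame index, so the index must be set to the table's primary/unique key (the key column name below is hypothetical):
import pandas as pd
from pangres import upsert
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user1:xxxx@aa1.us-west-1.rds.amazonaws.com:3306/main")

# pangres decides insert-vs-update based on the DataFrame index
df_keyed = df.set_index("some_unique_key")  # hypothetical key column

# if_row_exists='update' overwrites matching rows; 'ignore' leaves them untouched
upsert(con=engine, df=df_keyed, table_name="dfx", if_row_exists="update")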
I have never heard of 'upsert' before today, but it sounds interesting. You could certainly delete dupes after the data is loaded into your table.
WITH a AS
(
    SELECT Firstname,
           ROW_NUMBER() OVER (PARTITION BY Firstname, empID ORDER BY Firstname) AS duplicateRecCount
    FROM dbo.tblEmployee
)
-- Now delete duplicate records
DELETE FROM a
WHERE duplicateRecCount > 1
That will work fine, unless you have billions of rows.
Here is my code
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("connection string")
conn_obj = engine.connect()
my_df = pd.DataFrame({'col1': ['29199'], 'date_created': ['2022-06-29 17:15:49.776867']})
my_df.to_sql('SomeSQLTable', conn_obj, if_exists='append', index = False)
I also created SomeSQLTable with script:
CREATE TABLE SomeSQLTable(
col1 nvarchar(90),
date_created datetime2)
GO
Everything runs fine, but no records are inserted into the SQL table and no errors are displayed. I am not sure how to troubleshoot. conn_obj works fine; I was able to pull data.
This may not be exactly the answer, but I don't have the privileges to comment right now.
First of all, pd.to_sql() returns the number of rows affected by the operation; can you please check that (see the sketch below)?
Secondly, you are defining the data types in the table creation, so it could be a problem of casting the data types. I never create the table through SQL, as pd.to_sql() can create it if needed.
Lastly, please check the table name; Pascal case can be an issue in some databases.
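To illustrate the first point, here is a minimal sketch of that check, reusing the code from the question; wrapping the write in engine.begin() also rules out an uncommitted transaction, which is a common cause of "no rows, no error" when passing an explicit connection:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("connection string")  # placeholder, as in the question
my_df = pd.DataFrame({'col1': ['29199'], 'date_created': ['2022-06-29 17:15:49.776867']})

# engine.begin() opens a transaction and commits it when the block exits cleanly
with engine.begin() as conn:
    rows = my_df.to_sql('SomeSQLTable', conn, if_exists='append', index=False)

print(rows)  # number of rows affected (pandas >= 1.4)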
I have a database that contains two tables, cdr and mtr. I want to join the two on columns ego_id and alter_id, and I want to output the result into another table in the same database, complete with the column names, without using pandas.
Here's my current code:
import sqlalchemy as db
from sqlalchemy import Table, MetaData

# engine and connection are created earlier in the usual way
mtr_table = Table('mtr', MetaData(), autoload=True, autoload_with=engine)
print(mtr_table.columns.keys())
cdr_table = Table('cdr', MetaData(), autoload=True, autoload_with=engine)
print(cdr_table.columns.keys())

query = db.select([cdr_table])
query = query.select_from(
    mtr_table.join(
        cdr_table,
        (mtr_table.columns.ego_id == cdr_table.columns.ego_id) &
        (mtr_table.columns.alter_id == cdr_table.columns.alter_id)
    )
)
results = connection.execute(query).fetchmany()
Currently, for my test code, what I do is convert the results to a pandas dataframe and then put it back into the original SQL database:
df = pd.DataFrame(results, columns=results[0].keys())
df.to_sql(...)
but I have two problems:
loading everything into a pandas dataframe would require too much memory when I start working with the full database
the columns names are (apparently) not included in results and would need to be accessed by results[0].keys()
I've checked this other stackoverflow question but it uses the ORM framework of sqlalchemy, which I unfortunately don't understand. If there's a simpler way to do this (like pandas' to_sql), I think this would be easier.
What's the easiest way to go about this?
So I found out how to do this via CREATE TABLE AS:
query = """
CREATE TABLE mtr_cdr AS
SELECT
mtr.idx,cdr.*
FROM mtr INNER JOIN cdr
ON (mtr.ego_id = cdr.ego_id AND mtr.alter_id = cdr.alter_id)""".format(new_table)
with engine.connect() as conn:
conn.execute(query)
The query string seems to be highly sensitive to parentheses, though. If I put parentheses around the whole SELECT...FROM... statement, it doesn't work.
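If you prefer to stay in SQLAlchemy Core (for example to keep the target table's column definitions under your own control), an insert-from-select also keeps everything on the database side; a sketch assuming the target table mtr_cdr already exists with matching columns:
from sqlalchemy import Table, MetaData, select, insert

metadata = MetaData()
mtr = Table('mtr', metadata, autoload=True, autoload_with=engine)
cdr = Table('cdr', metadata, autoload=True, autoload_with=engine)
mtr_cdr = Table('mtr_cdr', metadata, autoload=True, autoload_with=engine)  # assumed to already exist

# SELECT the joined rows...
sel = select([mtr.c.idx] + list(cdr.c)).select_from(
    mtr.join(cdr, (mtr.c.ego_id == cdr.c.ego_id) & (mtr.c.alter_id == cdr.c.alter_id))
)

# ...and insert them into the target table entirely on the database side
stmt = insert(mtr_cdr).from_select(list(mtr_cdr.columns.keys()), sel)

with engine.begin() as conn:
    conn.execute(stmt)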
I am using the Python code below to update a Postgres DB column value based on id. This loop has to run for thousands of records and it is taking a long time.
Is there a way I can pass an array of dataframe values instead of looping over each row?
for i in range(0, len(df)):
    QUERY = """UPDATE "Table" SET "value" = '%s' WHERE "Table"."id" = '%s'""" % (df['value'][i], df['id'][i])
    cur.execute(QUERY)
    conn.commit()
It depends on the library you use to communicate with PostgreSQL, but bulk inserts are usually much faster via the COPY FROM command.
If you use psycopg2 it is as simple as following:
cursor.copy_from(io.StringIO(string_variable), "destination_table", columns=('id', 'value'))
where string_variable is a tab- and newline-delimited dataset like 1\tvalue1\n2\tvalue2\n.
To achieve a performant bulk update I would (full sketch after these steps):
Create a temporary table with the same structure as the destination: CREATE TEMPORARY TABLE tmp_table (LIKE destination_table);
Insert the records with copy_from;
Update the destination table from it: UPDATE destination_table SET value = t.value FROM tmp_table t WHERE destination_table.id = t.id, or any other preferred syntax
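A minimal sketch of that flow with psycopg2, assuming the destination table is "Table" with columns id and value as in the question (the connection parameters are placeholders):
import io
import psycopg2

conn = psycopg2.connect("dbname=main user=user1 password=xxxx host=localhost")  # placeholder credentials
cur = conn.cursor()

# 1. Temporary table with the same structure as the destination
cur.execute('CREATE TEMPORARY TABLE tmp_table (LIKE "Table")')

# 2. Bulk-load the dataframe via COPY (tab-separated, no header, no index)
buf = io.StringIO()
df[['id', 'value']].to_csv(buf, sep='\t', header=False, index=False)
buf.seek(0)
cur.copy_from(buf, 'tmp_table', columns=('id', 'value'))

# 3. Single set-based update from the temporary table
cur.execute('UPDATE "Table" SET "value" = t.value FROM tmp_table t WHERE "Table"."id" = t.id')
conn.commit()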
I want to "insert ignore" an entire pandas dataframe into mysql. Is there a way to do this without looping over the rows?
In dataframe.to_sql I only see the option if_exists='append', but will this still continue on duplicate unique keys?
Consider using a temp table (with the exact structure of the final table) that is always replaced by pandas, then run the INSERT IGNORE in a cursor call:
dataframe.to_sql('myTempTable', con, if_exists='replace', index=False)  # index=False so the columns line up with the final table
cur = con.cursor()
cur.execute("INSERT IGNORE INTO myFinalTable SELECT * FROM myTempTable")
con.commit()
There is no way to do this in pandas as of the current version (0.20.3).
The option if_exists applies only to the table (not to rows), as stated in the documentation:
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
Via Looping
This will slow down the process as you are inserting one row at a time
from sqlalchemy.exc import IntegrityError

for x in range(data_frame.shape[0]):
    try:
        data_frame.iloc[x:x+1].to_sql(con=sql_engine, name="table_name", if_exists='append')
    except IntegrityError:
        # Your code to handle duplicates
        pass