Pandas to_sql fails on duplicate primary key - python

I'd like to append to an existing table, using pandas df.to_sql() function.
I set if_exists='append', but my table has primary keys.
I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error.
Is this possible with pandas, or do I need to write an explicit query?

There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation, inserting only the rows that were not duplicates (the dataframe name is df):

from sqlalchemy.exc import IntegrityError

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action

You can do this with the method parameter of to_sql:

from sqlalchemy.dialects.mysql import insert

def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)

df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)

For older versions of SQLAlchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted)).

Please note that if_exists='append' relates to the existence of the table and what to do if the table does not exist; it does not relate to the contents of the table.
See the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.

Pandas currently has no option for this, but here is the GitHub issue. If you need this feature too, just upvote it.

The for-loop method above slows things down significantly. There's a method parameter you can pass to pandas.DataFrame.to_sql to customize the insert statement:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
The code below should work for Postgres and do nothing if there's a conflict with the primary key "unique_code". Change the insert dialect for your db.

def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column

    columns = []
    for c in keys:
        columns.append(column(c))

    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name

    mytable = table(table_name, *columns)
    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])
    conn.execute(do_nothing_stmt)

df.to_sql('mytable', con=sql_engine, if_exists='append', method=insert_do_nothing_on_conflicts)

Pandas doesn't support editing the actual SQL syntax of the .to_sql method, so you might be out of luck. There are some experimental programmatic workarounds (say, read the DataFrame into a SQLAlchemy object with CALCHIPAN and use SQLAlchemy for the transaction), but you may be better served by writing your DataFrame to a CSV and loading it with an explicit MySQL function (a rough sketch of that route follows the repo link below).
CALCHIPAN repo: https://bitbucket.org/zzzeek/calchipan/
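For that CSV route, here is a rough sketch; the file path, table name, and connection details are assumptions, and the MySQL server and client must both allow LOAD DATA LOCAL INFILE:

from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://user:password@localhost/db_name',
                       connect_args={'local_infile': 1})  # assumed connection details

df.to_csv('/tmp/table_data.csv', index=False, header=False)  # assumed path; column order must match the table

with engine.begin() as conn:
    # IGNORE skips rows that would violate the primary key, mirroring INSERT IGNORE
    conn.execute(text(
        "LOAD DATA LOCAL INFILE '/tmp/table_data.csv' "
        "IGNORE INTO TABLE table_name "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
    ))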

I had trouble where I was still getting the IntegrityError.
...strange, but I just took the above and worked it backwards:

for i, row in df.iterrows():
    sql = "SELECT * FROM `Table_Name` WHERE `key` = '{}'".format(row.Key)
    found = pd.read_sql(sql, con=Engine)
    if len(found) == 0:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)

In my case, I was trying to insert new data into an empty table, but some of the rows were duplicated - almost the same issue as here. I could have fetched the existing data, merged it with the new data, and continued from there (a sketch of that alternative appears after this answer), but that is not optimal and may only work for small data, not huge tables.
As pandas does not provide any handling for this situation right now, I was looking for a suitable workaround, so I made my own. I am not sure whether it will work for you, but I decided to control my data first instead of relying on luck: I remove duplicates before calling .to_sql, so if any error happens, I know more about my data and what is going on:

import pandas as pd

def write_to_table(table_name, data):
    df = pd.DataFrame(data)
    # Sort by price, so that dropping duplicates keeps the lowest price only
    df = df.sort_values('price')
    df.drop_duplicates(subset=['id_key'], keep='first', inplace=True)
    df.to_sql(table_name, engine, index=False, if_exists='append', schema='public')

So in my case, I wanted to keep the row with the lowest price (by the way, I was passing a list of dicts as data), and for that I sorted first. The sort is not strictly necessary, but it is an example of what I mean by controlling the data I want to keep.
I hope this helps someone in almost the same situation as mine.
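A minimal sketch of the fetch-and-filter alternative mentioned above; the table and key column names are assumed, and it only suits tables small enough to read the key column back into memory:

existing_keys = pd.read_sql('SELECT id_key FROM table_name', engine)  # assumed table/column names
new_rows = df[~df['id_key'].isin(existing_keys['id_key'])]
new_rows.to_sql('table_name', engine, index=False, if_exists='append', schema='public')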

When you use SQL Server, you'll get a SQL error when you insert a duplicate value into a table that has a primary key constraint. You can fix it by altering your table:

CREATE TABLE [dbo].[DeleteMe](
    [id] [uniqueidentifier] NOT NULL,
    [Value] [varchar](max) NULL,
    CONSTRAINT [PK_DeleteMe]
        PRIMARY KEY ([id] ASC)
        WITH (IGNORE_DUP_KEY = ON));  -- <-- add this option

Taken from https://dba.stackexchange.com/a/111771.
Now your df.to_sql() should work again.
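If the table already exists, the same option can likely be added by dropping and re-creating the primary key constraint; a sketch (the constraint and column names come from the example above, the connection details are assumptions):

from sqlalchemy import create_engine, text

engine = create_engine('mssql+pyodbc://user:password@dsn_name')  # assumed connection details
with engine.begin() as conn:
    conn.execute(text('ALTER TABLE [dbo].[DeleteMe] DROP CONSTRAINT [PK_DeleteMe]'))
    conn.execute(text(
        'ALTER TABLE [dbo].[DeleteMe] ADD CONSTRAINT [PK_DeleteMe] '
        'PRIMARY KEY ([id] ASC) WITH (IGNORE_DUP_KEY = ON)'
    ))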

The solutions by Jayen and Huy Tran helped me a lot, but they didn't work straight out of the box. The problem I faced with Jayen's code is that it requires the DataFrame columns to be exactly those of the database. This was not true in my case, as there were some DataFrame columns that I don't write to the database.
I modified the solution so that it takes the column names into account.

from sqlalchemy.dialects.mysql import insert
import itertools

def insertWithConflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data, whilst taking care of conflicts
    Used to handle duplicate key errors during database population
    This is my modification of the code snippet
    from https://stackoverflow.com/questions/30337394/pandas-to-sql-fails-on-duplicate-primary-key
    The help page from https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.values
    proved useful.

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted. It is a zip object.
        Its length is equal to the chunksize passed to df.to_sql()
    """
    vals = [dict(zip(z[0], z[1])) for z in zip(itertools.cycle([keys]), data_iter)]
    insertStmt = insert(sqltable.table).values(vals)
    doNothingStmt = insertStmt.on_duplicate_key_update(dict(insertStmt.inserted))
    conn.execute(doNothingStmt)
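For completeness, a usage sketch following the pattern of the other answers; the table name, engine, and chunksize are placeholders:

df.to_sql('my_table', con=engine, if_exists='append', chunksize=1000, method=insertWithConflicts)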

I faced the same issue and I adopted the solution provided by Huy Tran for a while, until my tables started to have schemas.
I had to improve his answer a bit and this is the final result:

from sqlalchemy import column, table
from sqlalchemy.dialects.postgresql import insert

def do_nothing_on_conflicts(sql_table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sql_table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    columns = []
    for c in keys:
        columns.append(column(c))

    if sql_table.schema:
        my_table = table(sql_table.name, *columns, schema=sql_table.schema)
    else:
        my_table = table(sql_table.name, *columns)

    insert_stmt = insert(my_table).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing()
    conn.execute(do_nothing_stmt)

How to use it:

history.to_sql('history', schema=schema, con=engine, method=do_nothing_on_conflicts)

The idea is the same as @Nfern's, but it uses a recursive function that divides the df in half at each iteration to skip the row(s) causing the integrity violation.

from sqlalchemy.exc import IntegrityError

def insert(df):
    try:
        # inserting into backup table
        df.to_sql("table", con=engine, if_exists='append', index=False, schema='schema')
    except IntegrityError:
        rows = df.shape[0]
        if rows > 1:
            df1 = df.iloc[:int(rows/2), :]
            df2 = df.iloc[int(rows/2):, :]
            insert(df1)
            insert(df2)
        else:
            print(f"{df} not inserted. Integrity violation, duplicate primary key/s")

Related

Looping through dataframe and checking rows before appending to database

Question: How do I append my dataframe to the database so that it checks whether stock_ticker exists, and only appends the rows where stock_ticker does not exist?
This is the process I followed:
Import the CSV file into a pandas dataframe
Assign column names to be the same as in the database
Send the dataframe to the database using the code below, but getting
sqlite3.IntegrityError: UNIQUE constraint failed: stocks.stock_ticker

conn = sqlite3.connect('stockmarket.db')
c = conn.cursor()
df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
conn.commit()
I looked at other IntegrityError cases but can't seem to find one that works with appending dataframes. I found and tried this, but all it does is not append anything:

try:
    conn = sqlite3.connect('stockmarket.db')
    c = conn.cursor()
    df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
    conn.commit()
except sqlite3.IntegrityError:
    print("Already in database")
I am not sure I am understanding the iterating thing correctly:
How to iterate over rows in a DataFrame in Pandas
So I tried this, but it just prints out "Already in database" for each of them, even though there are 4 new stock tickers.

for index, row in df.iterrows():
    try:
        conn = sqlite3.connect('stockmarket.db')
        c = conn.cursor()
        df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
        conn.commit()
    except sqlite3.IntegrityError:
        print("Already in database")
The database looks like this
any insight much appreciated :)
It looks like this happens because Pandas doesn't allow you to declare a proper ON CONFLICT policy when you append data that repeats a (unique) primary key or violates some other UNIQUE constraint. if_exists only refers to the table as a whole, not to each individual row.
I think you already came up with a pretty good answer, and maybe with a small modification it would work for you:
# After connecting
for i in range(len(df)):
    try:
        df[df.index == i].to_sql(name='stocks', con=conn, if_exists='append', index=False)
        conn.commit()
    except sqlite3.IntegrityError:
        pass

Now, this might be a problem if you actually want to replace a value when a newer one appears in your Pandas data and, let's say, you want to replace the old one that you have in the database. In that case, you might want to use the raw SQL command as a string and pass the Pandas values iteratively. For example:

insert_statement = """
    INSERT INTO stocks (stock_id,
                        stock_ticker,
                        {other columns})
    VALUES (%s, %s, {as many %s as columns})
    ON CONFLICT (stock_id) DO UPDATE
    SET {Define which values you will update on conflict}"""

And then you could run:

for i in range(len(df)):
    values = tuple(df.iloc[i])
    cursor.execute(insert_statement, values)
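Since the question uses sqlite3, note that its parameter placeholder is ? rather than %s, and the ON CONFLICT ... DO UPDATE form needs SQLite 3.24 or newer. A minimal sketch along those lines, with the column list reduced to two assumed columns:

insert_statement = """
    INSERT INTO stocks (stock_id, stock_ticker)
    VALUES (?, ?)
    ON CONFLICT (stock_id) DO UPDATE SET stock_ticker = excluded.stock_ticker
"""
for i in range(len(df)):
    # assumes df has exactly the columns (stock_id, stock_ticker) in this order
    c.execute(insert_statement, tuple(df.iloc[i]))
conn.commit()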

How to delete only the rows in postgres (but not drop the table) using the pandas read_sql_query method?

I wanted to perform an operation where I delete all the rows (but not drop the table) in postgres and then update it with new rows. I wanted to use the pd.read_sql_query() method from pandas:

qry = 'delete from "table_name"'
pd.read_sql_query(qry, connection, **kwargs)

But it was throwing the error 'ResourceClosedError: This result object does not return rows. It has been closed automatically.'
I could expect this because the method should return an empty dataframe, but it was not returning an empty dataframe, only the above error. Could you please help me resolve it?
I use MySQL, but the logic is the same:
Query 1: select all the ids from your table
Query 2: delete all those ids
As a result you have:

DELETE FROM table_name WHERE id IN (SELECT id FROM table_name)

This line does not return anything; it just deletes all rows with those ids. I recommend running the command with psycopg only - no pandas.
Then you need another query to get something back from the db, like:

pd.read_sql_query("SELECT * FROM table_name", connection, **kwargs)

Probably (I do not use pandas to read from the db) in this case you'll get an empty dataframe with the column names.
You could probably combine all the actions in the following way:

pd.read_sql_query('''DELETE FROM table_name WHERE id IN (SELECT id FROM table_name); SELECT * FROM table_name''', connection, **kwargs)

Please try and share your results.
You can follow these steps:
Check for row existence in the table first.
And then delete the rows.
Example code:

check_row_query = "select exists(select * from tbl_name limit 1)"
check_exist = pd.read_sql_query(check_row_query, con)
if check_exist.exists[0]:
    delete_query = 'DELETE FROM tbl_name WHERE condition(s)'
    con.execute(delete_query)  # delete rows using a sqlalchemy function
    print('Deleted all rows!')
else:
    pass
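As a hedged alternative for the original goal (clear the rows, then load the new data), the DELETE can also be issued directly on the engine instead of going through pd.read_sql_query; the table name, connection string, and new_df are assumptions:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost/db_name')  # assumed connection details
with engine.begin() as conn:
    conn.execute(text('DELETE FROM "table_name"'))  # removes all rows, keeps the table
new_df.to_sql('table_name', engine, if_exists='append', index=False)  # new_df holds the replacement rows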

Bulk Upsert with SQLAlchemy Postgres

I'm following the SQLAlchemy documentation here to write a bulk upsert statement with Postgres. For demonstration purposes, I have a simple table MyTable:
class MyTable(base):
    __tablename__ = 'mytable'
    id = Column(types.Integer, primary_key=True)
    test_value = Column(types.Text)
Creating a generic insert statement is simple enough:
from sqlalchemy.dialects import postgresql
values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)
The problem I run into is when I try to add the "on conflict" part of the upsert.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
Trying to execute this statement yields a ProgrammingError:
from sqlalchemy import create_engine
engine = create_engine('postgres://localhost/db_name')
engine.execute(update_stmt)
>>> ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'
I think my misunderstanding is in constructing the statement with the on_conflict_do_update method. Does anyone know how to construct this statement? I have looked at other questions on StackOverflow (e.g. here) but I can't seem to find a way to address the above error.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
index_elements should either be a list of strings or a list of column objects. So either [MyTable.id] or ['id'] (This is correct)
set_ should be a dictionary with column names as keys and valid sql update objects as values. You can reference values from the insert block using the excluded attribute. So to get the result you are hoping for here you would want set_={'test_value': insert_stmt.excluded.test_value} (The error you made is that data= in the example isn't a magic argument... it was the name of the column on their example table)
So, the whole thing would be
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_={'test_value': insert_stmt.excluded.test_value}
)
Of course, in a real-world example I usually want to change more than one column. In that case I would do something like...
update_columns = {col.name: col for col in insert_stmt.excluded if col.name not in ('id', 'datetime_created')}
update_statement = insert_stmt.on_conflict_do_update(index_elements=['id'], set_=update_columns)
(This example would overwrite every column except for the id and datetime_created columns)
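Putting the pieces together, a sketch of the full upsert under the same setup as the question; MyTable, the values, and the connection string are assumed from above:

from sqlalchemy import create_engine
from sqlalchemy.dialects import postgresql

engine = create_engine('postgresql://localhost/db_name')  # assumed, as in the question

values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)

# update every column from the insert except the conflict target
update_columns = {col.name: col for col in insert_stmt.excluded if col.name != 'id'}
upsert_stmt = insert_stmt.on_conflict_do_update(index_elements=[MyTable.id], set_=update_columns)

with engine.begin() as conn:
    conn.execute(upsert_stmt)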

Python Pandas to_sql, how to create a table with a primary key?

I would like to create a MySQL table with Pandas' to_sql function which has a primary key (it is usually kind of good to have a primary key in a mysql table) as so:
group_export.to_sql(con = db, name = config.table_group_export, if_exists = 'replace', flavor = 'mysql', index = False)
but this creates a table without any primary key, (or even without any index).
The documentation mentions the parameter 'index_label' which combined with the 'index' parameter could be used to create an index but doesn't mention any option for primary keys.
Documentation
Simply add the primary key after uploading the table with pandas.
group_export.to_sql(con=engine, name=example_table, if_exists='replace',
                    flavor='mysql', index=False)

with engine.connect() as con:
    con.execute('ALTER TABLE `example_table` ADD PRIMARY KEY (`ID_column`);')
Disclaimer: this answer is more experimental than practical, but maybe worth mentioning.
I found that the class pandas.io.sql.SQLTable has a named argument key, and if you assign it the name of the field then this field becomes the primary key.
Unfortunately you can't just pass this argument from the DataFrame.to_sql() function. To use it you should:
create a pandas.io.SQLDatabase instance

engine = sa.create_engine('postgresql:///somedb')
pandas_sql = pd.io.sql.pandasSQL_builder(engine, schema=None, flavor=None)

define a function analogous to pandas.io.SQLDatabase.to_sql() but with an additional **kwargs argument which is passed to the pandas.io.SQLTable object created inside it (I've just copied the original to_sql() method and added **kwargs):
def to_sql_k(self, frame, name, if_exists='fail', index=True,
             index_label=None, schema=None, chunksize=None, dtype=None, **kwargs):
    if dtype is not None:
        from sqlalchemy.types import to_instance, TypeEngine
        for col, my_type in dtype.items():
            if not isinstance(to_instance(my_type), TypeEngine):
                raise ValueError('The type of %s is not a SQLAlchemy '
                                 'type ' % col)

    table = pd.io.sql.SQLTable(name, self, frame=frame, index=index,
                               if_exists=if_exists, index_label=index_label,
                               schema=schema, dtype=dtype, **kwargs)
    table.create()
    table.insert(chunksize)

call this function with your SQLDatabase instance and the dataframe you want to save

to_sql_k(pandas_sql, df2save, 'tmp',
         index=True, index_label='id', keys='id', if_exists='replace')
And we get something like
CREATE TABLE public.tmp
(
    id bigint NOT NULL DEFAULT nextval('tmp_id_seq'::regclass),
    ...
)
in the database.
PS You can of course monkey-patch DataFrame, io.SQLDatabase and io.to_sql() functions to use this workaround for convenience.
As of pandas 0.15, at least for some flavors, you can use argument dtype to define a primary key column. You can even activate AUTOINCREMENT this way. For sqlite3, this would look like so:
import sqlite3
import pandas as pd

df = pd.DataFrame({'MyID': [1, 2, 3], 'Data': [3, 2, 6]})
with sqlite3.connect('foo.db') as con:
    df.to_sql('df', con=con, dtype={'MyID': 'INTEGER PRIMARY KEY AUTOINCREMENT'})
with engine.connect() as con:
    con.execute('ALTER TABLE for_import_ml ADD PRIMARY KEY ("ID");')
for_import_ml is a table name in the database.
Adding a slight variation to tomp's answer (I would comment but don't have enough reputation points).
I am using PGAdmin with Postgres (on Heroku) to check and it works.
automap_base from sqlalchemy.ext.automap (tableNamesDict is a dict with only the Pandas tables):
metadata = MetaData()
metadata.reflect(db.engine, only=tableNamesDict.values())
Base = automap_base(metadata=metadata)
Base.prepare()
Which would have worked perfectly, except for one problem, automap requires the tables to have a primary key. Ok, no problem, I'm sure Pandas to_sql has a way to indicate the primary key... nope. This is where it gets a little hacky:
for df in dfs.keys():
    cols = dfs[df].columns
    cols = [str(col) for col in cols if 'id' in col.lower()]
    schema = pd.io.sql.get_schema(dfs[df], df, con=db.engine, keys=cols)
    db.engine.execute('DROP TABLE ' + df + ';')
    db.engine.execute(schema)
    dfs[df].to_sql(df, con=db.engine, index=False, if_exists='append')
I iterate through the dict of DataFrames, get a list of the columns to use for the primary key (i.e. those containing id), use get_schema to create the empty tables, and then append the DataFrame to the table.
Now that you have the models, you can explicitly name and use them (i.e. User = Base.classes.user) with session.query or create a dict of all the classes with something like this:
alchemyClassDict = {}
for t in Base.classes.keys():
    alchemyClassDict[t] = Base.classes[t]
And query with:
res = db.session.query(alchemyClassDict['user']).first()

SQLite insert or ignore and return original _rowid_

I've spent some time reading the SQLite docs, various questions and answers here on Stack Overflow, and this thing, but have not come to a full answer.
I know that there is no way to do something like INSERT OR IGNORE INTO foo VALUES(...) with SQLite and get back the rowid of the original row, and that the closest to it would be INSERT OR REPLACE but that deletes the entire row and inserts a new row and thus gets a new rowid.
Example table:
CREATE TABLE foo(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    data TEXT
);
Right now I can do:
sql = sqlite3.connect(":memory:")
# create database
sql.execute("INSERT OR IGNORE INTO foo(data) VALUES(?);", ("Some text.", ))
the_id_of_the_row = None
for row in sql.execute("SELECT id FROM foo WHERE data = ?", ("Some text.", )):
    the_id_of_the_row = row[0]
But something ideal would look like:
the_id_of_the_row = sql.execute("INSERT OR IGNORE INTO foo(data) VALUES(?)", ("Some text", )).lastrowid
What is the best (read: most efficient) way to insert a row into a table and return the rowid, or to ignore the row if it already exists and just get the rowid? Efficiency is important because this will be happening quite often.
Is there a way to INSERT OR IGNORE and return the rowid of the row that the ignored row was compared to? This would be great, as it would be just as efficient as an insert.
The way that worked best for me was to insert or ignore the values, and then select the rowid in two separate steps. I used a unique constraint on the data column both to speed up selects and to avoid duplicates.

sql.execute("INSERT OR IGNORE INTO foo(data) VALUES(?);", ("Some text.", ))
last_row_id = sql.execute("SELECT id FROM foo WHERE data = ?;", ("Some text.", )).fetchone()[0]

The select statement isn't as slow as I thought it would be. This, it seems, is due to SQLite automatically creating an index for unique columns.
INSERT OR IGNORE is for situations where you do not care about the identity of the record; where the goal is only to have some record with that specific value.
If you want to know whether a new record is inserted or not, you have to check by hand:
the_id_of_the_row = None
for row in sql.execute("SELECT id FROM foo WHERE data = ?", ...):
    the_id_of_the_row = row[0]
if the_id_of_the_row is None:
    c = sql.cursor()
    c.execute("INSERT INTO foo(data) VALUES(?)", ...)
    the_id_of_the_row = c.lastrowid
As for efficiency: when SQLite checks the data column for duplicates, it has to do exactly the same query that you're doing with the SELECT, and once you've done that, the access path is in the cache, so performance should not be a problem. In any case, it is necessary to execute two separate INSERT/SELECT queries (in either order; both your code and mine work, but yours is simpler).
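Both steps can be wrapped in a small helper; a sketch assuming the foo table and the UNIQUE constraint on data from above:

def get_or_create_rowid(conn, value):
    # Insert if absent (relies on the UNIQUE constraint on data), then look the id up.
    conn.execute("INSERT OR IGNORE INTO foo(data) VALUES (?);", (value,))
    row = conn.execute("SELECT id FROM foo WHERE data = ?;", (value,)).fetchone()
    return row[0] if row is not None else None

the_id_of_the_row = get_or_create_rowid(sql, "Some text.")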
