Looping through dataframe and checking rows before appending to database - python

Question: How do I append my dataframe to the database so that it checks whether stock_ticker already exists, and only appends the rows where stock_ticker does not exist?
This is the process I followed:
Import the CSV file into a pandas dataframe
Assign column names to match those in the database
Send the dataframe to the database using the code below, but I get
sqlite3.IntegrityError: UNIQUE constraint failed: stocks.stock_ticker
conn = sqlite3.connect('stockmarket.db')
c = conn.cursor()
df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
conn.commit()
I looked at other IntegrityError cases but can't seem to find one that works with appending dataframes. I found and tried this, but all it does is not append anything:
try:
    conn = sqlite3.connect('stockmarket.db')
    c = conn.cursor()
    df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
    conn.commit()
except sqlite3.IntegrityError:
    print("Already in database")
I am not sure I am understanding the iterating correctly:
How to iterate over rows in a DataFrame in Pandas
So I tried this, but it just prints out "Already in database" for each of them, even though there are 4 new stock tickers.
for index, row in df.iterrows():
    try:
        conn = sqlite3.connect('stockmarket.db')
        c = conn.cursor()
        df.to_sql(name='stocks', con=conn, if_exists='append', index=False)
        conn.commit()
    except sqlite3.IntegrityError:
        print("Already in database")
The database looks like this
any insight much appreciated :)

It looks like this happens because pandas doesn't allow you to declare a proper ON CONFLICT policy for the case where you append data whose (unique) primary key already exists, or which violates some other UNIQUE constraint. if_exists only refers to the table as a whole, not to individual rows.
I think you already came up with a pretty good answer, and maybe with a small modification it would work for you:
# After connecting
for i in range(len(df)):
    try:
        df[df.index == i].to_sql(name='stocks', con=conn, if_exists='append', index=False)
        conn.commit()
    except sqlite3.IntegrityError:
        pass
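Alternatively, instead of catching the error row by row, you could filter out the tickers that are already in the database and append the remainder in one call. This is a sketch, not part of the approach above, and it assumes the stocks table and stock_ticker column from the question:
import pandas as pd
import sqlite3

conn = sqlite3.connect('stockmarket.db')
# tickers already present in the table
existing = pd.read_sql('SELECT stock_ticker FROM stocks', conn)['stock_ticker']
# keep only the rows whose ticker is not in the database yet, then append them in one go
new_rows = df[~df['stock_ticker'].isin(existing)]
new_rows.to_sql(name='stocks', con=conn, if_exists='append', index=False)
conn.commit()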
Now, this might be a problem if a newer value appears in your pandas data and you want it to replace the old one that is already in the database. In that case, you might want to use a raw SQL command as a string and pass the pandas values iteratively. For example:
insert_statement = """
    INSERT INTO stocks (stock_id,
                        stock_ticker,
                        {other columns})
    VALUES (%s, %s, {as many %s as columns})
    ON CONFLICT (stock_id) DO UPDATE
    SET {Define which values you will update on conflict}"""
And then you could run
for i in range(len(df)):
    values = tuple(df.iloc[i])
    cursor.execute(insert_statement, values)
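For the sqlite3 setup from the question specifically, the placeholder style is ? rather than %s, and ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer. A minimal sketch (the price column is illustrative; only stock_ticker comes from the question):
insert_statement = """
    INSERT INTO stocks (stock_ticker, price)
    VALUES (?, ?)
    ON CONFLICT (stock_ticker) DO UPDATE
    SET price = excluded.price"""
for i in range(len(df)):
    # assumes the dataframe columns are ordered as (stock_ticker, price)
    c.execute(insert_statement, tuple(df.iloc[i]))
conn.commit()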

Related

How do I update the column in postgresql python?

I am trying to add a new column to an existing table and want to populate that column in the database. The predictions come from a dataframe, but the code below is giving me an error. What am I doing wrong?
Code:
conn = create_connection()
cur = conn.cursor()
query = "ALTER TABLE STOCK_MARKET_FORECASTING ADD COLUMN predictions float"
cur.execute(query)
# Inserting predictions in database
def inserting_records(df):
    for i in range(0, len(df)):
        values = (df['Predicted_values_Hourly_Interval'][i])
        cur.execute("UPDATE STOCK_MARKET_FORECASTING SET (predictions) VALUES (%s)", values)
    conn.commit()
    print("Records created successfully")
inserting_records(predictions)
You're passing in a single value – cur.execute requires a tuple of values.
You're probably looking for INSERT, not UPDATE. UPDATE updates existing rows.
def inserting_records(df):
    series = df['Predicted_values_Hourly_Interval']
    for val in series:
        cur.execute("INSERT INTO STOCK_MARKET_FORECASTING (predictions) VALUES (%s)", (val, ))
    conn.commit()
might be what you're looking for.
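If the goal really is to fill the new column for rows that already exist, the UPDATE needs a WHERE clause that identifies which row each prediction belongs to. A sketch, assuming a hypothetical key column date_time that pairs each prediction with its row:
def updating_records(df):
    for _, row in df.iterrows():
        # date_time is a hypothetical key column; use whatever uniquely identifies the row
        cur.execute(
            "UPDATE STOCK_MARKET_FORECASTING SET predictions = %s WHERE date_time = %s",
            (row['Predicted_values_Hourly_Interval'], row['date_time']),
        )
    conn.commit()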

Value error inserting into Postgres table with psycopg2

I've been trying to use this piece of code:
# df is the dataframe
if len(df) > 0:
    df_columns = list(df)
    # create (col1,col2,...)
    columns = ",".join(df_columns)
    # create VALUES('%s', '%s',...) one '%s' per column
    values = "VALUES({})".format(",".join(["%s" for _ in df_columns]))
    # create INSERT INTO table (columns) VALUES('%s',...)
    insert_stmt = "INSERT INTO {} ({}) {}".format(table, columns, values)
    cur = conn.cursor()
    cur = db_conn.cursor()
    psycopg2.extras.execute_batch(cur, insert_stmt, df.values)
    conn.commit()
    cur.close()
So I can connect to the Postgres DB and insert values from a df.
I get these 2 errors for this code:
LINE 1: INSERT INTO mrr.shipments (mainFreight_freight_motherVesselD...
psycopg2.errors.UndefinedColumn: column "mainfreight_freight_mothervesseldepartdatetime" of relation "shipments" does not exist
For some reason, the columns can't get the values properly.
What can I do to fix it?
You should not do your own string interpolation; let psycopg2 handle it. From the docs:
Warning Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.
Since you also have dynamic column names, you should use psycopg2.sql to create the statement and then use the standard method of passing query parameters to psycopg2 instead of using format.
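A minimal sketch of that approach under the names from the question (mrr.shipments comes from the error message; conn and df as above). sql.Identifier quotes the column names, which likely matters here: unquoted identifiers are folded to lower case by Postgres, which would explain why mainFreight_freight_motherVesselDepartDateTime was reported as missing:
from psycopg2 import sql
import psycopg2.extras

if len(df) > 0:
    df_columns = list(df)
    insert_stmt = sql.SQL("INSERT INTO {} ({}) VALUES ({})").format(
        sql.Identifier("mrr", "shipments"),                       # schema-qualified name (psycopg2 >= 2.8)
        sql.SQL(", ").join(map(sql.Identifier, df_columns)),      # quoted column identifiers
        sql.SQL(", ").join(sql.Placeholder() * len(df_columns)),  # one placeholder per column
    )
    insert_query = insert_stmt.as_string(conn)  # render the composed statement to a plain string
    cur = conn.cursor()
    psycopg2.extras.execute_batch(cur, insert_query, df.values.tolist())
    conn.commit()
    cur.close()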

pandas to_sql insert ignore

I want to incrementally keep adding dataframe rows to a MySQL DB while avoiding any duplicate entries going into MySQL.
I am currently doing this by looping through every row using df.apply() and calling MySQL INSERT IGNORE to add unique rows to the database. But using pandas.apply is very slow (45 secs for 10k rows). I want to achieve this using the pandas.to_sql() method, which takes 0.5 secs to push 10k entries into the DB but doesn't support ignoring duplicates in append mode.
Is there an efficient and fast way to achieve this?
Input CSV
Date,Open,High,Low,Close,Volume
1994-01-03,111.7,112.75,111.55,112.65,0
1994-01-04,112.68,113.47,112.2,112.65,0
1994-01-05,112.6,113.63,112.3,113.0,0
1994-01-06,113.02,113.43,112.25,112.62,0
1994-01-07,112.55,112.8,111.5,111.88,0
1994-01-10,111.8,112.43,111.35,112.25,0
1994-01-11,112.18,112.88,112.05,112.4,0
1994-01-12,112.38,112.82,111.95,112.28,0
Code
nifty_data.to_sql(name='eod_data', con=engine, if_exists = 'append', index=False) # option-1
nifty_data.apply(addToDb, axis=1) # option-2
def addToDb(row):
    sql = "INSERT IGNORE INTO eod_data (date, open, high, low, close, volume) VALUES (%s,%s,%s,%s,%s,%s)"
    val = (row['Date'], row['Open'], row['High'], row['Low'], row['Close'], row['Volume'])
    mycursor.execute(sql, val)
    mydb.commit()
option-1: doesn't allow insert ignore (~0.5 secs)
option-2: has to loop through and is very slow (~45 secs)
You can create a temporary table:
nifty_data.to_sql(name='temporary_table', con=engine, if_exists = 'append', index=False)
And then run an INSERT IGNORE statement from that:
with engine.begin() as cnx:
    insert_sql = 'INSERT IGNORE INTO eod_data (SELECT * FROM temporary_table)'
    cnx.execute(insert_sql)
Just make sure the column orders are the same, or you might have to declare them manually.
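A sketch of the whole round trip under those assumptions. The text() wrapper (needed on newer SQLAlchemy versions) and the DROP at the end (cleanup so the staging table doesn't accumulate rows) are additions, not part of the original answer:
from sqlalchemy import text

# stage the new rows, then let MySQL discard duplicates on insert
nifty_data.to_sql(name='temporary_table', con=engine, if_exists='append', index=False)
with engine.begin() as cnx:
    cnx.execute(text('INSERT IGNORE INTO eod_data (SELECT * FROM temporary_table)'))
    cnx.execute(text('DROP TABLE temporary_table'))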

Pandas to_sql fails on duplicate primary key

I'd like to append to an existing table, using pandas df.to_sql() function.
I set if_exists='append', but my table has primary keys.
I'd like to do the equivalent of insert ignore when trying to append to the existing table, so I would avoid a duplicate entry error.
Is this possible with pandas, or do I need to write an explicit query?
There is unfortunately no option to specify "INSERT IGNORE". This is how I got around that limitation, inserting only the rows that were not duplicates into the database (the dataframe name is df):
from sqlalchemy.exc import IntegrityError

for i in range(len(df)):
    try:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
    except IntegrityError:
        pass  # or any other action
You can do this with the method parameter of to_sql:
from sqlalchemy.dialects.mysql import insert

def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(insert_stmt.inserted)
    conn.execute(on_duplicate_key_stmt)

df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insert_on_duplicate)
For older versions of SQLAlchemy, you need to pass a dict to on_duplicate_key_update, i.e. on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted)).
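For reference, the full callable with that one change (a sketch of the variant just described, otherwise identical to the code above):
def insert_on_duplicate(table, conn, keys, data_iter):
    insert_stmt = insert(table.table).values(list(data_iter))
    # older SQLAlchemy releases want a plain dict here
    on_duplicate_key_stmt = insert_stmt.on_duplicate_key_update(dict(insert_stmt.inserted))
    conn.execute(on_duplicate_key_stmt)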
Please note that if_exists='append' relates to the existence of the table and what to do in case the table does not exist.
if_exists does not relate to the content of the table.
see the doc here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
Pandas has no option for it currently, but here is the Github issue. If you need this feature too, just upvote for it.
The for-loop method above slows things down significantly. There's a method parameter you can pass to pandas.to_sql to customize the SQL used for the insert:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html#pandas.DataFrame.to_sql
The code below should work for Postgres and do nothing if there's a conflict with the primary key "unique_code". Change the insert dialect for your DB.
def insert_do_nothing_on_conflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    from sqlalchemy.dialects.postgresql import insert
    from sqlalchemy import table, column

    columns = []
    for c in keys:
        columns.append(column(c))

    if sqltable.schema:
        table_name = '{}.{}'.format(sqltable.schema, sqltable.name)
    else:
        table_name = sqltable.name

    mytable = table(table_name, *columns)
    insert_stmt = insert(mytable).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing(index_elements=['unique_code'])
    conn.execute(do_nothing_stmt)

df.to_sql('mytable', con=sql_engine, method=insert_do_nothing_on_conflicts)
Pandas doesn't support editing the actual SQL syntax of the .to_sql method, so you might be out of luck. There are some experimental programmatic workarounds (say, read the DataFrame into a SQLAlchemy object with CALCHIPAN and use SQLAlchemy for the transaction), but you may be better served by writing your DataFrame to a CSV and loading it with an explicit MySQL function.
CALCHIPAN repo: https://bitbucket.org/zzzeek/calchipan/
I had trouble where I was still getting the IntegrityError.
Strange, but I just took the above and worked it backwards:
for i, row in df.iterrows():
    sql = "SELECT * FROM `Table_Name` WHERE `key` = '{}'".format(row.Key)
    found = pd.read_sql(sql, con=Engine)
    if len(found) == 0:
        df.iloc[i:i+1].to_sql(name="Table_Name", if_exists='append', con=Engine)
In my case, I was trying to insert new data into an empty table, but some of the rows were duplicated, which is almost the same issue as here. I could have fetched the existing data, merged it with the new data, and continued from there, but that is not optimal and may only work for small data, not huge tables.
Since pandas does not provide any kind of handling for this situation right now, I made my own workaround. Not sure whether it will work for you, but I decided to take control of my data first instead of trusting to luck: I remove duplicates before calling .to_sql, so if any error happens I know more about my data and what is going on:
import pandas as pd

def write_to_table(table_name, data):
    # Sort by price first, so drop_duplicates keeps only the lowest price per key
    data.sort(key=lambda row: row['price'])
    df = pd.DataFrame(data)
    df.drop_duplicates(subset=['id_key'], keep='first', inplace=True)
    df.to_sql(table_name, engine, index=False, if_exists='append', schema='public')
So in my case, I wanted to keep the lowest price per key (by the way, I was passing a list of dicts as data), and for that I did the sorting first. It is not strictly necessary, but it is an example of what I mean by controlling the data I want to keep.
I hope this will help someone in almost the same situation as mine.
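An illustrative call (the field names follow the snippet above; the table name and values are made up, and engine is assumed to be defined):
rows = [
    {"id_key": "AAPL", "price": 182.5},
    {"id_key": "AAPL", "price": 181.9},  # duplicate key; the lower price is kept
    {"id_key": "MSFT", "price": 330.1},
]
write_to_table("stocks", rows)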
When you use SQL Server you'll get a SQL error when you enter a duplicate value into a table that has a primary key constraint. You can fix it by altering your table:
CREATE TABLE [dbo].[DeleteMe](
    [id] [uniqueidentifier] NOT NULL,
    [Value] [varchar](max) NULL,
    CONSTRAINT [PK_DeleteMe]
        PRIMARY KEY ([id] ASC)
        WITH (IGNORE_DUP_KEY = ON));  -- <-- add this option
Taken from https://dba.stackexchange.com/a/111771.
Now your df.to_sql() should work again.
The solutions by Jayen and Huy Tran helped me a lot, but they didn't work straight out of the box. The problem I faced with Jayen's code is that it requires the DataFrame columns to be exactly the same as those of the database. This was not true in my case, as there were some DataFrame columns that I don't write to the database.
I modified the solution so that it considers the column names.
from sqlalchemy.dialects.mysql import insert
import itertools

def insertWithConflicts(sqltable, conn, keys, data_iter):
    """
    Execute SQL statement inserting data, whilst taking care of conflicts
    Used to handle duplicate key errors during database population

    This is my modification of the code snippet
    from https://stackoverflow.com/questions/30337394/pandas-to-sql-fails-on-duplicate-primary-key
    The help page from https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.values
    proved useful.

    Parameters
    ----------
    sqltable : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted. It is a zip object.
        Its length is equal to the chunksize passed in df.to_sql()
    """
    vals = [dict(zip(z[0], z[1])) for z in zip(itertools.cycle([keys]), data_iter)]
    insertStmt = insert(sqltable.table).values(vals)
    doNothingStmt = insertStmt.on_duplicate_key_update(dict(insertStmt.inserted))
    conn.execute(doNothingStmt)
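Illustrative usage (the table name, connection and chunk size are placeholders, following the earlier answer):
df.to_sql('trades', dbConnection, if_exists='append', chunksize=4096, method=insertWithConflicts)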
I faced the same issue and I adopted the solution provided by #Huy Tran for a while, until my tables started to have schemas.
I had to improve his answer a bit and this is the final result:
from sqlalchemy import table, column
from sqlalchemy.dialects.postgresql import insert  # PostgreSQL dialect, as in the answer this modifies

def do_nothing_on_conflicts(sql_table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    sql_table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    columns = []
    for c in keys:
        columns.append(column(c))

    if sql_table.schema:
        my_table = table(sql_table.name, *columns, schema=sql_table.schema)
        # table_name = '{}.{}'.format(sql_table.schema, sql_table.name)
    else:
        my_table = table(sql_table.name, *columns)
        # table_name = sql_table.name
        # my_table = table(table_name, *columns)

    insert_stmt = insert(my_table).values(list(data_iter))
    do_nothing_stmt = insert_stmt.on_conflict_do_nothing()
    conn.execute(do_nothing_stmt)
How to use it:
history.to_sql('history', schema=schema, con=engine, method=do_nothing_on_conflicts)
The idea is the same as #Nfern's, but it uses a recursive function that divides the df in half at each iteration to skip the row(s) causing the integrity violation.
def insert(df):
    try:
        # inserting into backup table
        df.to_sql("table", con=engine, if_exists='append', index=False, schema='schema')
    except:
        rows = df.shape[0]
        if rows > 1:
            df1 = df.iloc[:int(rows/2), :]
            df2 = df.iloc[int(rows/2):, :]
            insert(df1)
            insert(df2)
        else:
            print(f"{df} not inserted. Integrity violation, duplicate primary key/s")

Python Write a number of columns to sqlite

I am working on a small project and I have created a helper function that will write a string of comma-separated values to a database as if they were values. I realise there are implications to doing it this way, but this is small and I need to get it going until I can do better.
def db_insert(table, data):
    """
    insert data into a table, the data should be a tuple
    matching the number of columns with null for any columns that
    have no value. False is returned on any error, error is logged to
    database log file."""
    if os.path.exists(database_name):
        con = lite.connect(database_name)
    else:
        error = "Database file does not exist."
        to_log(error)
        return False
    if con:
        try:
            cur = con.cursor()
            data = str(data)
            cur.execute('insert into %s values(%s)') % (table, data)
            con.commit()
            con.close()
        except Exception, e:
            pre_error = "Database insert raised an error;\n"
            thrown_error = pre_error + str(e)
            to_log(thrown_error)
        finally:
            con.close()
    else:
        error = "No connection to database"
        to_log(error)
        return False
database_name etc... are defined elsewhere in the script.
Barring any other obvious glaring errors, what I need to be able to do (by this method, or some other if there are suggestions) is allow somebody to create a list where each value represents a column value, as I will not know how many columns are being populated.
So somebody uses it as follows:
data = ["null", "foo","bar"]
db_insert("foo_table", data)
This inserts that data into the table named foo_table. It is up to the user to know how many columns are in the table and supply the correct number of elements to satisfy that.
I realise that it is better to use sqlite parameters, but there are two problems.
First, you cannot use a parameter to specify the table, only the values.
Second, you need to know how many values you are supplying. You have to do
cur.execute('insert into table values(?,?,?)', (val1, val2, val3))
so you need to be able to specify the three ?'s.
I am trying to write a general function that allows me to take an arbitrary number of values and insert them into an arbitrary table name.
Now, it was working relatively OK until I tried to pass in 'null' as a value.
One of the columns is the primary key and has an autoincrement. So passing in null will allow it to autoincrement. There will also be other instances where nulls would be required.
The problem is that Python keeps wrapping my null in single quotes, which sqlite complains about as a datatype mismatch, since the primary key is an integer field. If I try passing None as the Python null equivalent, the same thing happens.
So two problems.
How to insert an arbitrary number of columns.
How to pass a null.
Thank you for all your help on this and past questions.
Sorry, this looks like a duplicate of this
Using Python quick insert many columns into Sqlite\Mysql
My apologies, I did not find it until after I wrote this.
It results in the following, which works:
def db_insert(table, data):
    """
    insert data into a table, the data should be a tuple
    matching the number of columns with null for any columns that
    have no value. False is returned on any error, error is logged to
    database log file."""
    if os.path.exists(database_name):
        con = lite.connect(database_name)
    else:
        error = "Database file does not exist."
        to_log(error)
        return False
    if con:
        try:
            tuple_len = len(data)
            holders = ','.join('?' * tuple_len)
            sql_query = 'insert into %s values({0})'.format(holders) % table
            cur = con.cursor()
            #data = str(data)
            #cur.execute('insert into readings values(%s)') % table
            cur.execute(sql_query, data)
            con.commit()
            con.close()
        except Exception, e:
            pre_error = "Database insert raised an error;\n"
            thrown_error = pre_error + str(e)
            to_log(thrown_error)
        finally:
            con.close()
    else:
        error = "No connection to database"
        to_log(error)
        return False
The second problem is a "works for me": when I pass None as a value, it is correctly converted back and forth to and from the db.
import sqlite3
conn = sqlite3.connect("test.sqlite")
data = ("a", None)
conn.execute('INSERT INTO "foo" VALUES(' + ','.join("?" * len(data)) + ')', data)
list(conn.execute("SELECT * FROM foo")) # -> [("a", None)]
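For completeness, a minimal Python 3 sketch of the same idea (database_name is a placeholder and the table is assumed to already exist): build one ? per value and let sqlite3 bind them, passing None wherever SQL NULL is wanted.
import sqlite3

database_name = "test.sqlite"  # placeholder path

def db_insert(table, data):
    placeholders = ",".join("?" * len(data))
    query = "insert into {} values({})".format(table, placeholders)
    with sqlite3.connect(database_name) as con:  # the context manager commits on success
        con.execute(query, data)  # None values are stored as SQL NULL

db_insert("foo_table", (None, "foo", "bar"))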
