I want to "insert ignore" an entire pandas dataframe into mysql. Is there a way to do this without looping over the rows?
In DataFrame.to_sql I only see the if_exists='append' option, but will this still continue past duplicate unique keys?
Consider using a temp table (with the exact structure of the final table) that is always replaced by pandas, then run the INSERT IGNORE in a cursor call:
# Stage the dataframe into a temp table, replacing its contents on every run
dataframe.to_sql('myTempTable', con, if_exists='replace')

# Copy into the final table, skipping rows that hit a duplicate unique key
cur = con.cursor()
cur.execute("INSERT IGNORE INTO myFinalTable SELECT * FROM myTempTable")
con.commit()
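A fuller sketch of the same idea with a SQLAlchemy engine (which to_sql expects for MySQL); the connection string and index=False are assumptions, and the table names are the placeholders used above:
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pwd@host/db")  # hypothetical credentials

# Replace the staging table with the dataframe's contents each run
dataframe.to_sql('myTempTable', engine, if_exists='replace', index=False)

# Copy into the final table; duplicate unique keys are silently skipped
with engine.begin() as conn:  # begin() commits on exit
    conn.execute(text("INSERT IGNORE INTO myFinalTable SELECT * FROM myTempTable"))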
There is no way to do this in pandas as of the current version (0.20.3).
The if_exists option applies only to the table (not to the rows), as stated in the documentation:
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
Via Looping
This will slow down the process, as you are inserting one row at a time:
from sqlalchemy.exc import IntegrityError

for x in range(data_frame.shape[0]):
    try:
        data_frame.iloc[x:x+1].to_sql(con=sql_engine, name="table_name", if_exists='append')
    except IntegrityError:
        # Your code to handle duplicates
        pass
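On newer pandas versions (0.24 and later), to_sql also accepts a callable through its method argument, so the ignore can happen in a single bulk statement instead of a loop. A minimal sketch, assuming a MySQL table reached through a SQLAlchemy engine (insert_ignore is a helper name introduced here):
from sqlalchemy.dialects.mysql import insert as mysql_insert

def insert_ignore(pd_table, conn, keys, data_iter):
    # Build one INSERT IGNORE ... VALUES statement for the whole chunk
    stmt = mysql_insert(pd_table.table).prefix_with("IGNORE")
    conn.execute(stmt, [dict(zip(keys, row)) for row in data_iter])

data_frame.to_sql(name="table_name", con=sql_engine, if_exists='append', index=False, method=insert_ignore)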
Related
I am using pandas' to_sql method to insert data into a mysql table. The mysql table already exists and I'd like to avoid inserting duplicate rows.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
Is there a way to do this in python?
# mysql connection
import pandas as pd
import pymysql
from sqlalchemy import create_engine
user = 'user1'
pwd = 'xxxx'
host = 'aa1.us-west-1.rds.amazonaws.com'
port = 3306
database = 'main'
engine = create_engine("mysql+pymysql://{}:{}@{}:{}/{}".format(user, pwd, host, port, database))
con = engine.connect()
df.to_sql(name="dfx", con=con, if_exists='append')
con.close()
Are there any workarounds if there isn't a straightforward way to do this?
It sounds like you want to do an "upsert" (insert or update). Pangres is a useful package that lets you do an upsert from a pandas DataFrame. If you don't want to update a row when it already exists, that is also an option: set if_row_exists to 'ignore'.
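A minimal sketch with pangres, assuming the target table's primary key is the DataFrame index ('id' is a hypothetical key column here, and the name of the connection argument may differ between pangres versions):
from pangres import upsert
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user1:xxxx@aa1.us-west-1.rds.amazonaws.com:3306/main")

# pangres uses the DataFrame index as the table's primary key
df_keyed = df.set_index("id")  # 'id' is a hypothetical primary-key column

upsert(con=engine, df=df_keyed, table_name="dfx", if_row_exists="ignore")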
I have never heard of 'upsert' before today, but it sounds interesting. You could certainly delete dupes after the data is loaded into your table.
WITH a AS
(
    SELECT Firstname,
           ROW_NUMBER() OVER (PARTITION BY Firstname, empID ORDER BY Firstname) AS duplicateRecCount
    FROM dbo.tblEmployee
)
-- Now delete duplicate records
DELETE FROM a
WHERE duplicateRecCount > 1
That will work fine, unless you have billions of rows.
I have stored queries (like select * from table) in a Snowflake table and want to execute each query row-by-row and generate a CSV file for each query. Below is the python code where I am able to print the queries but don't know how to execute each query and create a CSV file:
I believe I am close to what I want to achieve. I would really appreciate if someone can help over here.
import pyodbc
import pandas as pd
import snowflake.connector
import os

conn = snowflake.connector.connect(
    user = 'User',
    password = 'Pass',
    account = 'Account',
    autocommit = True
)

try:
    cursor = conn.cursor()
    query = 'Select Column from Table;'  # This will return two select statements
    output = cursor.execute(query)
    for i in cursor:
        print(i)
    cursor.close()
    del cursor
    conn.close()
except Exception as e:
    print(e)
You're pretty close. You just need to execute the queries instead of printing them, and put the data into a file.
I haven't used pandas much myself, but this is the code that Snowflake documentation provides for running a query and putting it into a pandas dataframe.
cursor = conn.cursor()
query = 'Select Column, row_number() over(order by Column) as Rownum from Table;'
cursor.execute(query)
resultset = cursor.fetchall()

for result in resultset:
    # result[0] is the stored query text, result[1] is its row number
    cursor.execute(result[0])
    df = cursor.fetch_pandas_all()
    df.to_csv(r'C:\Users\...<your filename here>' + str(result[1]) + '.csv', index=False)
It may take some fiddling, but here are a couple of links for reference:
Snowflake docs on creating a pandas dataframe
Exporting a pandas dataframe to csv
Update: added an example of a way to create separate files for each record. This just adds a distinct number to each row of your SQL output so you can use that number as part of the filename. Ultimately, you need some logic in your loop to create a filename, whether that's adding a random number, a timestamp, or whatever; that can come from the SQL or from the Python side, up to you. I'd probably add a filename column to your table, but I don't know if that makes sense for you.
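For example, a timestamp-based filename built on the Python side could look roughly like this (the output directory is hypothetical; adjust it to your own path):
from datetime import datetime

for result in resultset:
    cursor.execute(result[0])
    df = cursor.fetch_pandas_all()
    # A timestamp down to microseconds keeps the filenames unique per query
    stamp = datetime.now().strftime('%Y%m%d_%H%M%S_%f')
    df.to_csv(r'C:\Users\output_' + stamp + '.csv', index=False)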
I am trying to bulk insert pandas dataframe data into a PostgreSQL table. The pandas dataframe has 35 columns and the PostgreSQL table has 45 columns. I am choosing 12 matching columns from the dataframe and inserting them into the PostgreSQL table. For this I am using the following code snippet:
df = pd.read_excel(raw_file_path, sheet_name='Sheet1', usecols=col_names)  # col_names = list of desired columns (12 columns)
cols = ','.join(list(df.columns))
tuples = [tuple(x) for x in df.to_numpy()]
query = "INSERT INTO {0}.{1} ({2}) VALUES (%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s,%%s);".format(schema_name, table_name, cols)
curr = conn.cursor()
try:
    curr.executemany(query, tuples)
    conn.commit()
    curr.close()
except (Exception, psycopg2.DatabaseError) as error:
    print("Error: %s" % error)
    conn.rollback()
    curr.close()
    return 1
finally:
    if conn is not None:
        conn.close()
        print('Database connection closed.')
When running I am getting this error:
SyntaxError: syntax error at or near "%"
LINE 1: ...it,purchase_group,indenter_name,wbs_code) VALUES (%s,%s,%s,%...
Even if I use ? in place of %%s I am still getting this error.
Can anybody throw some light on this?
P.S. I am using Postgresql version 10.
What you're doing now is actually inserting a pandas dataframe one row at a time. Even if this worked, it would be an extremely slow operation. At the same time, if the data might contain strings, just placing them into a query string like this leaves you open to SQL injection.
I wouldn't reinvent the wheel. Pandas has a to_sql function that takes a dataframe and converts it into a query for you. You can specify what to do on conflict (when a row already exists).
It works with SQLAlchemy, which has excellent support for PostgreSQL. And even though it might be a new package to explore and install, you're not required to use it anywhere else to make this work.
from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost:5432/mydatabase')
pd.read_excel(
raw_file_path,
sheet_name = 'Sheet1',
usecols=col_names # <---col_names = list of desired columns (12 columns)
).to_sql(
schema=schema_name,
name=table_name,
con=engine,
method='multi' # this makes it do all inserts in one go
)
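If you'd rather stay on plain psycopg2, a single bulk insert is also possible with execute_values; a minimal sketch, reusing conn, schema_name, table_name and df from the question:
from psycopg2.extras import execute_values

cols = ','.join(df.columns)
# A single %s placeholder: execute_values expands it into all of the row tuples
query = "INSERT INTO {0}.{1} ({2}) VALUES %s".format(schema_name, table_name, cols)

with conn.cursor() as curr:
    execute_values(curr, query, [tuple(x) for x in df.to_numpy()])
conn.commit()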
I want to import data from the file "save.csv" into my Actian PSQL database table "new_table", but I get this error:
ProgrammingError: ('42000', "[42000] [PSQL][ODBC Client Interface][LNA][PSQL][SQL Engine]Syntax Error: INSERT INTO 'new_table'<< ??? >> ('name','address','city') VALUES (%s,%s,%s) (0) (SQLPrepare)")
Below is my code:
connection = 'Driver={Pervasive ODBC Interface};server=localhost;DBQ=DEMODATA'
db = pyodbc.connect(connection)
c = db.cursor()

# create table i.e. new_table
csv = pd.read_csv(r"C:\Users\user\Desktop\save.csv")
for row in csv.iterrows():
    insert_command = """INSERT INTO new_table(name,address,city) VALUES (row['name'],row['address'],row['city'])"""
    c.execute(insert_command)
c.commit()
Pandas has a built-in function, DataFrame.to_sql(), that empties a pandas DataFrame into a SQL database. This might be what you are looking for. Using it you don't have to manually insert one row at a time; you can insert the entire dataframe at once.
If you want to keep using your method, the issue might be that the table "new_table" hasn't been created yet in the database, so you would first need something like this:
CREATE TABLE new_table
(
Name [nvarchar](100) NULL,
Address [nvarchar](100) NULL,
City [nvarchar](100) NULL
)
EDIT:
You can use to_sql() like this on tables that already exist in the database:
df.to_sql(
    "new_table",
    schema="name_of_the_schema",
    con=c.session.connection(),
    if_exists="append",  # <--- This will append to an already existing table
    chunksize=10000,
    index=False,
)
I have tried the same; in my case the table is already created, and I just want to insert each row from the pandas dataframe into the Actian PSQL database.
Has anyone experienced this before?
I have a table with "int" and "varchar" columns - a report schedule table.
I am trying to import an excel file with ".xls" extension to this table using a python program. I am using pandas to_sql to read in 1 row of data.
Data imported is 1 row 11 columns.
Import works successfully but after the import I noticed that the datatypes in the original table have now been altered from:
int --> bigint
char(1) --> varchar(max)
varchar(30) --> varchar(max)
Any idea how I can prevent this? The switch in datatypes is causing issues in downstream routines.
import urllib.parse
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_excel(schedule_file, sheet_name='Schedule')
params = urllib.parse.quote_plus(r'DRIVER={SQL Server};SERVER=<<IP>>;DATABASE=<<DB>>;UID=<<UDI>>;PWD=<<PWD>>')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
table_name = 'REPORT_SCHEDULE'
df.to_sql(name=table_name, con=engine, if_exists='replace', index=False)
TIA
Consider using the dtype argument of pandas.DataFrame.to_sql where you pass a dictionary of SQLAlchemy types to named columns:
import sqlalchemy
...
data.to_sql(name=table_name, con=engine, if_exists='replace', index=False,
            dtype={'name_of_datefld': sqlalchemy.types.DateTime(),
                   'name_of_intfld': sqlalchemy.types.INTEGER(),
                   'name_of_strfld': sqlalchemy.types.VARCHAR(length=30),
                   'name_of_floatfld': sqlalchemy.types.Float(precision=3, asdecimal=True),
                   'name_of_booleanfld': sqlalchemy.types.Boolean()})
I think this has more to do with how pandas handles the table if it exists. The "replace" value of the if_exists argument tells pandas to drop your table and recreate it. But when re-creating your table, it will do so on its own terms (based on the data stored in that particular DataFrame).
While providing column datatypes will work, doing it for every such case might be cumbersome. So I would rather truncate the table in a separate statement and then just append data to it, like so:
Instead of:
df.to_sql(name=table_name, con=engine, if_exists='replace',index=False)
I'd do:
with engine.connect() as con:
con.execute("TRUNCATE TABLE %s" % table_name)
df.to_sql(name=table_name, con=engine, if_exists='append',index=False)
The truncate statement basically drops and recreates your table too, but it's done internally by the database, and the table gets recreated with the same definition.
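One caveat if you happen to be on a newer SQLAlchemy (1.4/2.0): raw SQL strings need to be wrapped in text() before execution, so the truncate-then-append would look roughly like this:
from sqlalchemy import text

with engine.begin() as con:  # begin() commits the TRUNCATE when the block exits
    con.execute(text("TRUNCATE TABLE %s" % table_name))
df.to_sql(name=table_name, con=engine, if_exists='append', index=False)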