I am using pandas' to_sql method to insert data into a mysql table. The mysql table already exists and I'd like to avoid inserting duplicate rows.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
Is there a way to do this in python?
# mysql connection
import pandas as pd
import pymysql
from sqlalchemy import create_engine
user = 'user1'
pwd = 'xxxx'
host = 'aa1.us-west-1.rds.amazonaws.com'
port = 3306
database = 'main'
engine = create_engine("mysql+pymysql://{}:{}#{}/{}".format(user,pwd,host,database))
con = engine.connect()
df.to_sql(name="dfx", con=con, if_exists = 'append')
con.close()
Are there any workarounds if there isn't a straightforward way to do this?
It sounds like you want to do an "upsert" (insert or update). Pangres is a useful package that lets you upsert a pandas DataFrame. If you don't want to update a row when it already exists, that is also an option: set if_row_exists to 'ignore'.
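For reference, a minimal sketch of what that could look like with pangres, assuming `df` has a column (here a hypothetical `id`) that can serve as the table's primary key, and `engine` is the SQLAlchemy engine from the question:
from pangres import upsert

# pangres uses the DataFrame index as the primary key, so set it explicitly.
df_keyed = df.set_index("id")  # "id" is a hypothetical key column

upsert(
    con=engine,
    df=df_keyed,
    table_name="dfx",
    if_row_exists="ignore",  # skip rows whose key already exists; use "update" to overwrite them
)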
I have never heard of 'upsert' before today, but it sounds interesting. You could certainly delete dupes after the data is loaded into your table.
WITH a AS
(
    SELECT Firstname,
           ROW_NUMBER() OVER (PARTITION BY Firstname, empID ORDER BY Firstname) AS duplicateRecCount
    FROM dbo.tblEmployee
)
--Now delete duplicate records
DELETE FROM a
WHERE duplicateRecCount > 1
That will work fine, unless you have billions of rows.
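Note that the CTE above is SQL Server style; MySQL will not let you DELETE from a CTE directly. A rough sketch of the same idea for MySQL, run through the SQLAlchemy engine from the question and assuming the table has an auto-increment `id` column to decide which duplicate to keep:
from sqlalchemy import text

# Self-join delete: for each (Firstname, empID) group, keep the row with the lowest id.
dedupe_sql = text("""
    DELETE t1
    FROM tblEmployee AS t1
    JOIN tblEmployee AS t2
      ON t1.Firstname = t2.Firstname
     AND t1.empID     = t2.empID
     AND t1.id        > t2.id
""")

with engine.begin() as conn:  # begin() commits automatically on success
    conn.execute(dedupe_sql)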
Here is my code
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine("connection string")
conn_obj = engine.connect()
my_df = pd.DataFrame({'col1': ['29199'], 'date_created': ['2022-06-29 17:15:49.776867']})
my_df.to_sql('SomeSQLTable', conn_obj, if_exists='append', index = False)
I also created SomeSQLTable with script:
CREATE TABLE SomeSQLTable(
col1 nvarchar(90),
date_created datetime2)
GO
Everything runs fine, but no records are inserted into the SQL table and no errors are displayed. I am not sure how to troubleshoot. conn_obj works fine; I was able to pull data with it.
I don't think this is exactly an answer, but I don't have the privileges to comment right now.
First of all, DataFrame.to_sql() returns the number of rows affected by the operation (in recent pandas versions); can you please check that?
Secondly, you are defining the data types in the table creation, so it could be a problem of casting the data types. I never create the table through SQL, since to_sql() can create it if needed.
Thirdly, please check the table name; Pascal case can be an issue in some databases.
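For the first point, a small sketch of how you might check that return value (pandas 2.0+ returns the number of rows affected; older versions return None), while letting pandas create the table itself:
rows = my_df.to_sql(
    "SomeSQLTable",
    conn_obj,
    if_exists="append",
    index=False,
)
print(rows)  # 0 or None here suggests the rows never reached the table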
I have stored queries (like select * from table) in a Snowflake table and want to execute each query row by row and generate a CSV file for each query. Below is the Python code where I am able to print the queries but don't know how to execute each query and create a CSV file:
I believe I am close to what I want to achieve. I would really appreciate it if someone could help here.
import pyodbc
import pandas as pd
import snowflake.connector
import os
conn = snowflake.connector.connect(
    user='User',
    password='Pass',
    account='Account',
    autocommit=True
)
try:
    cursor = conn.cursor()
    query = 'Select Column from Table;'  # this returns two SELECT statements
    output = cursor.execute(query)
    for i in cursor:
        print(i)
    cursor.close()
    del cursor
    conn.close()
except Exception as e:
    print(e)
You're pretty close. You just need to execute each query instead of printing it, and put the data into a file.
I haven't used pandas much myself, but this is the code that Snowflake documentation provides for running a query and putting it into a pandas dataframe.
cursor = conn.cursor()
query = ('Select Column, row_number() over(order by Column) as Rownum from Table;')
cursor.execute(query)
resultset = cursor.fetchall()
for result in resultset:
    cursor.execute(result[0])
    df = cursor.fetch_pandas_all()
    df.to_csv(r'C:\Users\...<your filename here>' + str(result[1]) + '.csv', index=False)
It may take some fiddling, but here are a couple of links for reference:
Snowflake docs on creating a pandas dataframe
Exporting a pandas dataframe to csv
Update: I added an example of a way to create separate files for each record. This just adds a distinct number to each row of your SQL output so you can use that number as part of the filename. Ultimately, you need some logic in your loop to create a filename, whether that's a random number, a timestamp, or whatever. That can come from the SQL or from the Python side, up to you. I'd probably add a filename column to your table, but I don't know if that makes sense for you.
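If you'd rather go the timestamp route instead of the row number, a hedged variant of the loop (same assumptions as above about the path placeholder) could look like:
from datetime import datetime

for result in resultset:
    cursor.execute(result[0])
    df = cursor.fetch_pandas_all()
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")  # microseconds keep names unique
    df.to_csv(r'C:\Users\...<your filename here>' + stamp + '.csv', index=False)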
I know that an insert-or-update-if-key-exists option for .to_sql() hasn't been implemented yet, so I'm looking for an alternative.
The first thing that comes to mind is to use the append option:
data.to_sql(
    "Dim_Objects",
    con=connection,
    if_exists="append",
    index=False
)
and remove duplicates in the database separately, after I inserted data:
DELETE FROM "Dim_Objects" a
USING "Dim_Objects" b
WHERE a."Code" = b."Code"
AND a."TimeStampUpdate" < b."TimeStampUpdate"
In this case, if there's a duplicate, I only keep the latest entry.
This approach seems to work but I hoped I could achieve the same using pandas directly.
Any ideas?
Can you try this? Load the data into "Dim_Objects" first, then update the other table from it:
from sqlalchemy import text

data.to_sql('Dim_Objects', con=connection, if_exists='replace')

sql = """
UPDATE "different_table" AS a
SET col1 = b.col1
FROM "Dim_Objects" AS b
WHERE a."Code" = b."Code"
"""
with engine.begin() as conn:
    conn.execute(text(sql))
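If you would rather stay entirely in pandas, one hedged sketch (assuming PostgreSQL, which your DELETE ... USING syntax suggests, and a unique constraint on "Code") is to pass a custom method callable to to_sql that issues INSERT ... ON CONFLICT DO UPDATE:
from sqlalchemy.dialects.postgresql import insert

def upsert_on_code(table, conn, keys, data_iter):
    # `table.table` is the underlying SQLAlchemy Table object that pandas built/reflected.
    rows = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(rows)
    stmt = stmt.on_conflict_do_update(
        index_elements=["Code"],  # assumes a unique/primary key on "Code"
        set_={k: stmt.excluded[k] for k in keys if k != "Code"},
    )
    return conn.execute(stmt).rowcount

data.to_sql("Dim_Objects", con=connection, if_exists="append",
            index=False, method=upsert_on_code)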
I want to import data from the file "save.csv" into my Actian PSQL database table "new_table", but I get this error:
ProgrammingError: ('42000', "[42000] [PSQL][ODBC Client Interface][LNA][PSQL][SQL Engine]Syntax Error: INSERT INTO 'new_table'<< ??? >> ('name','address','city') VALUES (%s,%s,%s) (0) (SQLPrepare)")
Below is my code:
connection = 'Driver={Pervasive ODBC Interface};server=localhost;DBQ=DEMODATA'
db = pyodbc.connect(connection)
c = db.cursor()

# create table i.e. new_table
csv = pd.read_csv(r"C:\Users\user\Desktop\save.csv")
for row in csv.iterrows():
    insert_command = """INSERT INTO new_table(name,address,city) VALUES (row['name'],row['address'],row['city'])"""
    c.execute(insert_command)
    c.commit()
Pandas has a built-in function, DataFrame.to_sql(), that writes a DataFrame into a SQL database. This might be what you are looking for. Using it, you don't have to insert one row at a time; you can insert the entire DataFrame at once.
If you want to keep using your method, the issue might be that the table "new_table" hasn't been created in the database yet, in which case you first need something like this:
CREATE TABLE new_table
(
Name [nvarchar](100) NULL,
Address [nvarchar](100) NULL,
City [nvarchar](100) NULL
)
EDIT:
You can use to_sql() like this on tables that already exist in the database:
df.to_sql(
    "new_table",
    schema="name_of_the_schema",
    con=c.session.connection(),  # must be a SQLAlchemy connectable, not a raw pyodbc cursor
    if_exists="append",  # <--- this will append to an already existing table
    chunksize=10000,
    index=False,
)
I have tried the same; in my case the table is created. I just want to insert each row from the pandas DataFrame into the Actian PSQL database.
I am trying to determine the fastest way to fetch data from MySQL into Pandas. So far, I have tried three different approaches:
Approach 1: Using pymysql and modifying field type (inspired by Fastest way to load numeric data into python/pandas/numpy array from MySQL)
import pymysql
from pymysql.converters import conversions
from pymysql.constants import FIELD_TYPE
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = pymysql.connect(host = host, port = port, user= user, passwd= passwd, db= db)
Approach 2: Using MySqldb
import MySQLdb
from MySQLdb.converters import conversions
from MySQLdb.constants import FIELD_TYPE
conversions[FIELD_TYPE.DECIMAL] = float
conversions[FIELD_TYPE.NEWDECIMAL] = float
conn = MySQLdb.connect(host = host, port = port, user= user, passwd= passwd, db= db)
Approach 3: Using sqlalchemy
import sqlalchemy as SQL
engine = SQL.create_engine('mysql+mysqldb://{0}:{1}@{2}:{3}/{4}'.format(user, passwd, host, port, db))
Approach 2 is the best of these three and takes an average of 4 seconds to fetch my table. However, fetching the same table takes only 2 seconds in MySQL Workbench. How can I shave off those 2 extra seconds? Does anyone know of alternative ways to accomplish this?
You can use the ConnectorX library, which is written in Rust and is about 10 times faster than pandas.
The library fetches the data from the database and fills the DataFrame.
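A minimal sketch of its use (the connection string and table name are placeholders):
import connectorx as cx

conn = "mysql://user:password@host:3306/main"  # placeholder credentials
df = cx.read_sql(conn, "SELECT * FROM my_table")  # my_table is a placeholder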
I think you may find answers using a specific library such as "peewee", or the function pd.read_sql_query from the pandas library. To use pd.read_sql_query:
MyEngine = create_engine('[YourDatabase]://[User]:[Pass]@[Host]/[DatabaseName]', echo=True)
df = pd.read_sql_query('select * from [TableName]', con= MyEngine)
Also, for uploading data from a dataframe to SQL:
df.to_sql([TableName], MyEngine, if_exists = 'append', index=False)
You must pass if_exists='append' if the table already exists, or it will default to 'fail'. You could also pass 'replace' if you want to replace the table with a new one.
For data integrity's sake, it's nice to use DataFrames for uploads and downloads because they handle data well. Depending on the size of your upload, it should be pretty efficient on upload time too.
If you want to go an extra step, peewee queries may help make upload time faster, although I have not personally tested the speed. Peewee is an ORM library, like SQLAlchemy, that I found very easy and expressive to develop with.
You could also use DataFrames with it. Just skim the documentation: you would construct and assign a query, then convert it to a DataFrame like this:
MyQuery = [TableName].select().where([TableName.column] == "value")
df = pd.DataFrame(list(MyQuery.dicts()))
Hope this helps.