I have an SQL table (table_1) that contains data, and I have a Python script that reads a CSV and creates a dataframe.
I want to compare the dataframe with the SQL table data and then insert the missing data from the dataframe into the SQL table.
I have read the posts "comparing pandas dataframe with sqlite table via sqlquery" and "Compare pandas dataframe columns to sql table dataframe columns", but was not able to get it working.
The table and the dataframe have the exact same columns.
The dataframe is:
import pandas as pd
df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
and the SQL table (using SQLAlchemy):
userid  user  income
1       Bob   40000
2       Jane  42000
I'd like to compare df to the SQL table and insert userid 3, Alice, with all her details, since hers is the only record missing between them.
Since you are only interested in inserting new records, and you are loading from a CSV, you will already have the data in local memory:
# read current userids
sql = pd.read_sql('SELECT userid FROM table_name', conn)
# keep only userids not in the sql table
df = df[~df['userid'].isin(sql['userid'])]
# insert new records
df.to_sql('table_name', conn, if_exists='append')
Other options would require first loading more data into SQL than needed.
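If the target table has no column for the DataFrame index (only userid, user and income), you may also want to skip writing the index; a minimal variation of the append above, under that assumption:

# Skip the DataFrame index so only userid, user and income are written
df.to_sql('table_name', conn, if_exists='append', index=False)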
There is still some information missing to provide a full answer. For example, how do you connect to the database (SQLAlchemy, sqlite3)? I assume the id is unique and that all new ids should be added?
If you are using SQLAlchemy, you might take a look at pangres, which can insert into and update SQL databases from a pandas dataframe. It does however require a column with the UNIQUE property in the database (meaning that every entry in it is unique; you could make the id column UNIQUE here). This method scales better than loading all data from the database and doing the comparison in Python, because only the CSV data is in memory and the database does the comparison.
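A minimal sketch of how that could look, assuming pangres's upsert function, a SQLAlchemy engine, and that userid is the UNIQUE key (set as the DataFrame index); the connection string is illustrative and the exact signature may vary across pangres versions:

import pandas as pd
from pangres import upsert
from sqlalchemy import create_engine

df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
engine = create_engine('sqlite:///example.db')  # assumption: any SQLAlchemy engine

# pangres uses the DataFrame index as the unique key
upsert(con=engine, df=df.set_index('userid'), table_name='table_1',
       if_row_exists='ignore')  # 'ignore' keeps existing rows and inserts only new ones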
If you want to do it all in Python, an option is loading the SQL table into pandas and merging the data on the userid column:
import pandas as pd
df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
sqldf = pd.read_sql_query("SELECT * FROM table_1", connection)
df = df.merge(sqldf, how='left', left_on='userid', right_on='userid')
Then you can replace the old table with the new table.
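For completeness, the "replace the old table" step could be done with to_sql, assuming connection is a SQLAlchemy connectable and it is acceptable to drop and recreate table_1 with the merged frame:

# Overwrite table_1 with the merged result (drops and recreates the table)
df.to_sql('table_1', connection, if_exists='replace', index=False)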
EDIT:
I saw another answer using merge, but keeping only the new values and sending just those to the database. That is cleaner than the code above.
Why not just left join the tables?
conn = #your connection
df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
sql = pd.read_sql("SELECT * FROM table", con = conn)
joined = pd.merge(df, sql, how = "left", on = "userid")
joined = joined[pd.isna(joined["user_y"])]
index = joined["userid"].tolist()
The variable index now contains all the userids that are in df but not in sql.
To insert them into the database:
columns = "(userid, user, income)"
insert_sql = f"INSERT INTO table {columns} VALUES (?, ?, ?)"  # placeholder style depends on your driver
for i in index:
    row = tuple(df[df["userid"] == i].iloc[0])
    conn.execute(insert_sql, row)
Consider the following dictionary, where the key is a string (my dictionary contains hundreds of such entries and I need to insert all records):
dbDic[key1]={'FuelGrade': '4', 'Delivery': '7285.000', 'UpdateFlag': 0, 'Date': '2019-06-26 00:00:00', 'SiteCode': '4198', 'FileName': 'Invoices_201906251400.csv'}
The SQL string being used is:
sql = 'INSERT INTO [dbo].[SEDeliveryTemp] VALUES (?,?,?,?,?,?)'
I have to pass the values from my dbDict, but dbDict stores each record's values as a sub-dictionary. How can I pass the arguments to the string?
I have tried running the following code:
cursor.execute(sql,dbDict.values())
It gives me the following error:
pyodbc.ProgrammingError: ('The SQL contains 6 parameter markers, but 1 parameters were supplied', 'HY000')
One approach is to sort the dict values based on the list of column names.
Ex:
tables = ["Id", "SiteCode", "FuelGrade", "Date", "Delivery", "FileName", "UpdateFlag" ]
values = {'FuelGrade': '4', 'Delivery': '7285.000', 'UpdateFlag': 0, 'Date': '2019-06-26 00:00:00', 'SiteCode': '4198', 'FileName': 'Invoices_201906251400.csv'}
values = [i[1] for i in sorted(values.items(), key=lambda x: tables.index(x[0]))]
cursor.execute(sql, values)
Depending on the size of the dictionary, if I were you I would use the pandas library and a DataFrame. I would create a DataFrame object from the dictionary, then insert the DataFrame into the SQL table using the built-in to_sql() method. Here is an example:
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame.from_dict(name_of_dict)
engine = create_engine('mysql+pymysql://user:passwd@host:port/name').connect()
df.to_sql('name_of_table', con=engine, if_exists='replace')
This is a fast solution for large sets of data (based on my personal experience; I haven't made any measurements).
I have modified my code as follows based on Rakesh's answer:
def insertToTempTable():
    sql = 'INSERT INTO [NEO_DB].[dbo].[SEMobilDeliveryTemp] VALUES (?,?,?,?,?,?)'
    conn = connectDB()
    cursor = conn.cursor()
    tables = ["Id", "SiteCode", "FuelGrade", "Date", "Delivery", "FileName", "UpdateFlag"]
    for key in dbDict.keys():
        values = [i[1] for i in sorted(dbDict[key].items(), key=lambda x: tables.index(x[0]))]
        cursor.execute(sql, values)
    conn.commit()
I want to merge an Excel file with SQL data in pandas. Here's my code:
import pandas as pd
import pymysql
from sqlalchemy import create_engine
data1 = pd.read_excel('data.xlsx')
engine = create_engine('...cloudprovider.com/...')
data2 = pd.read_sql_query("select id, column3, column4 from customer", engine)
data = data1.merge(data2, on='id', how='left')
It works. Just to make it clearer:
data1.columns gives Index(['id', 'column1', 'column2'], dtype='object')
data2.columns gives Index(['id', 'column3', 'column4'], dtype='object')
data.columns gives Index(['id', 'column1', 'column2', 'column3', 'column4'], dtype='object')
Since data2 is getting bigger, I can't query it entirely, so I want to query data2 only for the ids that exist in data1. How am I supposed to do this?
You could leverage the fact that SQLAlchemy is a great query builder. Either reflect the customer table, or build the metadata by hand:
from sqlalchemy import MetaData, select
metadata = MetaData()
metadata.reflect(engine, only=['customer'])
customer = metadata.tables['customer']
and build your query, letting SQLAlchemy worry about proper usage of placeholders, data conversion etc. You're looking for customer rows where id is in the set of ids from data1, achieved in SQL with the IN operator:
query = select([customer.c.id,
                customer.c.column3,
                customer.c.column4]).\
    where(customer.c.id.in_(data1['id']))

data2 = pd.read_sql_query(query, engine)
If you wish to keep on using SQL strings manually, you could build a parameterized query as such:
placeholders = ','.join(['%s'] * data1['id'].count())
# Note that you're not formatting the actual values here, but placeholders
query = f"SELECT id, column3, column4 FROM customer WHERE id IN ({placeholders})"
data2 = pd.read_sql_query(query, engine, params=list(data1['id']))
In general it is beneficial to learn to use placeholders instead of mixing SQL and values by formatting/concatenating strings, as the latter may expose you to SQL injection if you are handling user-generated data. Usually you'd write the required placeholders in the query string directly, but some string building is required if you have a variable number of parameters [1].
[1]: Some DB-API drivers, such as psycopg2, allow passing tuples and lists as scalar values and know how to construct suitable SQL.
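For example, with psycopg2 a Python tuple is adapted to a parenthesized SQL list, so a single placeholder can cover the whole IN clause. A sketch, assuming a PostgreSQL database and connection details of your own:

import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # assumption: your own connection details
query = "SELECT id, column3, column4 FROM customer WHERE id IN %s"
# psycopg2 expands the tuple into (v1, v2, ...) for the IN clause
data2 = pd.read_sql_query(query, conn, params=(tuple(data1['id']),))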
Since you are looking for a WHERE ... IN (some_list) condition, this should work for you:
id_list = data1['id'].tolist()
your_query = "select id, column3, column4 from customer where id in " + str(tuple(id_list))
data2 = pd.read_sql_query(your_query, engine)
Hope it works.
I have a list of data in one of the columns of a pandas dataframe for which I want to query a SQL Server database. Is there any way I can query a SQL Server DB based on data I have in a pandas dataframe?
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in", where the SQL can query the DB based on the data available in the array; I was trying to find something similar.
Thanks.
If you are working with tiny DataFrame, then the easiest way would be to generate a corresponding SQL:
In [8]: df
Out[8]:
   id  val
0   1   21
1   3  111
2   5   34
3  12   76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DFs, as the generated query (and/or the list of bind variables) might be too long either for pandas' read_sql() function or for SQL Server, or even for both.
For bigger DFs I would recommend writing your pandas DF to a SQL Server table and then using a SQL subquery to filter the needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False, if_exists='replace')
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn)
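If the helper table should not linger, it can be dropped once the filtered data has been read back; a sketch, assuming conn is the SQLAlchemy engine created above:

from sqlalchemy import text

# Remove the temporary table created by to_sql()
with conn.begin() as connection:
    connection.execute(text("DROP TABLE tmp_tab_name"))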
This is a very familiar scenario, and one can use the code below to query SQL using a very large pandas dataframe. The parameter n needs to be adjusted based on your SQL Server memory. For me, n=25000 worked.
n = 25000  # chunk row size
## big_frame dataframe divided into smaller chunks of n rows, stored in a list
list_df = [big_frame[i:i+n] for i in range(0, big_frame.shape[0], n)]

## Create another dataframe with column names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])

## Print total no. of iterations
print("Total Iterations:", len(list_df))

for i in range(0, len(list_df)):
    print("Iteration :", i)
    temp_frame = list_df[i]
    testList = temp_frame['customer_no']
    ## Pass a smaller chunk of data to SQL (here I am passing a list of customers)
    temp_DF = SQL_Query(tuple(testList))
    print(temp_DF.shape[0])
    ## Append all the data retrieved from SQL to big_frame_2
    big_frame_2 = big_frame_2.append(temp_DF, ignore_index=True)
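The snippet above assumes an SQL_Query helper that takes a tuple of customer numbers and returns the matching rows; a possible shape for it (the table name, column name and connection are illustrative):

def SQL_Query(customer_nos):
    # Build one placeholder per customer number; the '?' style depends on your driver
    placeholders = ','.join(['?'] * len(customer_nos))
    q = f"SELECT * FROM customer_table WHERE customer_no IN ({placeholders})"
    return pd.read_sql(q, conn, params=list(customer_nos))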
I am doing a lot of SQL to pandas and I have run into the following challenge.
I have a dataframe that looks like:
UserID, AccountNo, AccountName
123, 12345, 'Some name'
...
What I would like to do is, for each account number, add a column called TotalRevenue that is pulled from a MySQL database. So I am thinking of something like:
for accountno in df['AccountNo']:
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
And I need to expand the dataframe such that:
UserID, AccountNo, AccountName, TotalRevenue
123, 12345, 'Some name', df1
...
The code that I have so far (which is not working; it raises a getitem error) is:
sets3 = []
i = 0
for accountno in df5['kna1_kunnr']:
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
    df2 = pd.DataFrame([(df5['userid'][i], df5['kna1_kunnr'][i], accountno, df5['kna1_name1'][i], df1['sum'][0])],
                       columns=['User ID', 'AccountNo', 'tjeck', 'AccountName', 'Revenue'])
    sets3.append(df2)
    i += 1

df6 = pd.concat(sets3)
This idea/code is not pretty, and I wonder if there is a better/nicer way to do it. Any ideas?
Consider exporting the pandas data to MySQL as a temp table, then run an SQL query that joins your pandas data with an aggregate query for TotalRevenue. Then read the resultset into a pandas dataframe. This approach avoids any looping.
from sqlalchemy import create_engine
...
# SQL ALCHEMY CONNECTION (PREFERRED OVER RAW CONNECTION)
engine = create_engine('mysql://user:pwd@localhost/database')
# engine = create_engine("mysql+pymysql://user:pwd@hostname:port/database") # load pymysql
df1.to_sql("mypandastemptable", con=engine, if_exists='replace')
sql = """SELECT t.UserID, t.AccountNo, t.AccountName, agg.TotalRevenue
         FROM mypandastemptable t
         LEFT JOIN
            (SELECT VBRK_KUNAG as AccountNo,
                    SUM(VBRK_NETWR) as TotalRevenue
             FROM sapdata2016.orders
             WHERE VBAK_BSARK='ZEDI'
             GROUP BY VBRK_KUNAG) agg
         ON t.AccountNo = agg.AccountNo
      """
newdf = pd.read_sql(sql, con=engine)
Of course the converse works as well: merge two pandas dataframes, the existing dataframe and the grouped aggregate query resultset:
sql = """SELECT VBRK_KUNAG as AccountNo,
                SUM(VBRK_NETWR) as TotalRevenue
         FROM sapdata2016.orders
         WHERE VBAK_BSARK='ZEDI'
         GROUP BY VBRK_KUNAG
      """
df2 = pd.read_sql(sql, con=engine)
newdf = df1.merge(df2, on='AccountNo', how='left')
Background
I'm building an application that passes data from a CSV to a MS SQL database. This database is being used as a repository for all of my enterprise's records of this type (phone calls). When I run the application, it reads the CSV and converts it to a pandas dataframe, which I then append to my table in SQL using SQLAlchemy and pyodbc.
However, due to the nature of the content I'm working with, there is oftentimes data that we already have imported to the table. I am looking for a way to check if my primary key exists (a column in my SQL table and in my dataframe) before appending each record to the table.
Current code
# save dataframe to mssql DB
engine = sql.create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')
df.to_sql('Calls', engine, if_exists='append')
Sample data
My CSV is imported as a pandas dataframe (the primary key is FileName; it's always unique), then passed to MS SQL. This is my dataframe (df):
+---+------------+-------------+
| | FileName | Name |
+---+------------+-------------+
| 1 | 123.flac | Robert |
| 2 | 456.flac | Michael |
| 3 | 789.flac | Joesph |
+---+------------+-------------+
Any ideas? Thanks!
Assuming you have no memory constraints and you're not inserting null values, you could:
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = pd.concat((df, sql_df)).drop_duplicates(subset=['pk_1', 'pk_2', 'pk_3'], keep=False)
df = df.dropna()
df.to_sql('my_table', con=con, if_exists='append')
Depending on the application you could also reduce the size of sql_df by changing the query.
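For instance, you could pull only the keys that can actually collide with the batch you are about to insert, rather than every key in the table; a sketch using the Calls/FileName names from the question (the placeholder style depends on your driver):

# Only fetch keys that appear in the new batch, instead of the whole table
keys = df['FileName'].tolist()
placeholders = ','.join(['?'] * len(keys))
sql = f"SELECT FileName FROM Calls WHERE FileName IN ({placeholders})"
sql_df = pd.read_sql(sql=sql, con=con, params=keys)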
Update - Better overall and can insert null values:
pks = ['pk_1', 'pk_2', 'pk_3']
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)
df = df.loc[df[pks].merge(sql_df[pks], on=pks, how='left', indicator=True)['_merge'] == 'left_only']
# df = df.drop_duplicates(subset=pks)  # add it if you want to drop any duplicates that you may insert
df.to_sql('my_table', con=con, if_exists='append')
What if you iterated through the rows with DataFrame.iterrows() and then, on each iteration, used ON DUPLICATE handling for your key value FileName so it is not added again?
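A rough sketch of that row-by-row idea, reusing the engine and the Calls/FileName/Name names from the question; note that T-SQL has no ON DUPLICATE KEY clause, so a conditional insert stands in for it here:

from sqlalchemy import create_engine, text

engine = create_engine('mssql+pyodbc://CTR-HV-DEVSQL3/MasterCallDb')

with engine.begin() as connection:
    for _, row in df.iterrows():
        # Insert the row only if no record with this FileName exists yet
        connection.execute(
            text("IF NOT EXISTS (SELECT 1 FROM Calls WHERE FileName = :fn) "
                 "INSERT INTO Calls (FileName, Name) VALUES (:fn, :name)"),
            {"fn": row["FileName"], "name": row["Name"]},
        )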
You can check if it is empty, like this:
sql = "SELECT pk_1, pk_2, pk_3 FROM my_table"
sql_df = pd.read_sql(sql=sql, con=con)

if sql_df.empty:
    print("Is empty")
else:
    print("Is not empty")
You can set the parameter index=False; see the example below:
data.to_sql('book_details', con=engine, if_exists='append', chunksize=1000, index=False)
If it is not set, then the command automatically adds the index column:
book_details is the name of the table we want to insert our dataframe into.
Result
[SQL: INSERT INTO book_details (`index`, book_id, title, price) VALUES (%(index)s, %(book_id)s, %(title)s, %(price)s)]
[parameters: ({'index': 0, 'book_id': 55, 'title': 'Programming', 'price': 29},
{'index': 1, 'book_id': 66, 'title': 'Learn', 'price': 23},
{'index': 2, 'book_id': 77, 'title': 'Data Science', 'price': 27})]
Therefore, unless you pass index=False, the index column needs to exist in the table!