I have a database with around 10 columns. Sometimes I need to insert a row which has only 3 of the required columns, the rest are not in the dic.
The data to be inserted is a dictionary named row :
(this insert is to avoid duplicates)
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem : when I don't have all columns in the row(dic), it will insert dic fields by order (a 3 keys dic will be inserted to the first 3 columns) and not to the right columns. ( I expect the keys in dic to fit the db columns)
Why ?
Consider explicitly naming the columns to be inserted in INSERT INTO and SELECT clauses which is best practice for SQL append queries. Doing so, the dynamic query should work for all or subset of columns. Below uses F-string (available Python 3.6+) for all interpolation to larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()
Related
I'm trying to load database with number of tables with different number of columns from pandas into MS SQL Server, I've reached the last step:
I am dynamically creating the new tables in the database, base on the rows in pandas, fetching the data ... but I have issue with last step - adding the data into the columns, names of rows are being gathered dynamically, however I cannot pass them to the system so it would threat them as name of rows not as strings
with conn.cursor() as cursor:
cursor.execute("IF EXISTS(SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = '"+foldern[0]+"' AND TABLE_SCHEMA = 'dbo') DROP TABLE [dbo].["+foldern[0]+"];")
skuel = "CREATE TABLE " + foldern[0] + " ( "+all_columns+" ) AS NODE"
print(foldern[0])
cursor.execute(skuel)
conn.commit()
for index, row in df2.iterrows():
skuel2 = "INSERT INTO " + foldern[0] + " ( "+all_columns2+" ) values("+all_columns3+")"
cursor.execute(skuel2, *namesofrows)
Under +all_columns2+ - I do get name of columns, under values I got as many "?" as there is columns
Under "namesofrows" is a tupple that contains the names of rows from df2 (in format row.ROWNAME ), but those rows are not "resolved", what I mean by that, is that instead of having the values of rows coming from pandas, I'm having number of rows with same string which is the name of row (like row.id, row.types)
Any idea how to proceed with that?
That is good point, I was not aware of SQLAlchemy possibility which I will probably use - but - btw. I did solve the issue
resolvedvalues = []
for e in namesofrows:
resolvedvalues.append(eval(e))
cursor.execute(skuel2,(resolvedvalues))
I have a list contains many lists in python.
my_list = [['city', 'state'], ['tampa', 'florida'], ['miami','florida']]
The nested list at index 0 contains the column headers, and rest of the nested lists contain corresponding values. How would I insert this into sql server using pyodbc or slqalchemy? I have been using pandas pd.to_sql and want to make this a process in pure python. Any help would be greatly appreciated.
expected output table would look like:
city |state
-------------
tampa|florida
miami|florida
Since the column names are coming from your list you have to build a query string to insert the values. Column names and table names can't be parameterised with placeholders (?).
import pyodbc
conn = pyodbc.connect(my_connection_string)
cursor = conn.cursor()
my_list = [['city', 'state'], ['tampa', 'florida'], ['miami','florida']]
columns = ','.join(my_list[0]) #String of column names
values = ','.join(['?'] * len(my_list[0])) #Placeholders for values
query = "INSERT INTO mytable({0}) VALUES ({1})".format(columns, values)
#Loop through rest of list, inserting data
for l in my_list[1:]:
cursor.execute(query, l)
conn.commit() #save changes
Update:
If you have a large number of records to insert you can do that in one go using executemany. Change the code like this:
columns = ','.join(my_list[0]) #String of column names
values = ','.join(['?'] * len(my_list[0])) #Placeholders for values
#Bulk insert
query = "INSERT INTO mytable({0}) VALUES ({1})".format(columns, values)
cursor.executemany(query, my_list[1:])
conn.commit() #save change
Assuming conn is already open connection to your database:
cursor = conn.cursor()
for row in my_list:
cursor.execute('INSERT INTO my_table (city, state) VALUES (?, ?)', row)
cursor.commit()
Since the columns value are are the first elemnts in the array, just do:
q ="""CREATE TABLE IF NOT EXISTS stud_data (`{col1}` VARCHAR(250),`{col2}` VARCHAR(250); """
sql_cmd = q.format(col1 = my_list[0][0],col2 = my_list[0][1])
mcursor.execute(sql)#Create the table with columns
Now to add the values to the table, do:
for i in range(1,len(my_list)-1):
sql = "INSERT IGNORE into test_table(city,state) VALUES (%s, %s)"
mycursor.execute(sql,my_list[i][0],my_list[i][1])
mycursor.commit()
print(mycursor.rowcount, "Record Inserted.")#Get count of rows after insertion
I've been trying to use this piece of code:
# df is the dataframe
if len(df) > 0:
df_columns = list(df)
# create (col1,col2,...)
columns = ",".join(df_columns)
# create VALUES('%s', '%s",...) one '%s' per column
values = "VALUES({})".format(",".join(["%s" for _ in df_columns]))
#create INSERT INTO table (columns) VALUES('%s',...)
insert_stmt = "INSERT INTO {} ({}) {}".format(table,columns,values)
cur = conn.cursor()
cur = db_conn.cursor()
psycopg2.extras.execute_batch(cur, insert_stmt, df.values)
conn.commit()
cur.close()
So I could connect into Postgres DB and insert values from a df.
I get these 2 errors for this code:
LINE 1: INSERT INTO mrr.shipments (mainFreight_freight_motherVesselD...
psycopg2.errors.UndefinedColumn: column "mainfreight_freight_mothervesseldepartdatetime" of relation "shipments" does not exist
for some reason, the columns can't get the values properly
What can I do to fix it?
You should not do your own string interpolation; let psycopg2 handle it. From the docs:
Warning Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.
Since you also have dynamic column names, you should use psycopg2.sql to create the statement and then use the standard method of passing query parameters to psycopg2 instead of using format.
I have a massive table (over 100B records), that I added an empty column to. I parse strings from another field (string) if the required string is available, extract an integer from that field, and want to update it in the new column for all rows that have that string.
At the moment, after data has been parsed and saved locally in a dataframe, I iterate on it to update the Redshift table with clean data. This takes approx 1sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
cur.execute(update_query)
where update_query is a function to generate the update query:
def update_query(id, int1, int2):
query = """
update tab_tab
set
clean_int_1 = {}::int,
clean_int_2 = {}::int,
updated_date = GETDATE()
where id = {}
;
"""
return query.format(int1, int2, id)
and where clean_df is structured like:
id . field_to_parse . clean_int_1 . clean_int_2
1 . {'int_1':'2+1'}. 3 . np.nan
2 . {'int_2':'7-0'}. np.nan . 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows by pushing the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of data frame. Even add an index of the join field, id, and drop the very large staging table at end.
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://myuser:mypwd!#myhost/mydatabase")
# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)
# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
conn.execute(sql)
sql = """UPDATE tab_tab t
SET t.clean_int_1 = c.int1,
t.clean_int_2 = c.int2,
t.updated_date = GETDATE()
FROM clean_df c
WHERE c.id = t.id
"""
conn.execute(sql)
sql = "DROP TABLE IF EXISTS clean_df"
conn.execute(sql)
engine.dispose()
I have a list of data in one of the pandas dataframe column for which I want to query SQL Server database. Is there any way I can query a SQL Server DB based on data I have in pandas dataframe.
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in" where the SQL can query the DB based on the data available in the array, I was trying to find something similar.
Thanks.
If you are working with tiny DataFrame, then the easiest way would be to generate a corresponding SQL:
In [8]: df
Out[8]:
id val
0 1 21
1 3 111
2 5 34
3 12 76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DF's as the generated query (and/or list of bind variables) might bee too long either for Pandas to_sql() function or for SQL Server or even for both.
For bigger DFs I would recommend to write your pandas DF to SQL Server table and then use SQL subquery to filter needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False)
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn, if_exists='replace')
This is a very familiar scenario and one can use the below code to Query SQL using a very large pandas dataframe. The parameter n needs to be manipulated based on your SQL server memory. For me n=25000 worked.
n = 25000 #chunk row size
## Big_frame dataframe divided into smaller chunks of n into a list
list_df = [big_frame[i:i+n] for i in range(0,big_frame.shape[0],n)]
## Create another dataframe with columns names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])
## Print total no. of iterations
print("Total Iterations:", len(list_df))
for i in range(0,len(list_df)):
print("Iteration :",i)
temp_frame = list_df[i]
testList = temp_frame['customer_no']
## Pass smaller chunk of data to SQL(here I am passing a list of customers)
temp_DF = SQL_Query(tuple(testList))
print(temp_DF.shape[0])
## Append all the data retrieved from SQL to big_frame_2
big_frame_2=big_frame_2.append(temp_DF, ignore_index=True)