The following Python code successfully appends the rows belonging to the pandas dataframe into an MS SQL table via the SqlAlchemy engine previously configured.
df.to_sql(schema='stg', name = 'TEST', con=engine, if_exists='append', index=False)
I want to obtain the auto-generated ID numbers for each of the rows inserted into the stg.TEST table. In other words, what is the SQLAlchemy equivalent of the SQL Server OUTPUT clause on an INSERT statement?
Unfortunately, there is no easy solution to your problem, such as an additional parameter in your statement. You have to rely on the behavior that new rows get the highest id + 1 assigned. With this knowledge, you can calculate the ids of all your rows.
Option 1: Explained in this answer. You select the current maximum id before the insert statement. Then you assign ids greater than the previous maximum to all the entries in your DataFrame. Lastly, you insert the df, which already includes the ids.
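For completeness, here is a minimal sketch of Option 1. It is only an illustration under some assumptions: the target table is stg.TEST with an integer id column, there are no concurrent writers, and explicit id values may be inserted.

import sqlalchemy as sa

def insert_df_with_precalculated_ids(df, engine):
    # Read the current maximum id before inserting (COALESCE covers an empty table)
    with engine.connect() as conn:
        max_id = conn.execute(sa.text('SELECT COALESCE(MAX(id), 0) FROM stg.TEST')).scalar()
    # Assign ids greater than the previous maximum to the new rows
    df = df.copy()
    df['id'] = range(max_id + 1, max_id + 1 + len(df))
    # Insert the df, which now already includes the ids
    # (if id is an IDENTITY column, SQL Server additionally requires
    # SET IDENTITY_INSERT stg.TEST ON for explicit id values)
    df.to_sql(schema='stg', name='TEST', con=engine, if_exists='append', index=False)
    return df['id'].tolist()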
Option 2: You insert the DataFrame and then acquire the highest id. With the number of entries inserted, you can calculate the ids of all entries. Such an insert function could look like this:
def insert_df_and_return_ids(df, engine):
    # It is important to use the same connection for both statements if
    # something like last_insert_rowid() is used
    conn = engine.connect()
    # Insert the df into the database
    df.to_sql('students', conn, if_exists='append', index=False)
    # Acquire the maximum id
    result = conn.execute('SELECT max(id) FROM students')  # Should work for all SQL variants
    # result = conn.execute('SELECT last_insert_rowid()')  # Specifically for SQLite
    # result = conn.execute('SELECT last_insert_id()')     # Specifically for MySQL
    entries = df.shape[0]
    last_id = -1
    # Iterate over the result to get the last inserted id
    for row in result:
        last_id = int(str(row[0]))
    conn.close()
    # Generate the list of ids
    list_of_ids = list(range(last_id - entries + 1, last_id + 1))
    return list_of_ids
PS: I could not test the function on an MS SQL server, but the behavior should be the same. In order to test if everything behaves as it should you can use this:
import numpy as np
import pandas as pd
import sqlalchemy as sa
# Change connection to MS SQL server
engine = sa.create_engine('sqlite:///test.lite', echo=False)
# Create table
meta = sa.MetaData()
students = sa.Table(
'students', meta,
sa.Column('id', sa.Integer, primary_key = True),
sa.Column('name', sa.String),
)
meta.create_all(engine)
# DataFrame to insert with two entries
df = pd.DataFrame({'name': ['Alice', 'Bob']})
ids = insert_df_and_return_ids(df, engine)
print(ids) # [1,2]
conn = engine.connect()
# Insert any entry with a high id in order to check if new ids are always the maximum
result = conn.execute("Insert into students (id, name) VALUES (53, 'Charlie')")
conn.close()
# Insert data frame again
ids = insert_df_and_return_ids(df, engine)
print(ids) # [54, 55]
EDIT: If multiple threads are utilized, transactions can be used to make this option thread-safe, at least for SQLite:
conn = engine.connect()
transaction = conn.begin()
df.to_sql('students', conn, if_exists='append', index=False)
result = conn.execute('SELECT max(id) FROM students')
transaction.commit()
I'm trying to import a CSV file to SQL Server using Python. However, in the last part of the code something is not working as I'd like it to: every row looks the same after importing to SQL Server, even though the rows in the DataFrame are different.
CODE:
import pypyodbc as pdb
import pandas as pd
import plotly.graph_objects as go
import numpy as np
server = r'LAPTOP-124CPEDE\SQLEXPRESS'
database = 'Datasets'
conn = pdb.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';Trusted_Connection=yes;')
cursor = conn.cursor()
data = pd.read_csv(r'C:\Users\aleks\Desktop\zadania python+sql\dataset_1.csv', sep=';')
df = pd.DataFrame(data, columns=['a_timestamp', 'any_date', 'some_money', 'weird_name', 'count_of_something'])
df
cursor.execute('CREATE TABLE dataset_1 (a_timestamp nvarchar(255), any_date date, some_money float, weird_name nvarchar(255), count_of_something float)')
tuple = (row.a_timestamp, row.any_date, row.some_money, row.weird_name, row.count_of_something)
for row in df.itertuples():
    cursor.execute('''
        INSERT INTO Datasets.dbo.dataset_1 (a_timestamp, any_date, some_money, weird_name, count_of_something)
        VALUES (?,?,?,?,?)
        ''',
        tuple)
conn.commit()
Instead of inserting the same row values (the tuple assigned before the for loop), you can obtain the values of each DataFrame row on every iteration of the loop and insert them into the database table:
for row in df.itertuples():
    values = (row.a_timestamp, row.any_date, row.some_money, row.weird_name, row.count_of_something)
    cursor.execute('''
        INSERT INTO Datasets.dbo.dataset_1 (a_timestamp,
                                            any_date,
                                            some_money,
                                            weird_name,
                                            count_of_something)
        VALUES (?,?,?,?,?)
        ''',
        values)
I am loading data from various sources (CSV, XLS, JSON, etc.) into pandas DataFrames, and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?
I know pandas has a to_sql function, but that only works with a database connection; it cannot just generate the SQL as a string.
Example
What I would like is to take a dataframe like so:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
And a function that would generate this (this example is PostgreSQL but any would be fine):
CREATE TABLE data
(
index timestamp with time zone,
"A" double precision,
"B" double precision,
"C" double precision,
"D" double precision
)
If you only want the 'CREATE TABLE' sql code (and not the insert of the data), you can use the get_schema function of the pandas.io.sql module:
In [10]: print(pd.io.sql.get_schema(df.reset_index(), 'data'))
CREATE TABLE "data" (
"index" TIMESTAMP,
"A" REAL,
"B" REAL,
"C" REAL,
"D" REAL
)
Some notes:
I had to use reset_index because otherwise it didn't include the index.
If you provide an SQLAlchemy engine of a certain database flavor, the result will be adjusted to that flavor (e.g. the data type names), as shown in the sketch below.
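For example, here is a sketch of passing an engine so the DDL is rendered for that dialect (the connection URL below is only a placeholder):

import pandas as pd
import sqlalchemy as sa

# Placeholder connection string; point this at your own database
engine = sa.create_engine('postgresql://user:password@localhost:5432/mydb')

# With an engine, the generated DDL uses that dialect's type names
# instead of the generic TIMESTAMP / REAL shown above
print(pd.io.sql.get_schema(df.reset_index(), 'data', con=engine))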
GENERATE SQL CREATE STATEMENT FROM DATAFRAME
SOURCE = df
TARGET = data
def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    # SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
    # SOURCE: source dataframe
    # TARGET: target table to be created in database
    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)
    return sql_text
Check the SQL CREATE TABLE Statement String
print(sql_text)
GENERATE SQL INSERT STATEMENT FROM DATAFRAME
def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():
        sql_texts.append('INSERT INTO ' + TARGET + ' (' + str(', '.join(SOURCE.columns)) + ') VALUES ' + str(tuple(row.values)))
    return sql_texts
Check the SQL INSERT INTO Statement String
print('\n\n'.join(sql_texts))
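Putting the two helpers together, a hypothetical usage sketch (the table name 'data' is only an example, and df is the DataFrame from the question) looks like this:

# Hypothetical usage of the two helpers above
sql_text = SQL_CREATE_STATEMENT_FROM_DATAFRAME(df, 'data')
print(sql_text)

sql_texts = SQL_INSERT_STATEMENT_FROM_DATAFRAME(df, 'data')
print('\n\n'.join(sql_texts))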
Insert Statement Solution
Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. This version also takes care of nan values with the help of regular expressions.
import re

def get_insert_query_from_df(df, dest_table):
    insert = """
        INSERT INTO `{dest_table}` (
        """.format(dest_table=dest_table)
    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n ', columns_string)
    columns_string = re.sub(r'\'', '', columns_string)

    values_string = ''
    for row in df.itertuples(index=False, name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'

    return insert + columns_string + ')\n VALUES\n' + values_string[:-2] + ';'
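A quick usage sketch (the DataFrame and table name here are invented for illustration):

import numpy as np
import pandas as pd

sample = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob'], 'score': [3.5, np.nan]})
# The nan in the last row is emitted as null in the generated query
print(get_insert_query_from_df(sample, 'my_table'))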
If you want to write the file yourself, you can also retrieve the column names and dtypes and build a dictionary to convert the pandas data types to SQL data types.
As an example:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = list(map(lambda x: x.name, df.dtypes.values))

# Storing column names and dtypes in a dataframe
tableDef = pd.DataFrame(index=range(len(df.columns) + 1), columns=['cols', 'dtypes'])
tableDef.iloc[0] = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols'] = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes

# Defining a dictionary to convert dtypes
conversion = {'datetime64[ns]': 'timestamp with time zone', 'float64': 'double precision'}

# Writing sql in a file
f = open('yourdir\\%s.sql' % tableName, 'w')
f.write('CREATE TABLE %s\n' % tableName)
f.write('(\n')
for i, row in tableDef.iterrows():
    sep = ",\n" if i < tableDef.index[-1] else "\n"
    f.write('\t"%s" %s%s' % (row['cols'], conversion[row['dtypes']], sep))
f.write(')')
f.close()
You can do the same to populate your table with INSERT INTO statements; a sketch follows below.
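For example, a sketch that appends naive INSERT statements to the same file (the quoting here is simplistic and only meant for an illustrative dump, not for untrusted data):

# Append INSERT statements to the same .sql file (naive quoting for illustration)
f = open('yourdir\\%s.sql' % tableName, 'a')
for idx, row in df.iterrows():
    values = [repr(str(idx))] + [repr(v) if isinstance(v, str) else str(v) for v in row]
    f.write('INSERT INTO %s VALUES (%s);\n' % (tableName, ', '.join(values)))
f.close()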
SINGLE INSERT QUERY SOLUTION
I didn't find the above answers to suit my needs. I wanted to create one single insert statement for a dataframe with each row as the values. This can be achieved by the below:
import re
import pandas as pd
table = 'your_table_name_here'
# You can read from CSV file here... just using read_sql_query as an example
df = pd.read_sql_query(f'select * from {table}', con=db_connection)
cols = ', '.join(df.columns.to_list())
vals = []
for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")
    row_str = ', '.join(row)
    vals.append(row_str)

f_values = []
for v in vals:
    f_values.append(f'({v})')

# Handle inputting NULL values
f_values = ', '.join(f_values)
f_values = re.sub(r"('None')", "NULL", f_values)

sql = f"insert into {table} ({cols}) values {f_values};"
print(sql)
db_connection.dispose()
If you're just looking to generate a string with INSERTs based on a pandas.DataFrame, I'd suggest using the bulk SQL insert syntax as suggested by @rup.
Here's an example of a function I wrote for that purpose:
import pandas as pd
import re

def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"
By the way, it converts nan/None entries to NULL, and it's possible to pass constant column=value pairs as keyword arguments (see the status="APPROVED" and address=None arguments in the docstring example).
Generally, this works faster because a database does a lot of work for every single insert: checking constraints, building indices, flushing, writing to the log, and so on. These complex operations can be optimized by the database when they are done in one multi-row operation instead of calling the engine row by row.
Taking user @Jaris's post to get the CREATE statement, I extended it further to work for any CSV:
import sqlite3
import pandas as pd
db = './database.db'
csv = './data.csv'
table_name = 'data'
# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()
# now we can insert data
def insert_data(row, c):
    values = str(row.name) + ',' + ','.join([str('"' + str(v) + '"') for v in row])
    sql_insert = f"INSERT INTO {table_name} VALUES ({values})"
    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL:{sql_insert} \n failed with Error:{e}")
# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)
# finally commit all those inserts into the database
conn.commit()
Hopefully this is simpler than the alternative answers and more Pythonic!
Depending on whether you can forgo generating an intermediate representation of the SQL statement, you can also just execute the insert statement outright:
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", list(df_.values))
This worked a little better as there is less messing around with string generation.
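For context, a minimal self-contained sketch of that call (assuming SQLite and a purely numeric DataFrame named df_; the table mirrors the earlier CREATE TABLE example) could be:

import sqlite3

import numpy as np
import pandas as pd

df_ = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

con = sqlite3.connect('test.db')
con.execute('CREATE TABLE IF NOT EXISTS data ("A" REAL, "B" REAL, "C" REAL, "D" REAL)')
# .values.tolist() yields plain Python floats that the DB-API driver can bind directly
con.executemany(
    "INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)",
    df_.values.tolist(),
)
con.commit()
con.close()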
I am trying to figure out an efficient way of writing a header row into an xls file from my Oracle table instead of having to do this every time, because some of my results are 50-70 columns across:
headings1 = ['Column 1', 'Column 2', etc]
rowx = 0
for colx, value in enumerate(headings1):
    sheet1.write(rowx, colx, value)
My current code only writes the rows of data starting at row 2, because I have been manually creating an Excel template with all sheet names and header rows predefined. Creating that template is a lot of work, so I want to get rid of that part and have the code automatically write row 1 as my headers.
import cx_Oracle
import xlwt
from xlutils.copy import copy
import xlrd
SQL = "SELECT Column1, Column2, etc from TABLE"
cursor = con.cursor()
cursor.execute(SQL)
wb_read = xlrd.open_workbook('Template.xls',formatting_info=True)
wb_read.Visible = 1
wb_write = copy(wb_read)
sheet0 = wb_write.get_sheet(0)
for i, row in enumerate(cursor):
    for j, col in enumerate(row):
        sheet0.write(i + 1, j, col)  # Starts pasting the data at row 2
wb_write.save('Output.xls')
The current file includes 5-7 sheets that I have to write data to in the same workbook as well as 5-7 cursors being used, this is an example of the first cursor.
PEP 249 specifies a .description attribute on cursor objects, which cx_Oracle implements.
This returns a list of tuples, in which the first element of each tuple is the column name:
>>> db = cx_Oracle.connect('schema/pw#db/db')
>>> curs = db.cursor()
>>> sql = "select * from dual"
>>> curs.execute(sql)
<__builtin__.OracleCursor on <cx_Oracle.Connection to schema#db/db>>
>>> column_names = curs.description
>>> column_names
[('DUMMY', <type 'cx_Oracle.STRING'>, 1, 1, 0, 0, 1)]
>>>
To demonstrate a (very) slightly more complicated situation I created this table:
SQL> create table tmp_test (col1 number, col2 varchar2(10));
Table created.
It's then up to you how you use it:
>>> sql = "select * from tmp_test"
>>> curs.execute(sql)
<__builtin__.OracleCursor on <cx_Oracle.Connection to schema#db/db>>
>>> curs.description
[('COL1', <type 'cx_Oracle.NUMBER'>, 127, 22, 0, -127, 1), ('COL2', <type 'cx_Oracle.STRING'>, 10, 10, 0, 0, 1)]
>>> ','.join(c[0] for c in curs.description)
'COL1,COL2'
>>>
Just build that header from curs.description, as above, before you start enumerating your cursor values.
I needed to do exactly the same thing. It was achieved with the following code:
# Write header row
for c, col in enumerate(cursor.description):
    ws.write(0, c, col[0])
Then I wrote the data records using a similar for loop to the one you're already using, as sketched below.
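In case it helps, the data loop that follows can stay essentially the same as the one in your question, just shifted down by one row (a sketch assuming the same ws and cursor objects as above):

# Write the data rows underneath the header row
for i, row in enumerate(cursor):
    for j, col in enumerate(row):
        ws.write(i + 1, j, col)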