How can I append a pandas DataFrame to an Oracle table? - python

I want to append a pandas DataFrame to my table in Oracle, but this code deletes all the rows in the table :(
My DataFrame and my result look like this:
0, 0, 0, ML_TEST, 0, 5
0, 0, 0, ML_TEST, 0, 6
produced by the code block below:
import cx_Oracle
import pandas as pd
from sqlalchemy import types, create_engine

dataset = pd.read_csv("denemedf.txt", delimiter=",")
print(dataset)

engine = create_engine('oracle://***:***@***:***/***', echo=False)
dataset.to_sql(name='dev_log', con=engine, if_exists='append', index=False)
How can I append this DataFrame's rows to the end of the table without deleting the existing rows?
Now I tried again and it appends to the end, but on the first try it deleted all existing rows.
How can I do this reliably without causing any problems?
Actually the problem occurs because of the schema of this table: the table is owned by gnl, but I connected as prg, so to_sql could not find the table and created another one.
Is there any way to pass the owner or schema to this function?

I think this may help:
import cx_Oracle
import pandas as pd

dataset = pd.read_csv("C:\\pathToFile\\denemedf.txt", delimiter=",")

con = cx_Oracle.connect('uname/pwd@serverName:port/instanceName')
cursor = con.cursor()

sql = 'INSERT INTO gnl.tbl_deneme VALUES(:1,:2,:3,:4,:5,:6)'
df_list = dataset.values.tolist()

# Insert the rows one by one
n = 0
for i in dataset.iterrows():
    cursor.execute(sql, df_list[n])
    n += 1

con.commit()
cursor.close()
con.close()
provided the INSERT privilege has already been granted to the schema prg for your table tbl_deneme
(after connecting as gnl -> grant insert on tbl_deneme to prg)
where your text file (denemedf.txt) is assumed to be
col1,col2,col3,col4,col5,col6
0, 0, 0, ML_TEST, 0, 5
0, 0, 0, ML_TEST, 0, 6
Moreover, a more dynamic and more performant option can be provided with cursor.executemany: it creates the table if it does not exist, using the column names from the first line of the file, and inserts the values from the remaining lines without explicitly listing the bind variables one by one, such as
import cx_Oracle
import pandas as pd

con = cx_Oracle.connect(user, password, host+':'+port+'/'+dbname)
cursor = con.cursor()

tab_name = 'gnl.tbl_deneme'

# Check whether the table already exists.
# Note: user_tables only lists tables owned by the connected user; for a table
# in another schema, query all_tables with an owner filter instead.
cursor.execute('SELECT COUNT(*) FROM user_tables WHERE table_name = UPPER(:1) ', [tab_name])
exs = cursor.fetchone()[0]

df = pd.read_csv('C:\\pathToFile\\denemedf.txt', sep=',', dtype=str)
col = df.columns.tolist()

# Build the column list for CREATE TABLE from the header line
crt = ""
for k in col:
    crt += ''.join(k) + ' VARCHAR2(4000),'

if int(exs) == 0:
    crt = 'CREATE TABLE ' + tab_name + ' (' + crt.rstrip(",") + ')'
    cursor.execute(crt)

# Build the bind-variable list (:1,:2,...) for the INSERT statement
vrs = ""
for i in range(0, len(col)):
    vrs += ':' + str(i + 1) + ','

sql = 'INSERT INTO ' + tab_name + ' VALUES(' + vrs.rstrip(",") + ')'

# Collect all rows as tuples and insert them in a single round trip
rows = []
for i in range(0, len(df)):
    rows.append(tuple(df.fillna('').values[i]))
cursor.executemany(sql, rows)

con.commit()
cursor.close()
con.close()

Assuming data_df is your DataFrame, it can be done with the three lines below:
rows = [tuple(x) for x in data_df.values]
cur.executemany("INSERT INTO table_name VALUES (:1,:2,:3,:4)",rows)
con_ora.commit()

dataset.to_sql('dev_log', engine, if_exists='append', index=False)
Pass 'dev_log' directly as the table name and engine directly as the connection, rather than name='dev_log' and con=engine.
The parameter if_exists='append' inserts new values into the existing table, so it will append the new rows to the existing table and will not delete any existing rows.
See pandas.DataFrame.to_sql.
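To address the schema question from the original post: DataFrame.to_sql also accepts a schema argument, so the owning schema can be named explicitly instead of relying on the connected user's default. A minimal sketch, reusing the masked connection string and the gnl.tbl_deneme table from the question:
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('oracle://***:***@***:***/***', echo=False)
dataset = pd.read_csv("denemedf.txt", delimiter=",")

# schema='gnl' makes pandas target gnl.tbl_deneme instead of creating
# a new table under the connected prg user.
dataset.to_sql('tbl_deneme', engine, schema='gnl', if_exists='append', index=False)
The connected user still needs the INSERT privilege on gnl.tbl_deneme, as noted above.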

Related

How to perform an SQL update on multiple rows using Python

I am retrieving data from a Postgresdb and storing it in a Pandas dataframe for further processing. While doing that I want to update the queried table and set a flag saying that these rows are getting processed.
engine = create_engine(connection_string, connect_args=credentials)
query = load_query(filename='queries/get_data.sql')
df = pd.read_sql(query, engine)

ids = df['id']
update_query = "update table1 " +\
    "set status = 'processing' " +\
    f"where session_id in ({ids})"

with engine.connect() as con:
    rs = con.execute(update_query)
The dataframe then looks like this:

ID      descr
Cell 1  Cell 2
Cell 3  Cell 4

Now I want to update the column "status". What do I need to do? I know I need a list, separated by commas and each value in quotes... but I wasn't able to build it.
Help appreciated
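One way to build such a list (a sketch, not an answer from the original thread; table and column names are taken from the question, and the quoting assumes the ids are strings):
from sqlalchemy import text

# Build a comma-separated, quoted list from the id column of the dataframe.
id_list = ", ".join(f"'{i}'" for i in df['id'])

update_query = text(
    f"update table1 set status = 'processing' where session_id in ({id_list})"
)

# engine.begin() opens a transaction and commits it when the block exits cleanly.
with engine.begin() as con:
    con.execute(update_query)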

Obtain list of IDs inserted from pandas to_sql function

The following Python code successfully appends the rows of a pandas dataframe to an MS SQL table via a previously configured SQLAlchemy engine.
df.to_sql(schema='stg', name='TEST', con=engine, if_exists='append', index=False)
I want to obtain the auto-generated ID numbers for each of the rows inserted into the stg.TEST table. In other words, what is the SQLAlchemy equivalent of the SQL Server OUTPUT clause during an INSERT statement?
Unfortunately, there is no easy solution to your problem like an additional parameter in your statement. You have to use the behavior that new rows get the highest id + 1 assigned. With this knowledge, you can calculate the ids of all your rows.
Option 1: Explained in this answer. You select the current maximum id, before the insert statement. Then, you assign ids to all the entries in your DataFrame greater than the previous maximum. Lastly, insert the df which already includes the ids.
Option 2: You insert the DataFrame and then acquire the highest id. With the number of entries inserted, you can calculate the ids of all entries. This is how such an insert function could look:
def insert_df_and_return_ids(df, engine):
    # It is important to use the same connection for both statements if
    # something like last_insert_rowid() is used
    conn = engine.connect()

    # Insert the df into the database
    df.to_sql('students', conn, if_exists='append', index=False)

    # Acquire the maximum id
    result = conn.execute('SELECT max(id) FROM students')  # Should work for all SQL variants
    # result = conn.execute('Select last_insert_rowid()')  # Specifically for SQLite
    # result = conn.execute('Select last_insert_id()')     # Specifically for MySql

    entries = df.shape[0]
    last_id = -1
    # Iterate over result to get last inserted id
    for row in result:
        last_id = int(str(row[0]))
    conn.close()

    # Generate list of ids
    list_of_ids = list(range(last_id - entries + 1, last_id + 1))
    return list_of_ids
PS: I could not test the function on an MS SQL server, but the behavior should be the same. In order to test if everything behaves as it should you can use this:
import numpy as np
import pandas as pd
import sqlalchemy as sa
# Change connection to MS SQL server
engine = sa.create_engine('sqlite:///test.lite', echo=False)
# Create table
meta = sa.MetaData()
students = sa.Table(
    'students', meta,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('name', sa.String),
)
meta.create_all(engine)
# DataFrame to insert with two entries
df = pd.DataFrame({'name': ['Alice', 'Bob']})
ids = insert_df_and_return_ids(df, engine)
print(ids) # [1,2]
conn = engine.connect()
# Insert any entry with a high id in order to check if new ids are always the maximum
result = conn.execute("Insert into students (id, name) VALUES (53, 'Charlie')")
conn.close()
# Insert data frame again
ids = insert_df_and_return_ids(df, engine)
print(ids) # [54, 55]
EDIT: If multiple threads are utilized, transactions can be used to make the option thread-safe at least for SQLite:
conn = engine.connect()
transaction = conn.begin()
df.to_sql('students', conn, if_exists='append', index=False)
result = conn.execute('SELECT max(id) FROM students')
transaction.commit()
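For completeness: newer SQLAlchemy releases offer a more direct route than the max(id) arithmetic above. With SQLAlchemy 2.0's insertmanyvalues feature, a multi-row Core INSERT can carry a RETURNING clause (rendered as an OUTPUT clause on the mssql dialect), so the generated ids come back from the insert itself. A sketch under that assumption, bypassing to_sql and describing the stg.TEST table from the question with a hypothetical name column:
import sqlalchemy as sa

metadata = sa.MetaData()
test = sa.Table(
    'TEST', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('name', sa.String(50)),  # hypothetical payload column
    schema='stg',
)

with engine.begin() as conn:
    result = conn.execute(
        test.insert().returning(test.c.id),
        df.to_dict(orient='records'),  # one parameter dict per DataFrame row
    )
    inserted_ids = [row.id for row in result]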

Can't fill rows correctly using for loop during CSV import to SQL Server

I'm trying to import a CSV file into SQL Server using Python. However, in the last part of the code something is not working as I'd like: every row looks the same after importing to SQL Server, even though the rows in the DataFrame are different.
CODE:
import pypyodbc as pdb
import pandas as pd
import plotly.graph_objects as go
import numpy as np

server = 'LAPTOP-124CPEDE\SQLEXPRESS'
database = 'Datasets'
conn = pdb.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';Trusted_Connection=yes;')
cursor = conn.cursor()

data = pd.read_csv(r'C:\Users\aleks\Desktop\zadania python+sql\dataset_1.csv', sep=';')
df = pd.DataFrame(data, columns=['a_timestamp', 'any_date', 'some_money', 'weird_name', 'count_of_something'])
df

cursor.execute('CREATE TABLE dataset_1 (a_timestamp nvarchar(255), any_date date, some_money float, weird_name nvarchar(255), count_of_something float)')

tuple = (row.a_timestamp, row.any_date, row.some_money, row.weird_name, row.count_of_something)
for row in df.itertuples():
    cursor.execute('''
        INSERT INTO Datasets.dbo.dataset_1 (a_timestamp, any_date, some_money, weird_name, count_of_something)
        VALUES (?,?,?,?,?)
        ''',
        tuple)
conn.commit()
Instead of inserting the same row of the dataframe (assigned once before the for loop), obtain the values of each dataframe row on every iteration inside the loop and insert them into the database table, like this:
for row in df.itertuples():
    values = (row.a_timestamp, row.any_date, row.some_money, row.weird_name, row.count_of_something)
    cursor.execute('''
        INSERT INTO Datasets.dbo.dataset_1 (a_timestamp,
                                            any_date,
                                            some_money,
                                            weird_name,
                                            count_of_something)
        VALUES (?,?,?,?,?)
        ''',
        values)
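If the CSV is large, a single executemany call is usually faster than one execute per row. A sketch, assuming pypyodbc's cursor supports executemany the same way pyodbc's does:
# Collect all rows as plain tuples and send them in one batch.
rows = list(df[['a_timestamp', 'any_date', 'some_money', 'weird_name',
                'count_of_something']].itertuples(index=False, name=None))
cursor.executemany('''
    INSERT INTO Datasets.dbo.dataset_1 (a_timestamp, any_date, some_money, weird_name, count_of_something)
    VALUES (?,?,?,?,?)
    ''', rows)
conn.commit()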

Generate SQL statements from a Pandas Dataframe

I am loading data from various sources (csv, xls, json etc...) into Pandas dataframes and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way to do this?
I know pandas has a to_sql function, but that requires a database connection; it cannot generate a plain SQL string.
Example
What I would like is to take a dataframe like so:
import pandas as pd
import numpy as np
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
And a function that would generate this (this example is PostgreSQL but any would be fine):
CREATE TABLE data
(
index timestamp with time zone,
"A" double precision,
"B" double precision,
"C" double precision,
"D" double precision
)
If you only want the 'CREATE TABLE' sql code (and not the insert of the data), you can use the get_schema function of the pandas.io.sql module:
In [10]: print pd.io.sql.get_schema(df.reset_index(), 'data')
CREATE TABLE "data" (
"index" TIMESTAMP,
"A" REAL,
"B" REAL,
"C" REAL,
"D" REAL
)
Some notes:
I had to use reset_index because otherwise it didn't include the index.
If you provide a SQLAlchemy engine of a certain database flavor, the result will be adjusted to that flavor (e.g. the data type names).
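For example, with a PostgreSQL engine (a sketch; the connection string is a placeholder):
from sqlalchemy import create_engine
import pandas as pd

# Hypothetical engine; substitute real credentials.
engine = create_engine('postgresql://user:password@host:5432/dbname')

# Passing con=engine makes get_schema render column types for that dialect
# instead of the generic defaults shown above.
print(pd.io.sql.get_schema(df.reset_index(), 'data', con=engine))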
GENERATE SQL CREATE STATEMENT FROM DATAFRAME
SOURCE = df
TARGET = data
def SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    # SQL_CREATE_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET)
    # SOURCE: source dataframe
    # TARGET: target table to be created in database
    import pandas as pd
    sql_text = pd.io.sql.get_schema(SOURCE.reset_index(), TARGET)
    return sql_text
Check the SQL CREATE TABLE Statement String
print(sql_text)
GENERATE SQL INSERT STATEMENT FROM DATAFRAME
def SQL_INSERT_STATEMENT_FROM_DATAFRAME(SOURCE, TARGET):
    sql_texts = []
    for index, row in SOURCE.iterrows():
        sql_texts.append('INSERT INTO ' + TARGET + ' (' + str(', '.join(SOURCE.columns)) + ') VALUES ' + str(tuple(row.values)))
    return sql_texts
Check the SQL INSERT INTO Statement String
print('\n\n'.join(sql_texts))
Insert Statement Solution
Not sure if this is the absolute best way to do it, but it is more efficient than using df.iterrows(), which is very slow. It also takes care of nan values with the help of regular expressions.
import re

def get_insert_query_from_df(df, dest_table):
    insert = """
INSERT INTO `{dest_table}` (
""".format(dest_table=dest_table)

    columns_string = str(list(df.columns))[1:-1]
    columns_string = re.sub(r' ', '\n ', columns_string)
    columns_string = re.sub(r'\'', '', columns_string)

    values_string = ''
    for row in df.itertuples(index=False, name=None):
        values_string += re.sub(r'nan', 'null', str(row))
        values_string += ',\n'

    return insert + columns_string + ')\n VALUES\n' + values_string[:-2] + ';'
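Example usage (a sketch with a small hypothetical frame; the NaN in column b comes out as null):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', np.nan]})
print(get_insert_query_from_df(df, 'my_table'))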
If you want to write the file yourself, you can also retrieve the column names and dtypes and build a dictionary to convert the pandas data types to SQL data types.
As an example:
import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

tableName = 'table'
columnNames = df.columns.values.tolist()
columnTypes = [x.name for x in df.dtypes.values]

# Storing column names and dtypes in a dataframe
tableDef = pd.DataFrame(index=range(len(df.columns) + 1), columns=['cols', 'dtypes'])
tableDef.iloc[0] = ['index', df.index.dtype.name]
tableDef.loc[1:, 'cols'] = columnNames
tableDef.loc[1:, 'dtypes'] = columnTypes

# Defining a dictionary to convert dtypes
conversion = {'datetime64[ns]': 'timestamp with time zone', 'float64': 'double precision'}

# Writing sql in a file
f = open('yourdir\\%s.sql' % tableName, 'w')
f.write('CREATE TABLE %s\n' % tableName)
f.write('(\n')
for i, row in tableDef.iterrows():
    sep = ",\n" if i < tableDef.index[-1] else "\n"
    f.write('\t\"%s\" %s%s' % (row['cols'], conversion[row['dtypes']], sep))
f.write(')')
f.close()
You can proceed the same way to populate your table with INSERT INTO statements.
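A sketch of what that companion INSERT generation could look like, reusing tableName and df from above and quoting every value as text for simplicity:
# Write one INSERT statement per dataframe row, index value first.
with open('yourdir\\%s_insert.sql' % tableName, 'w') as f:
    for idx, row in df.iterrows():
        values = ", ".join("'%s'" % v for v in [idx] + list(row.values))
        f.write('INSERT INTO %s VALUES (%s);\n' % (tableName, values))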
SINGLE INSERT QUERY SOLUTION
I didn't find that the above answers suited my needs. I wanted to create one single insert statement for a dataframe with each row as the values. This can be achieved as below:
import re
import pandas as pd

table = 'your_table_name_here'

# You can read from a CSV file here... just using read_sql_query as an example
df = pd.read_sql_query(f'select * from {table}', con=db_connection)

cols = ', '.join(df.columns.to_list())
vals = []

for index, r in df.iterrows():
    row = []
    for x in r:
        row.append(f"'{str(x)}'")
    row_str = ', '.join(row)
    vals.append(row_str)

f_values = []
for v in vals:
    f_values.append(f'({v})')

# Handle inputting NULL values
f_values = ', '.join(f_values)
f_values = re.sub(r"('None')", "NULL", f_values)

sql = f"insert into {table} ({cols}) values {f_values};"
print(sql)
db_connection.dispose()
If you're just looking to generate a string with inserts based on a pandas.DataFrame, I'd suggest using the bulk SQL insert syntax, as suggested by @rup.
Here's an example of a function I wrote for that purpose:
import pandas as pd
import re
def df_to_sql_bulk_insert(df: pd.DataFrame, table: str, **kwargs) -> str:
    """Converts DataFrame to bulk INSERT sql query
    >>> data = [(1, "_suffixnan", 1), (2, "Noneprefix", 0), (3, "fooNULLbar", 1, 2.34)]
    >>> df = pd.DataFrame(data, columns=["id", "name", "is_deleted", "balance"])
    >>> df
       id        name  is_deleted  balance
    0   1  _suffixnan           1      NaN
    1   2  Noneprefix           0      NaN
    2   3  fooNULLbar           1     2.34
    >>> query = df_to_sql_bulk_insert(df, "users", status="APPROVED", address=None)
    >>> print(query)
    INSERT INTO users (id, name, is_deleted, balance, status, address)
    VALUES (1, '_suffixnan', 1, NULL, 'APPROVED', NULL),
           (2, 'Noneprefix', 0, NULL, 'APPROVED', NULL),
           (3, 'fooNULLbar', 1, 2.34, 'APPROVED', NULL);
    """
    df = df.copy().assign(**kwargs)
    columns = ", ".join(df.columns)
    tuples = map(str, df.itertuples(index=False, name=None))
    values = re.sub(r"(?<=\W)(nan|None)(?=\W)", "NULL", (",\n" + " " * 7).join(tuples))
    return f"INSERT INTO {table} ({columns})\nVALUES {values};"
By the way, it converts nan/None entries to NULL, and it's possible to pass constant column=value pairs as keyword arguments (see the status="APPROVED" and address=None arguments in the docstring example).
Generally, a bulk insert works faster because the database does a lot of work for every single insert: checking constraints, building indices, flushing, writing to the log, etc. These complex operations can be optimized by the database when doing a several-in-one operation rather than being invoked one by one.
Taking user @Jaris's post to get the CREATE statement, I extended it further to work for any CSV:
import sqlite3
import pandas as pd

db = './database.db'
csv = './data.csv'
table_name = 'data'

# create db and setup schema
df = pd.read_csv(csv)
create_table_sql = pd.io.sql.get_schema(df.reset_index(), table_name)
conn = sqlite3.connect(db)
c = conn.cursor()
c.execute(create_table_sql)
conn.commit()

# now we can insert data
def insert_data(row, c):
    values = str(row.name) + ',' + ','.join([str('"' + str(v) + '"') for v in row])
    sql_insert = f"INSERT INTO {table_name} VALUES ({values})"
    try:
        c.execute(sql_insert)
    except Exception as e:
        print(f"SQL:{sql_insert} \n failed with Error:{e}")

# use apply to loop over dataframe and call insert_data on each row
df.apply(lambda row: insert_data(row, c), axis=1)

# finally commit all those inserts into the database
conn.commit()
Hopefully this is simpler than the alternative answers and more pythonic!
If you can forgo generating an intermediate representation of the SQL statement, you can just execute the insert directly:
con.executemany("INSERT OR REPLACE INTO data (A, B, C, D) VALUES (?, ?, ?, ?)", list(df_.values))
This worked a little better, as there is less messing around with string generation.

Write a header row into an Excel file based on an Oracle query

I am trying to figure out an efficient way of writing a header row into an xls file from my Oracle table, instead of having to do this every time, because some of my results are 50-70 columns across:
headings1 = ['Column 1', 'Column 2', etc]
rowx = 0
for colx, value in enumerate(headings1):
    sheet1.write(rowx, colx, value)
My current code only writes the rows of data starting at row 2, because I have been manually creating an Excel template that has all sheet names and header rows predefined. It is a lot of work to create the template, and I want to get rid of that part and have the script write row 1 as my headers automatically.
import cx_Oracle
import xlwt
from xlutils.copy import copy
import xlrd

SQL = "SELECT Column1, Column2, etc from TABLE"
cursor = con.cursor()
cursor.execute(SQL)

wb_read = xlrd.open_workbook('Template.xls', formatting_info=True)
wb_read.Visible = 1
wb_write = copy(wb_read)
sheet1 = wb_write.get_sheet(0)

for i, row in enumerate(cursor):
    for j, col in enumerate(row):
        sheet1.write(i+1, j, col)  # Starts pasting the data at row 2

wb_write.save('Output.xls')
The current file includes 5-7 sheets that I have to write data to in the same workbook, as well as 5-7 cursors being used; this is an example of the first cursor.
PEP 249 allows for a .description attribute of cursor objects, which has been implemented in cx_Oracle.
This returns a list of tuples, in which the first element of each tuple is the column name:
>>> db = cx_Oracle.connect('schema/pw@db/db')
>>> curs = db.cursor()
>>> sql = "select * from dual"
>>> curs.execute(sql)
<__builtin__.OracleCursor on <cx_Oracle.Connection to schema#db/db>>
>>> column_names = curs.description
>>> column_names
[('DUMMY', <type 'cx_Oracle.STRING'>, 1, 1, 0, 0, 1)]
>>>
To demonstrate a (very) slightly more complicated situation I created this table:
SQL> create table tmp_test (col1 number, col2 varchar2(10));
Table created.
It's then up to you how you use it:
>>> sql = "select * from tmp_test"
>>> curs.execute(sql)
<__builtin__.OracleCursor on <cx_Oracle.Connection to schema#db/db>>
>>> curs.description
[('COL1', <type 'cx_Oracle.NUMBER'>, 127, 22, 0, -127, 1), ('COL2', <type 'cx_Oracle.STRING'>, 10, 10, 0, 0, 1)]
>>> ','.join(c[0] for c in curs.description)
'COL1,COL2'
>>>
Just write this line before you start enumerating your cursor values.
I needed to do exactly the same thing. It was achieved with the following code:
# Write header row
for c, col in enumerate(cursor.description):
    ws.write(0, c, col[0])
Then I wrote the data records using a for loop similar to the one you're already using.
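Putting the two suggestions together, a minimal sketch that writes the headers from cursor.description and the query results underneath (connection string, query, and sheet name are placeholders, not from the original post):
import cx_Oracle
import xlwt

con = cx_Oracle.connect('uname/pwd@serverName:port/instanceName')  # placeholder credentials
cursor = con.cursor()
cursor.execute("SELECT Column1, Column2 FROM some_table")  # hypothetical query

wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')

# Row 0: column headers taken from cursor.description
for c, col in enumerate(cursor.description):
    ws.write(0, c, col[0])

# Row 1 onward: the query results
for i, row in enumerate(cursor):
    for j, val in enumerate(row):
        ws.write(i + 1, j, val)

wb.save('Output.xls')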
