Hello, I want to write my pandas DataFrame to a PostgreSQL table, and I want to use the existing DB schema as well.
I am doing:
df = pd.read_csv("ss.csv")
table_name = "some_name"
engine = create_engine("postgresql://postgres:password@localhost/databasename")
df.to_sql(table_name, engine, schema="public", if_exists='replace', index=False)
This, however, replaces the whole table, including the column names defined by the schema. I want to update the contents of the table but keep the column names.
Even when I tried if_exists='replace', I get "column of relation does not exist" because in the schema the column is named timestamp but in the df it is Timestamp.
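A minimal sketch of one workaround, assuming the target table already exists as public.some_name and the only mismatch is the column name: rename the DataFrame column to match the schema, empty the table, and append instead of replacing (TRUNCATE keeps the existing table definition):
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:password@localhost/databasename")
table_name = "some_name"

df = pd.read_csv("ss.csv")
# Rename DataFrame columns to match the existing table definition
# ("Timestamp" in the CSV vs. "timestamp" in the schema).
df = df.rename(columns={"Timestamp": "timestamp"})

# Empty the table but keep its column names and types, then append the new rows.
with engine.begin() as conn:
    conn.execute(text(f"TRUNCATE TABLE public.{table_name}"))
df.to_sql(table_name, engine, schema="public", if_exists="append", index=False)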
I am trying to do a mass insert of data in Python from one table to another, where one of the fields is an Oracle SDO_GEOMETRY(2001,4283,MDSYS.SDO_POINT_TYPE(X,Y,NULL),NULL,NULL), using the cx_Oracle library. I do have null values for this field. Is there a way to do this?
I have tried:
reading the data from the source table into a pandas DataFrame,
creating an insert statement for the cx_Oracle cursor, where the bind variable for the field's value is the column name in the pandas DataFrame,
inserting with cur.executemany(sql, df.to_dict('records')), where sql is the SQL statement, df is the DataFrame of values to insert converted to a list of dictionaries, and cur is the cx_Oracle cursor.
When you read the data in as a pandas DataFrame, change the fields to type object, and then change the null fields to None using the clearNaNs function below, this seems to work, but only up to 100,000 rows, and you have to have at least one row that is not null.
def clearNaNs(self, df):
    # NaN values cannot be inserted properly into an Oracle table;
    # they have to be changed to None first.
    df = df.astype(object)
    df.where(pd.notnull(df), None, inplace=True)
    return df
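For context, a rough sketch of the executemany pattern described above, with a hypothetical connection string, target table and column names; the bind variable names (:ID, :X, :Y) must match the DataFrame column names so that df.to_dict('records') can feed cur.executemany directly. It does not solve the NULL-geometry case the question asks about.
import cx_Oracle
import pandas as pd

# Hypothetical source data and connection details, for illustration only.
df = pd.DataFrame({"ID": [1, 2], "X": [151.2, None], "Y": [-33.9, None]})
conn = cx_Oracle.connect("user/password@host:1521/service")
cur = conn.cursor()

# NaN -> None conversion, as in clearNaNs above.
df = df.astype(object)
df.where(pd.notnull(df), None, inplace=True)

sql = ("INSERT INTO TARGET_TABLE (ID, GEOM) VALUES "
       "(:ID, SDO_GEOMETRY(2001, 4283, MDSYS.SDO_POINT_TYPE(:X, :Y, NULL), NULL, NULL))")
cur.executemany(sql, df.to_dict('records'))
conn.commit()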
I want to insert a DataFrame into a Snowflake database table. The table has columns like id, which is a primary key, and event_id, which is an integer field and is also nullable.
I have created a declarative_base() class using SQLAlchemy, as shown below -
class AccountUsageLoginHistory(Base):
    __tablename__ = constants.TABLE_ACCOUNT_USAGE_LOGIN_HISTORY
    __table_args__ = {
        'extend_existing': True,
        'schema': os.environ.get('SCHEMA_NAME_AUDITS')
    }

    id = Column(Integer, Sequence('id_account_usage_login_history'), primary_key=True)
    event_id = Column(Integer, nullable=True)
The class stated above creates a table in the Snowflake database.
I have a DataFrame that has just one column, event_id.
When I try to insert the data using pandas' to_sql() method, Snowflake returns the error shown below -
snowflake.connector.errors.ProgrammingError: 100072 (22000): 01991f2c-0be5-c903-0000-d5e5000c6cee: NULL result in a non-nullable column
This error is generated by Snowflake because to_sql() is appending the column id and setting its value to NULL for each row of that column.
dataframe.to_sql(table_name, self.engine, index=False, method=pd_writer, if_exists="append")
Consider this as case 1 -
I tried to run an insert query directly to snowflake -
insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (ID, EVENT_ID) values(NULL, 33)
The query above returned me the same error -
NULL result in a non-nullable column
The query stated above is probably what the to_sql() method is doing under the hood.
Consider this as case 2 -
I also tried to insert a row by executing the query stated below -
insert into "SFOPT_TEST"."AUDITS"."ACCOUNT_USAGE_LOGIN_HISTORY" (EVENT_ID) values(33)
Now, this particular query executed successfully, inserting the data into the table and auto-generating a value for the id column.
How can I make pandas' to_sql() method use case 2?
Please note that pandas.DataFrame.to_sql() has the parameter index=True by default, which means it will add an extra column (df.index) when inserting the data.
Some databases like PostgreSQL have a serial data type, which allows you to fill a column with sequential, incrementing numbers.
Snowflake doesn't have that data type, but there are other ways to handle it:
First Option:
You can use the CREATE SEQUENCE statement to create a sequence directly in the DB (see the official Snowflake documentation on this topic). The downside of this approach is that you would need to convert your DataFrame into a proper SQL statement:
db preparation part:
CREATE OR REPLACE SEQUENCE schema.my_sequence START = 1 INCREMENT = 1;
CREATE OR REPLACE TABLE schema.my_table (i bigint, b text);
You would need to convert the DataFrame into Snowflake's INSERT statement and use schema.my_sequence.nextval to get the next ID value (a rough sketch of this conversion in Python follows the example result below):
INSERT INTO schema.my_table VALUES
(schema.my_sequence.nextval, 'string_1'),
(schema.my_sequence.nextval, 'string_2');
The result will be:
i b
1 string_1
2 string_2
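A rough sketch of that conversion, assuming a DataFrame df with a single column b and the table and sequence names from the example above (the connection parameters are placeholders):
import pandas as pd
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")

df = pd.DataFrame({"b": ["string_1", "string_2"]})

# Build one multi-row INSERT and let the sequence generate the i values.
rows = ",\n".join("(schema.my_sequence.nextval, %s)" for _ in range(len(df)))
sql = f"INSERT INTO schema.my_table (i, b) VALUES\n{rows}"

cur = conn.cursor()
cur.execute(sql, df["b"].tolist())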
Please note that there are some limitations to this approach: you need to ensure that each insert statement done this way succeeds, because calling schema.my_sequence.nextval without actually inserting the row will leave gaps in the numbering.
To avoid that, you can have a separate script that checks whether the current insert was successful and, if not, recreates the sequence by calling:
CREATE OR REPLACE SEQUENCE schema.my_sequence START = (SELECT max(i) FROM schema.my_table) INCREMENT = 1;
Alternative Option:
You would need to create an extra function that runs the SQL to get the last i you previously inserted:
SELECT max(i) AS max_i FROM schema.my_table;
and then update the index in your DataFrame before running to_sql()
df.index = range(max_i+1, len(df)+max_i+1)
This ensures that your DataFrame index continues from the last i in your table.
Once that is done you can use
df.to_sql(index_label='i', name='my_table', con=connection_object)
It will use your index as one of the columns you insert, allowing you to maintain a unique index in the table.
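Putting the alternative option together, a minimal sketch, assuming an existing SQLAlchemy engine and the schema.my_table example from above:
import pandas as pd
from sqlalchemy import text

df = pd.DataFrame({"b": ["string_3", "string_4"]})

with engine.begin() as connection:  # engine is an assumed SQLAlchemy engine
    # Find the last value of i already in the table (0 if the table is empty).
    max_i = connection.execute(
        text("SELECT max(i) AS max_i FROM schema.my_table")
    ).scalar() or 0

    # Continue the DataFrame index from the last i in the table.
    df.index = range(max_i + 1, len(df) + max_i + 1)

    # Insert the index as column i alongside the data.
    df.to_sql(index_label='i', name='my_table', schema='schema',
              con=connection, if_exists='append')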
I have a DataFrame that I upload to a SQL Server table. I am using SQLAlchemy and the to_sql method.
The data uploads into the table perfectly. Currently I have designed it so that the column names in my DataFrame and SQL table are the same. However, I was wondering whether this needs to be the case? Is there a way, when your DataFrame has a different column name from the SQL table, to specify some mapping? Or do you simply rename the column in your DataFrame?
from sqlalchemy import create_engine
engine = create_engine(engine_str)
conn = engine.connect()
df.to_sql(tbl_name, conn, if_exists='append', index=False)
I've had this situation when transferring data between tables; I used pandas.DataFrame.rename to map one set of columns to another before pushing the DataFrame back to SQL.
So, for example let's say that one table has the columns: Name, IPAddress, Folder
And your second table has the columns: name, ip, folder
You could read the first table with sqlalchemy into a dataframe:
source_data = pd.read_sql_table(source_table, con=engine)
Then create a conversion dictionary to convert the columns:
conv_dict = {
    'Name': 'name',
    'IPAddress': 'ip',
    'Folder': 'folder'
}

# rename the columns into a new dataframe
new_df = source_data.rename(columns=conv_dict)
Now you can put that new dataframe with the converted columns into your second table:
new_df.to_sql(dest_table, con=engine, if_exists='append', index=False)
Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
Has anyone experienced this before?
I have a table with "int" and "varchar" columns - a report schedule table.
I am trying to import an Excel file with a ".xls" extension into this table using a Python program. I am using pandas' to_sql to write 1 row of data.
Data imported is 1 row 11 columns.
Import works successfully but after the import I noticed that the datatypes in the original table have now been altered from:
int --> bigint
char(1) --> varchar(max)
varchar(30) --> varchar(max)
Any idea how I can prevent this? The switch in datatypes is causing issues in downstream routines.
df = pd.read_excel(schedule_file, sheet_name='Schedule')
params = urllib.parse.quote_plus(r'DRIVER={SQL Server};SERVER=<<IP>>;DATABASE=<<DB>>;UID=<<UDI>>;PWD=<<PWD>>')
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = create_engine(conn_str)
table_name = 'REPORT_SCHEDULE'
df.to_sql(name=table_name, con=engine, if_exists='replace', index=False)
TIA
Consider using the dtype argument of pandas.DataFrame.to_sql where you pass a dictionary of SQLAlchemy types to named columns:
import sqlalchemy
...
data.to_sql(name=table_name, con=engine, if_exists='replace', index=False,
            dtype={'name_of_datefld': sqlalchemy.types.DateTime(),
                   'name_of_intfld': sqlalchemy.types.INTEGER(),
                   'name_of_strfld': sqlalchemy.types.VARCHAR(length=30),
                   'name_of_floatfld': sqlalchemy.types.Float(precision=3, asdecimal=True),
                   'name_of_booleanfld': sqlalchemy.types.Boolean})
I think this has more to do with how pandas handles the table if it already exists. The "replace" value of the if_exists argument tells pandas to drop your table and recreate it. But when re-creating the table, pandas does it on its own terms, inferring column types from the data stored in that particular DataFrame.
While providing column datatypes will work, doing it for every such case might be cumbersome. So I would rather truncate the table in a separate statement and then just append data to it, like so:
Instead of:
df.to_sql(name=table_name, con=engine, if_exists='replace',index=False)
I'd do:
with engine.connect() as con:
    con.execute("TRUNCATE TABLE %s" % table_name)

df.to_sql(name=table_name, con=engine, if_exists='append', index=False)
The TRUNCATE statement removes all rows but keeps the table definition intact, so the columns retain the names and data types they had before.
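If you're on a newer SQLAlchemy (1.4/2.0), raw SQL strings need to be wrapped in text() and the statement committed; a minimal sketch, reusing table_name, conn_str and df from the question:
from sqlalchemy import create_engine, text

engine = create_engine(conn_str)

# engine.begin() commits the transaction when the block exits.
with engine.begin() as con:
    con.execute(text("TRUNCATE TABLE %s" % table_name))

df.to_sql(name=table_name, con=engine, if_exists='append', index=False)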
I have a PostgreSQL db. Pandas has a to_sql function to write the records of a DataFrame into a database. But I haven't found any documentation on how to update an existing database row using pandas when I'm finished with the DataFrame.
Currently I am able to read a database table into a DataFrame using pandas read_sql_table. I then work with the data as necessary. However, I haven't been able to figure out how to write that DataFrame back into the database to update the original rows.
I don't want to have to overwrite the whole table. I just need to update the rows that were originally selected.
One way is to make use of an SQLAlchemy "table class" together with session.merge(row_data) and session.commit().
Here is an example:
for i in range(len(df)):
    row_data = table_class(column_1=df.iloc[i]['column_name'],
                           column_2=df.iloc[i]['column_name'],
                           ...
                           )
    session.merge(row_data)
session.commit()
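For context, a minimal sketch of the pieces the snippet above assumes (a mapped class and a session); the table and column names here are hypothetical:
import pandas as pd
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class MyTable(Base):  # plays the role of table_class above
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    column_name = Column(String)

engine = create_engine("postgresql://user:password@localhost/databasename")
session = Session(engine)

df = pd.read_sql_table('my_table', con=engine)
# ... modify df as needed ...

# merge() matches rows on the primary key, so id must be included.
for i in range(len(df)):
    session.merge(MyTable(id=int(df.iloc[i]['id']),
                          column_name=df.iloc[i]['column_name']))
session.commit()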
For the SQLAlchemy case of reading a table as a df, changing the df, and then updating the table values based on the df, I found df.to_sql to work with name=<table_name>, index=False, if_exists='replace'.
This should replace the old values in the table with the ones you changed in the df.
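A minimal sketch of that round trip, assuming a hypothetical table my_table with a status column and an existing SQLAlchemy engine (keep in mind that if_exists='replace' recreates the table, so the caveats about lost column types and constraints from the earlier answers apply):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/databasename")

# Read the table, change some values, then write it back.
df = pd.read_sql_table('my_table', con=engine)
df.loc[df['status'] == 'old', 'status'] = 'new'  # hypothetical column/values

df.to_sql(name='my_table', con=engine, index=False, if_exists='replace')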