I have a PostgreSQL database. Pandas has a to_sql function to write the records of a dataframe into a database, but I haven't found any documentation on how to update existing database rows with pandas once I'm finished with the dataframe.
Currently I am able to read a database table into a dataframe using pandas read_sql_table. I then work with the data as necessary. However, I haven't been able to figure out how to write that dataframe back into the database to update the original rows.
I don't want to overwrite the whole table; I just need to update the rows that were originally selected.
One way is to use a SQLAlchemy table class together with session.merge(row_data) and session.commit(). Here is an example:
for i in range(len(df)):
    row_data = table_class(
        column_1=df.iloc[i]['column_name_1'],
        column_2=df.iloc[i]['column_name_2'],
        # ... remaining columns as needed
    )
    session.merge(row_data)
session.commit()  # commit once after merging all rows
For the SQLAlchemy case of reading a table into a dataframe, changing the dataframe, and then updating the table's values from it, I found df.to_sql to work with name=<table_name>, index=False, if_exists='replace'.
This replaces the old values in the table with the ones you changed in the dataframe, though note that it does so by dropping and recreating the whole table rather than updating rows in place.
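A minimal sketch of that approach (the connection string, table name, and the status column used for the example change are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")  # placeholder connection string

df = pd.read_sql_table("my_table", engine)
df.loc[df["status"] == "old", "status"] = "new"  # example change to the dataframe

# Drops and recreates my_table with the dataframe's current contents.
df.to_sql("my_table", engine, index=False, if_exists="replace")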
I am trying to do a mass insert of data in Python from one table to another, where one of the fields is an Oracle SDO_GEOMETRY(2001,4283,MDSYS.SDO_POINT_TYPE(X,Y,NULL),NULL,NULL), using the cx_Oracle library. I do have null values for this field. Is there a way to do this?
I have tried:
reading the data from the source table into a pandas dataframe,
creating an insert statement for the cx_Oracle cursor, where the bind variable for the field's value is the column name in the pandas dataframe,
inserting with cur.executemany(sql, df.to_dict('records')), where sql is the SQL statement, df is the dataframe of values to insert converted to a list of dictionaries, and cur is the cx_Oracle cursor.
Reading the data in as a pandas dataframe, changing the field to type object, and then changing the null fields to None using the clearNaNs function below seems to work, but only up to 100,000 rows, and only if at least one row is not null. (A rough sketch of the insert step follows the function.)
import pandas as pd

def clearNaNs(df):
    # NaNs in the dataframe cannot be inserted properly into an Oracle table;
    # they have to be changed to None.
    df = df.astype(object)
    df.where(pd.notnull(df), None, inplace=True)
    return df
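For reference, the insert step I am describing looks roughly like the sketch below, where df is the dataframe read from the source table; the connection string, table name, and column names are placeholders, and I am not sure the constructor shown handles fully NULL geometries:

import cx_Oracle

con = cx_Oracle.connect("user/password@host/service")  # placeholder credentials
cur = con.cursor()

df = clearNaNs(df)  # NaN -> None so cx_Oracle binds NULLs

sql = """INSERT INTO target_table (id, geom)
         VALUES (:id, SDO_GEOMETRY(2001, 4283,
                 MDSYS.SDO_POINT_TYPE(:x, :y, NULL), NULL, NULL))"""

cur.executemany(sql, df[["id", "x", "y"]].to_dict("records"))
con.commit()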
Hello, I want to write my pandas dataframe to a PostgreSQL table, and I want to use the existing DB schema as well.
I am doing:
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("ss.csv")
table_name = "some_name"
engine = create_engine("postgresql://postgres:password@localhost/databasename")
df.to_sql(table_name, engine, schema="public", if_exists='replace', index=False)
This, however, replaces the whole table, including the column names defined by the existing schema. I want to update the contents of the table but keep the column names.
Even when I tried if_exists='replace', I get "column of relation does not exist", because in the schema the column is named timestamp but in the dataframe it is Timestamp.
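What I am after is something like the sketch below (renaming the dataframe column to match the schema, clearing the existing rows, and appending instead of replacing), though I am not sure this is the right approach; the truncate step is only my assumption of what "update the content" should mean:

from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine("postgresql://postgres:password@localhost/databasename")
df = pd.read_csv("ss.csv").rename(columns={"Timestamp": "timestamp"})

with engine.begin() as conn:
    conn.execute(text("TRUNCATE TABLE public.some_name"))  # keep the table and its schema, drop the old rows

df.to_sql("some_name", engine, schema="public", if_exists='append', index=False)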
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting that table from the database, and then adding the updated table back to the database using pd.DataFrame.to_sql (but this would be inefficient). A rough sketch is below.
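Roughly, that backup approach would look like this (placeholder paths and table name, the datetime column assumed to be called timestamp, and concat/drop_duplicates used in place of merge for brevity):

import sqlite3
import pandas as pd

con = sqlite3.connect("my_database.db")  # placeholder path
new = pd.read_csv("new_data.csv")        # new rows with the same ten columns

old = pd.read_sql_query("SELECT * FROM my_table", con)

# Keep the new version of any row whose timestamp already exists.
merged = pd.concat([old, new]).drop_duplicates(subset="timestamp", keep="last")

# Rewrite the whole table with the merged result (this is the inefficient part).
merged.to_sql("my_table", con, if_exists="replace", index=False)
con.close()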
Instead of going through the UPSERT command, why don't you create your own algorithm that finds values and replaces them if the date and time are found, and otherwise inserts a new row? Check out the code I wrote for you, and let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name in the algorithm with a variable and changing it for the whole list of your table names.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # your CSV data path

def manual_upsert():
    con = sqlite3.connect(connection_str)  # connection_str is the path to your SQLite file
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")  # read the existing rows from the table
    data = cur.fetchall()

    old_data_list = []  # collection of all dates already in the database table
    for line in data:
        old_data_list.append(line[0])  # I suppose your date column is at index 0

    for new_data in csv_data.values:  # iterate over the rows of the CSV, not its column names
        if new_data[0] in old_data_list:
            # It will update the other columns based on the date if the condition is true.
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # It will insert a new row if the date is not found.
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that describes how to use it, but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements will be executed in bulk.
As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
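For illustration, here is a minimal sketch of that native UPSERT run in bulk inside a single transaction; it assumes SQLite 3.24 or newer, a table named events keyed on a timestamp column, and three placeholder value columns:

import sqlite3
import pandas as pd

con = sqlite3.connect("my_database.db")  # placeholder path
new = pd.read_csv("new_data.csv")        # new rows with the same columns as the table

sql = """
INSERT INTO events (timestamp, col1, col2, col3)
VALUES (?, ?, ?, ?)
ON CONFLICT(timestamp) DO UPDATE SET
    col1 = excluded.col1,
    col2 = excluded.col2,
    col3 = excluded.col3
"""

with con:  # one transaction for the whole batch
    con.executemany(sql, new[["timestamp", "col1", "col2", "col3"]].itertuples(index=False))

con.close()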
What is the most efficient way to query my SQL (T-SQL) database when I want to inner join the queried data onto a pandas dataframe afterwards?
I don't know how to pass information into SQL from Python via a pyodbc query, so my current best idea is to form the query in a way that I know aligns with my Python dataframe (i.e. I know all the information has STARTDATE > 2016, so it's easy for me to request that, and I know that PRODUCT = Private_Car). However, if I use:
SELECT *
FROM rmrClaim
WHERE (PRODUCT = 'Private_Car') AND (YEAR >= 2016)
I am still going to bring in far more data than necessary. What I would rather be able to do is select only data which contains my merge key (ID) from the SQL DB.
Is there a more efficient way to query the DB so that given a pandas dataframe I can only bring the data which I will need for inner joining afterwards?
Can I pass a list from python into a sql query using PYODBC?
Edit - Trying to phrase differently:
I have a dataframe from a CSV (dataframe A), and I want to take data from my SQL DB to produce a dataframe (dataframe B). The data in my SQL DB is much, much larger than the data in dataframe A, so I want to be able to send a SQL query that only requests data that is within dataframe A, so that I don't end up with a dataframe B which is 10x larger than dataframe A. My current idea is to use knowledge I have of dataframe A (i.e. that all of the data in dataframe A is after 2016); however, if there is a way to pass a list into my SQL query, I can query a subset of the data more efficiently.
Use pyodbc and write your query before passing it to the pandas dataframe. Here is an example, binding the IDs from dataframe A as parameters rather than formatting them into the SQL string:
import pandas as pd
import pyodbc
connstr = "Driver={SQL Server};Server=MSSQLSERVER;Database=Claims;Trusted_Connection=yes;"
# Build one '?' placeholder per ID in dataframe A and bind the values.
ids = dfA['ID'].tolist()
sql = ("SELECT * FROM rmrClaim WHERE (PRODUCT = 'Private_Car') AND (YEAR >= 2016) "
       "AND ID IN ({})".format(",".join("?" * len(ids))))
df = pd.read_sql(sql, pyodbc.connect(connstr), params=ids)
df
I am using the Python code below to update a Postgres DB column value based on Id. This loop has to run for thousands of records and it is taking a long time.
Is there a way to pass an array of dataframe values instead of looping over each row?
for i in range(0, len(df)):
    QUERY = """ UPDATE "Table" SET "value"='%s' WHERE "Table"."id"='%s'
            """ % (df['value'][i], df['id'][i])
    cur.execute(QUERY)
    conn.commit()
It depends on the library you use to communicate with PostgreSQL, but bulk loads are usually much faster via the COPY FROM command.
If you use psycopg2, it is as simple as the following:
cursor.copy_from(io.StringIO(string_variable), "destination_table", columns=('id', 'value'))
where string_variable is a tab- and newline-delimited dataset like 1\tvalue1\n2\tvalue2\n.
To achieve a performant bulk update, I would do the following (see the sketch below):
Create a temporary table: CREATE TEMPORARY TABLE tmp_table;
Insert the records with copy_from;
Update the destination table with UPDATE destination_table SET value = t.value FROM tmp_table t WHERE destination_table.id = t.id, or any other preferred syntax.
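A minimal sketch of those three steps with psycopg2, where df is the dataframe from the question and the quoted "Table"/"id"/"value" identifiers come from it; the connection string and the temporary table's column types are placeholder assumptions:

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres password=secret")  # placeholder
cur = conn.cursor()

# 1. Temporary table holding just the columns we need (adjust the types to your schema).
cur.execute("CREATE TEMPORARY TABLE tmp_table (id integer, value text)")

# 2. Bulk-load the dataframe's id/value pairs via COPY.
buf = io.StringIO()
df[['id', 'value']].to_csv(buf, sep='\t', header=False, index=False)
buf.seek(0)
cur.copy_from(buf, 'tmp_table', columns=('id', 'value'))

# 3. Update the destination table from the temporary table in one statement.
cur.execute('UPDATE "Table" SET "value" = t.value FROM tmp_table t WHERE "Table"."id" = t.id')

conn.commit()
cur.close()
conn.close()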