How to update a table (accessed in pandas) in a DuckDB database? - python

I'm working on a use case where a large volume of records is stored in a DuckDB database table. These tables are read into a pandas dataframe, the data manipulations are done there, and the results are sent back to the DB table. Below I explain my case.
I have a DuckDB database called MY_DB with a table ROLL_TABLE_A; this table is queried and converted into a pandas dataframe _df.
The same table (ROLL_TABLE_A) can be accessed by multiple users, who make the required updates on the dataframe _df.
How do I upload the dataframe _df back into the same table ROLL_TABLE_A?
Steps to reproduce:
import duckdb

# Connection and cursor creation
dbas_db_con = duckdb.connect('MY_DB.db')
# List the tables in the DB
dbas_db_con.execute("SHOW TABLES").df()
# Query the DB table
dbas_db_con.execute("SELECT * FROM ROLL_TABLE_A").df()
# Convert the database table to a pandas dataframe
_df = dbas_db_con.execute("SELECT * FROM ROLL_TABLE_A").df()
The id field of _df is then filled in by multiple users, and after those updates the dataframe holds the new values. This updated dataframe needs to be written back into the ROLL_TABLE_A table in DuckDB, so that re-querying the table
dbas_db_con.execute("SELECT * FROM ROLL_TABLE_A").df()
returns the updated data.

Here is a function that takes a dataframe, table name and database path as input and writes the dataframe to the table:
import duckdb
import pandas as pd

def df_to_duckdb(df: pd.DataFrame, table: str, db_path: str):
    con = duckdb.connect(database=db_path, read_only=False)
    # Register the df in the database so it can be queried
    con.register("df", df)
    # Replace the target table with the contents of the dataframe
    query = f"create or replace table {table} as select * from df"
    con.execute(query)
    con.close()
The part that took me a while to figure out is registering the df as a relation (e.g. a table/view) in the database. Registering does not write the df to the database; it essentially creates a pointer within the database that references the df in memory.
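For completeness, a minimal usage sketch against the question's setup (an assumption-laden example, not part of the original answer); it writes _df back and re-reads the table to verify:

# Write the updated dataframe back to ROLL_TABLE_A, then re-read to check
df_to_duckdb(_df, "ROLL_TABLE_A", "MY_DB.db")

con = duckdb.connect("MY_DB.db")
print(con.execute("SELECT * FROM ROLL_TABLE_A").df())
con.close()

Note that create or replace table rewrites the whole table, so if several users write back their own copies, the last write wins.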

Related

Deleting duplicate rows in "large" sqlite table takes too much time (Python)

I have a relatively small sqlite3 database (~2.6GB) with 820k rows and 26 columns (single table). I run an iterative process, and every time new data is generated, the data is placed in a pandas dataframe, and inserted into the sqlite database with the function insert_values_to_table. This process operates fine and is very fast.
After every data insert, the database is sanitized of duplicate rows (a row counts as a duplicate only if all 26 columns match) with the function sanitize_database. This operation connects to the database in a similar fashion, creates a cursor, and executes the following logic: create a new temporary_table with only the unique values from the original table --> delete all rows from the original table --> insert all rows from the temporary table into the empty original table --> drop the temporary table.
It works, but the sanitize_database function is extremely slow and can easily take up to an hour even for this small dataset. I tried to set a certain column as a primary key or make it unique; however, pandas.DataFrame.to_sql does not allow for this, as it can either insert the whole dataframe at once or none at all. That functionality can be reviewed here (append_skipdupes).
Is there a method to make this process more efficient?
# Function to insert pandas dataframes into SQLITE3 database
def insert_values_to_table(table_name, output):
    conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/table_name.db")
    # If connection exists, perform data insertion
    if conn is not None:
        c = conn.cursor()
        # Add pandas data (output) into sql database
        output.to_sql(name=table_name, con=conn, if_exists='append', index=False)
        # Close connection
        conn.close()
        print('SQL insert process finished')

# To keep only unique rows in SQLITE3 database
def sanitize_database():
    conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/table_name.db")
    c = conn.cursor()
    c.executescript("""
        CREATE TABLE temp_table as SELECT DISTINCT * FROM table_name;
        DELETE FROM table_name;
        INSERT INTO table_name SELECT * FROM temp_table;
        DROP TABLE temp_table
        """)
    conn.close()
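A hedged sketch of one way to make this cheaper (not from the original post): add a UNIQUE index across all columns once (on an already de-duplicated table), then insert new rows with INSERT OR IGNORE so duplicates are rejected at insert time instead of being cleaned up afterwards. The index name and the assumption that the dataframe columns match the table are placeholders:

import sqlite3

def insert_unique_rows(db_path, table_name, df):
    # Sketch: rely on a UNIQUE index so SQLite skips duplicate rows on insert.
    # "all_cols_unique" and the column handling below are hypothetical.
    conn = sqlite3.connect(db_path)
    cols = list(df.columns)
    col_list = ", ".join(cols)
    placeholders = ", ".join(["?"] * len(cols))
    # Create the index once; this only succeeds on a table without duplicates.
    conn.execute(
        f"CREATE UNIQUE INDEX IF NOT EXISTS all_cols_unique "
        f"ON {table_name} ({col_list})"
    )
    # INSERT OR IGNORE silently drops rows that would violate the unique index.
    conn.executemany(
        f"INSERT OR IGNORE INTO {table_name} ({col_list}) VALUES ({placeholders})",
        df.itertuples(index=False, name=None),
    )
    conn.commit()
    conn.close()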

pandas to_sql() replace table but keep column names

Hello, I want to write my pandas dataframe to a PostgreSQL table, and I want to use the existing DB schema as well.
I am doing:
df = pd.read_csv("ss.csv")
table_name = "some_name"
engine = create_engine("postgresql://postgres:password@localhost/databasename")
df.to_sql(table_name,engine, schema="public",if_exists='replace',index=False)
This, however, replaces the whole table, including the column names defined by the existing schema. I want to update the content of the table but keep the column names.
Even when I tried if_exists='replace', I get "column of relation does not exist", because in the schema the column is named timestamp but in the df it is Timestamp.
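A hedged sketch of one common workaround (not part of the original post): rename the dataframe columns to match the existing schema, delete the existing rows, and then append with if_exists='append' so the table definition is kept. Table, schema, and column names are taken from the question; the DELETE step and the rename mapping are assumptions:

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://postgres:password@localhost/databasename")

df = pd.read_csv("ss.csv")
df = df.rename(columns={"Timestamp": "timestamp"})  # match the DB column name

# Keep the table (and its schema) but remove the old rows
with engine.begin() as conn:
    conn.execute(text("DELETE FROM public.some_name"))

# Append into the existing table so its column names and types are preserved
df.to_sql("some_name", engine, schema="public", if_exists="append", index=False)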

Incorporate Pandas Data Frame in Query to Database

I'm trying to figure out how to treat a pandas data frame as a SQL table when querying a database in Python.
I'm coming from a SAS background where work tables can easily be incorporated into direct database queries.
For example:
Select a.first_col,
b.second_col
from database.table1 a
left join work.table1 b on a.id = b.id;
Here work.table1 is not in the database, but is a dataset held in the local SAS server.
In my research I have found ways to write a data frame to a database and then include that in the query. I do not have write access to the database, so that is not an option for me.
I also know that I can use sqlalchemy with pd.to_sql() to put a data frame into a SQL engine, but I can't figure out if there is a way to connect that engine with the pyodbc connection I have with the database.
I also tried this though I didn't think it would work (names of tables and columns altered).
df = pd.DataFrame(['A342', 'B432', 'W345'], columns=['id'])
query = '''
select a.id, b.id
from df a
left join database.base_table b on a.id= b.id
'''
query_results = pd.read_sql_query(query,connection)
As I expected it didn't work.
I'm connecting to a Netezza database, I'm not sure if that matters.
I don't think it is possible; it would have to be written to its own table in order to be queried, although I'm not familiar with Netezza.
Why not perform the join (a "merge" in pandas) purely in pandas? Understandably, if the SQL table is massive this isn't feasible, but:
query = """
SELECT id, x
FROM a
"""
a = pd.read_sql(query, conn)
df = ...  # some dataframe in memory
pd.merge(df, a, on='id', how='left')
see https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html for good docs about sql and pandas similarities
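To make the parallel with the SAS example concrete, here is a small self-contained sketch with toy data (the column names first_col/second_col come from the SAS snippet; the values are invented):

import pandas as pd

# Stand-in for the table pulled from the database (database.table1 in the SAS example)
a = pd.DataFrame({"id": [1, 2, 3], "first_col": ["x", "y", "z"]})

# Stand-in for the local work table (work.table1)
b = pd.DataFrame({"id": [2, 3, 4], "second_col": [20, 30, 40]})

# Equivalent of: select a.first_col, b.second_col from a left join b on a.id = b.id
result = pd.merge(a, b, on="id", how="left")[["first_col", "second_col"]]
print(result)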

Insert pandas dataframe into actian PSQL database table using python

I want to import the data of the file "save.csv" into my Actian PSQL database table "new_table", but I got this error:
ProgrammingError: ('42000', "[42000] [PSQL][ODBC Client Interface][LNA][PSQL][SQL Engine]Syntax Error: INSERT INTO 'new_table'<< ??? >> ('name','address','city') VALUES (%s,%s,%s) (0) (SQLPrepare)")
Below is my code:
import pandas as pd
import pyodbc

connection = 'Driver={Pervasive ODBC Interface};server=localhost;DBQ=DEMODATA'
db = pyodbc.connect(connection)
c = db.cursor()
# create table i.e. new_table
csv = pd.read_csv(r"C:\Users\user\Desktop\save.csv")
for row in csv.iterrows():
    insert_command = """INSERT INTO new_table(name,address,city) VALUES (row['name'],row['address'],row['city'])"""
    c.execute(insert_command)
    c.commit()
Pandas has a built-in function, DataFrame.to_sql(), that empties a pandas dataframe into a SQL database. This might be what you are looking for. Using it, you don't have to manually insert one row at a time; you can insert the entire dataframe at once.
If you want to keep using your method, the issue might be that the table "new_table" hasn't been created in the database yet, and thus you first need something like this:
CREATE TABLE new_table
(
Name [nvarchar](100) NULL,
Address [nvarchar](100) NULL,
City [nvarchar](100) NULL
)
EDIT:
You can use to_sql() like this on tables that already exist in the database:
df.to_sql(
    "new_table",
    schema="name_of_the_schema",
    con=c.session.connection(),
    if_exists="append",  # <--- This will append to an already existing table
    chunksize=10000,
    index=False,
)
I have tried the same; in my case the table is already created. I just want to insert each row from the pandas dataframe into the database using Actian PSQL.
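If row-by-row inserts are really the goal, here is a hedged sketch using pyodbc parameter markers; the connection string, file path, and column names are copied from the question, and the use of ? placeholders is the standard ODBC convention rather than anything Actian-specific that is confirmed here:

import pandas as pd
import pyodbc

connection = 'Driver={Pervasive ODBC Interface};server=localhost;DBQ=DEMODATA'
db = pyodbc.connect(connection)
c = db.cursor()

csv = pd.read_csv(r"C:\Users\user\Desktop\save.csv")

# Use ? placeholders so pyodbc binds the values instead of them
# ending up literally inside the SQL string.
insert_command = "INSERT INTO new_table (name, address, city) VALUES (?, ?, ?)"
for _, row in csv.iterrows():
    c.execute(insert_command, row['name'], row['address'], row['city'])

db.commit()
c.close()
db.close()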

Build a normalised MSSQL DB from CSV files in Python + Pandas + sqlAlchemy

I am learning by doing - Python, Pandas, SQL & Deep Learning. I want to build a database with data for a deep learning experiment (with Keras and Tensorflow). The source data is ~10GB (total) of forex timestamped bid/ask data in 8 CSV files with source information encoded as three 3-4 char strings for categories Contributor, Region and City.
I can connect to my (empty) MSSQL database via pyodbc and sqlAlchemy; I can read my CSV files into dataframes; I can create a simple table in the DB and even create one from a simple dataframe; I can convert the date and time fields into the milliseconds since epoch I want. (And, FWIW, I have already implemented a working toy LSTM model to adapt to the price data, and I also have some analytical functions I wrote and compiled in Mathematica; I'll either call the C from Python or get Mathematica to work directly on the database.)
The issue is putting the CSV data into the database. Since there are only a dozen or so different sources in each category, I believe I should put Contributor etc. into separate tables with e.g. Contributor_ID as ints, so that the data is stored compactly and queries like SELECT ... WHERE Region = "SHRUBBERY" are efficient. (AFAICT I definitely shouldn't use enums because I may get more sources and categories later.)
My question is - assuming the aforementioned high level of ignorance! - how can/should I a) create the tables and relationships using python and then b) populate those tables?
Optional extra: to save space, the CSV files omit the Region and City where the row values are the same as those of the row above. Reading the CSVs to collect just the source information (which takes about 50 s per category), I know how to deduplicate and dropna, but when I want to populate the DB, how can I most efficiently replace the NAs with the values from the previous row? A simple for loop would do it, but is there e.g. some way to "propagate" the last "real" value in a column to replace the NA using pandas?
CSV example:
Date Time Bid Price Ask Price Contributor Region City
04/02/2017 00:00.1 1.00266 1.00282 DCFX ASI AKL
04/02/2017 00:00.1 1.00263 1.0028 DCFX
04/02/2017 00:00.2 1.00224 1.00285 FXN NAM NYC
04/02/2017 00:00.2 1.00223 1.00288 FXN
All input gratefully received :)
Relational databases (RDBMS) aim to store data in related, logical groupings with a system of primary/foreign keys to normalize storage, which among other advantages maintains referential integrity (i.e., no orphaned records) and avoids repetition of stored data. For your situation, consider the following:
DATABASE DESIGN: Understand the workflow or "story" of your data pieces (e.g., which comes first/after in data entry) and construct the necessary schema of tables. The classic Database 101 example is Customers-Products-Orders, where many customers can purchase multiple products to fill many orders (1-to-many and many-to-many relationships), and where the primary keys of parent tables become foreign keys of child tables. Hence, aim for a schema layout like the one in this SO answer.
For your needs, your schema may involve Contributors, Regions, Cities, Markets, Company (Ticker), and Prices. This step will make use of DDL commands (CREATE TABLE, CREATE INDEX, CREATE SCHEMA), which can be run in pyodbc cursors or sqlAlchemy engine calls, provided the connected user has such privileges.
But typically, database design commands are run in a specialized admin console/IDE or command-line tool rather than in application-layer code like Python: SQL Server's Management Studio or sqlcmd, and similarly Oracle's SQL Developer/sqlplus, MySQL's Workbench/cli, or PostgreSQL's PgAdmin/psql. Below is an example setup for the Prices table:
# INITIALIZE SQLALCHEMY ENGINE
from sqlalchemy import create_engine

connection_string = 'mssql+pyodbc://{}:{}@{}/{}'\
    .format(db_user, db_password, db_server, db_database)
engine = create_engine(connection_string)

sql = """
CREATE TABLE Prices (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    DateTime DATETIME,
    BidPrice DECIMAL(10,4),
    AskPrice DECIMAL(10,4),
    ContributorID INT,
    RegionID INT,
    CityID INT,
    CONSTRAINT FK_Contributor FOREIGN KEY (ContributorID) REFERENCES Contributors (ID),
    CONSTRAINT FK_Region FOREIGN KEY (RegionID) REFERENCES Regions (ID),
    CONSTRAINT FK_City FOREIGN KEY (CityID) REFERENCES Cities (ID)
)
"""

# SQL ACTION QUERY VIA TRANSACTION
with engine.begin() as conn:
    conn.execute(sql)
DATA POPULATION: Because a dataset/dataframe, CSV, or spreadsheet is NOT equivalent to a normalized RDBMS table but is actually the result of a query over multiple tables, migrating these sources will require some SQL wrangling to align with your schema above. Simply uploading dataframes into SQL Server tables will lead to inefficient and repetitive storage. Therefore, consider the steps below:
Staging Tables (using to_sql)
Use staging (temp) tables, which would be raw dumps from pandas. For the NAs issue, use the DataFrame or Series forward fill, ffill, to populate NAs from the rows above.
# FILL IN NAs IN ALL COLUMNS FROM PREVIOUS ROW
df = df.ffill() # OR df.fillna(method='ffill')
# FILL IN NAs FOR SPECIFIC COLUMNS
df['Region'] = df['Region'].ffill()
df['City'] = df['City'].ffill()
# DUMP DATA INTO DATA FRAME
df.to_sql(name='pandas_prices_dump', con=engine, if_exists='replace', index=False)
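The final INSERT below assumes the Contributors, Regions, and Cities lookup tables are already filled. A hedged sketch of populating them from the staging dump, following the plain-string execute style used above (the *Name columns and identity ID columns are assumptions consistent with the joins that follow):

# Hypothetical lookup-table population from the staging dump; assumes each
# lookup table has an identity ID column plus the *Name column used below.
lookup_sqls = [
    """INSERT INTO Contributors (ContributorName)
       SELECT DISTINCT Contributor FROM pandas_prices_dump
       WHERE Contributor NOT IN (SELECT ContributorName FROM Contributors)""",
    """INSERT INTO Regions (RegionName)
       SELECT DISTINCT Region FROM pandas_prices_dump
       WHERE Region NOT IN (SELECT RegionName FROM Regions)""",
    """INSERT INTO Cities (CityName)
       SELECT DISTINCT City FROM pandas_prices_dump
       WHERE City NOT IN (SELECT CityName FROM Cities)""",
]

with engine.begin() as conn:
    for stmt in lookup_sqls:
        conn.execute(stmt)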
Migration to Final Tables (joining lookup tables by string names)
Then, run action queries (i.e., DML commands: INSERT INTO, UPDATE, DELETE) for populating final tables from staging, temp tables.
sql = """
INSERT INTO Prices (Datetime, BidPrice, AskPrice,
ContributorID, RegionID, CityID)
SELECT pd.Datetime, pd.BidPrice, pd.AskPrice, c.ID, r.ID, cy.ID
FROM pandas_prices_dump pd
INNER JOIN Contributors c
ON c.ContributorName = pd.Contributor
INNER JOIN Regions r
ON r.RegionName = pd.Region
INNER JOIN Cities cy
ON cy.CityName = pd.City
"""
# APPEND FINAL DATA
with engine.begin() as conn:
conn.execute(sql)
# DROP STAGING TABLE
with engine.begin() as conn:
conn.execute("DROP TABLE pandas_prices_dump")
Test/Check Final Tables (using read_sql, joining lookup tables by IDs)
# IMPORT INTO PANDAS (EQUIVALENT TO ORIGINAL df)
sql = """
SELECT p.Datetime, p.BidPrice, p.AskPrice,
c.ContributorName As Contributor, r.RegionName As Region,
cy.CityName As City
FROM Prices p
INNER JOIN Contributors c
ON c.ID = p.ContributorID
INNER JOIN Regions r
ON r.ID = p.RegionID
INNER JOIN Cities cy
ON cy.ID = p.CityID
"""
prices_data = pd.read_sql(sql, engine)
