DataFrame to PostgreSQL DB - Python

I query 4 hours of data from a source PLC MS SQL database, process it with Python, and write the data to the main PostgreSQL table.
When writing to the main Postgres table every hour, the previous 3 hours of data are duplicates; they violate the primary key, abort the transaction, and raise a Python error.
So, every hour I:
create a temp PostgreSQL table without any key,
then copy the pandas DataFrame to the temp table,
then insert the rows from the temp table into the main PostgreSQL table,
and drop the temp PostgreSQL table.
This Python script runs hourly in Windows Task Scheduler.
Below is my code.
import io
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
conn = engine.raw_connection()
cur = conn.cursor()
# create the keyless temp table
cur.execute("""CREATE TABLE public.table_temp
(
    datetime timestamp without time zone NOT NULL,
    tagid text COLLATE pg_catalog."default" NOT NULL,
    mc text COLLATE pg_catalog."default" NOT NULL,
    value text COLLATE pg_catalog."default",
    quality text COLLATE pg_catalog."default"
)
TABLESPACE pg_default;
ALTER TABLE public.table_temp
    OWNER to postgres;""")
# dump the DataFrame into an in-memory buffer and COPY it into the temp table
output = io.StringIO()
df.to_csv(output, sep='\t', header=False, index=False)
output.seek(0)
cur.copy_from(output, 'table_temp', null="")
# move the rows into the main table, skipping duplicates, then drop the temp table
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
cur.execute("""DROP TABLE table_temp CASCADE;""")
conn.commit()
I would like to know if there is a more efficient/faster way to do this.

If I'm correct in assuming that the data is in the DataFrame, you should just be able to do:
engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
df = df.drop_duplicates(subset=None)  # replace None with the list of columns that define the primary key, e.g. ['column_name1', 'column_name2']
df.to_sql('table_main', engine, if_exists='append', index=False)  # index=False so the DataFrame index is not written as an extra column
Edit due to comment:
If that's the case, you have the right idea. You can make it more efficient by using to_sql to insert the data into the temp table first, like so:
engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')
df.to_sql('table_temp', engine, if_exists='replace', index=False)
conn = engine.raw_connection()
cur = conn.cursor()
cur.execute("""INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING;""")
# cur.execute("""DROP TABLE table_temp CASCADE;""")  # you can drop it if you want to, but the replace option in to_sql will drop and recreate the table
conn.commit()
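For completeness, here is a minimal sketch of the whole hourly job built on that idea. The to_sql call stages the DataFrame in the keyless table, and the INSERT ... ON CONFLICT runs inside engine.begin(), which commits (or rolls back on error) automatically. It assumes the df, table names, and placeholder connection string from the question; table_main is assumed to already exist with its primary key, and the push_hourly wrapper name is only for illustration.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://postgres:postgres@host:port/dbname?gssencmode=disable')

def push_hourly(df: pd.DataFrame) -> None:
    # stage the DataFrame in a keyless table; 'replace' drops and recreates it each run
    df.to_sql('table_temp', engine, if_exists='replace', index=False)
    # move the rows into the main table, silently skipping primary-key duplicates
    with engine.begin() as conn:
        conn.execute(text(
            "INSERT INTO public.table_main SELECT * FROM table_temp ON CONFLICT DO NOTHING"
        ))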

Related

How to insert values into a postgresql database with serial id using sqlalchemy

I have a function that I use to update tables in PostgreSQL. It works great for avoiding duplicate insertions by creating a temp table and dropping it upon completion. However, I have a few tables with serial IDs, and I have to pass the serial ID in a column; otherwise, I get an error that the keys are missing. How can I insert values into those tables and have the serial key assigned automatically? I would prefer to modify the function below if possible.
def export_to_sql(df, table_name):
    from sqlalchemy import create_engine
    engine = create_engine(f'postgresql://{user}:{password}@{host}:5432/{user}')
    df.to_sql(con=engine,
              name='temporary_table',
              if_exists='append',
              index=False,
              method='multi')
    with engine.begin() as cnx:
        insert_sql = f'INSERT INTO {table_name} (SELECT * FROM temporary_table) ON CONFLICT DO NOTHING; DROP TABLE temporary_table'
        cnx.execute(insert_sql)
Code used to create the tables:
CREATE TABLE symbols
(
    symbol_id serial NOT NULL,
    symbol varchar(50) NOT NULL,
    CONSTRAINT PK_symbols PRIMARY KEY ( symbol_id )
);
CREATE TABLE tweet_symols(
    tweet_id varchar(50) REFERENCES tweets,
    symbol_id int REFERENCES symbols,
    PRIMARY KEY (tweet_id, symbol_id),
    UNIQUE (tweet_id, symbol_id)
);
CREATE TABLE hashtags
(
    hashtag_id serial NOT NULL,
    hashtag varchar(140) NOT NULL,
    CONSTRAINT PK_hashtags PRIMARY KEY ( hashtag_id )
);
CREATE TABLE tweet_hashtags
(
    tweet_id varchar(50) NOT NULL,
    hashtag_id integer NOT NULL,
    CONSTRAINT FK_344 FOREIGN KEY ( tweet_id ) REFERENCES tweets ( tweet_id )
);
CREATE INDEX fkIdx_345 ON tweet_hashtags
(
    tweet_id
);
The INSERT statement does not define the target columns, so PostgreSQL will attempt to insert values into a column that was defined as SERIAL.
We can work around this by providing a list of target columns, omitting the serial types. To do this we use SQLAlchemy to fetch the metadata of the table that we are inserting into from the database, then make a list of target columns. SQLAlchemy doesn't tell us if a column was created using SERIAL, but we will assume that it is if it is a primary key and is set to autoincrement. Primary key columns defined with GENERATED ... AS IDENTITY will also be filtered out; this is probably desirable as they behave in the same way as SERIAL columns.
import sqlalchemy as sa

def export_to_sql(df, table_name):
    engine = sa.create_engine(f'postgresql://{user}:{password}@{host}:5432/{user}')
    df.to_sql(con=engine,
              name='temporary_table',
              if_exists='append',
              index=False,
              method='multi')
    # Fetch table metadata from the database
    table = sa.Table(table_name, sa.MetaData(), autoload_with=engine)
    # Get the names of columns to be inserted,
    # assuming auto-incrementing PKs are serial types
    column_names = ','.join(
        [f'"{c.name}"' for c in table.columns
         if not (c.primary_key and c.autoincrement)]
    )
    with engine.begin() as cnx:
        insert_sql = sa.text(
            f'INSERT INTO {table_name} ({column_names}) (SELECT * FROM temporary_table) ON CONFLICT DO NOTHING; DROP TABLE temporary_table'
        )
        cnx.execute(insert_sql)
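A minimal usage sketch, using the symbols table from the DDL above. The user, password, and host values below are made-up stand-ins for the question's placeholders, and the two sample rows are invented for illustration:
import pandas as pd

user = password = 'postgres'   # hypothetical credentials
host = 'localhost'             # hypothetical host
df = pd.DataFrame({'symbol': ['ABC', 'XYZ']})  # made-up sample rows
export_to_sql(df, 'symbols')   # symbol_id is left out of the INSERT and filled by its serial default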

SQLite and python - can't set primary keys

I'm trying to create tables using Python, but when I inspect the data structure in SQLite, the primary keys aren't being assigned. Here's the code for one of the tables. It seems to work as intended except for the primary key part. I'm new to Python and SQLite, so I'm probably missing something very obvious, but I can't find any answers.
# Create a database and connect
conn = sql.connect('Coursework.db')
c = conn.cursor()
# Create the tables from the normalised schema
c.execute('CREATE TABLE IF NOT EXISTS room_host (room_ID integer PRIMARY KEY, host_ID integer)')
c.execute("SELECT count(name) from sqlite_master WHERE type='table' AND name='room_host'")
if c.fetchone()[0] == 1:
    c.execute("DROP TABLE room_host")
else:
    c.execute('CREATE TABLE room_host (room_ID integer PRIMARY KEY, host_ID integer)')
conn.commit()
# read data from csv
read_listings = pd.read_csv('listings.csv')
room_host = pd.DataFrame(read_listings, columns=['id', 'host_id'])
room_host.set_index('id')
room_host.to_sql("room_host", conn, if_exists='append', index=False)
c.execute("""INSERT INTO room_host (id, host_ID)
    SELECT room_host.id, room_host.host_ID
    FROM room_host
    """)
I can't reproduce the issue with the primary key; the table is created as expected when I run that SQL statement.
Other than that, the detour through pandas is not really necessary; the csv module plus .executemany() seems to me a much more straightforward way of loading data from a CSV into a table.
import csv
import sqlite3 as sql

conn = sql.connect('Coursework.db')
conn.executescript('CREATE TABLE IF NOT EXISTS room_host (room_ID integer PRIMARY KEY, host_ID integer)')
conn.commit()
with open('listings.csv', encoding='utf8', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader)  # skip the header row if the CSV has one (pd.read_csv in the question suggests it does)
    conn.executemany('INSERT INTO room_host (room_ID, host_ID) VALUES (?, ?)', reader)
conn.commit()

Database ER diagram not showing relationships even though specified

I have created a SQLite database. Even though I have included the relationship between the primary and foreign keys, when I generate the ER diagram I am not able to see the connections between them. I am using DataGrip to create the diagram. I tested other databases in DataGrip and DbVisualizer and I do not have any problems with them, only with this one.
ER diagram (screenshot not shown)
This is the script I used for creating the two tables in the database:
def create_titles_table():
    # connect to the database
    conn = sqlite3.connect("imdb.db")
    # create a cursor
    c = conn.cursor()
    print()
    print("Creating titles table...")
    c.execute(
        """CREATE TABLE IF NOT EXISTS titles
        (titleId TEXT NOT NULL, titleType TEXT,
        primaryTitle TEXT, originalTitle TEXT,
        isAdult INTEGER, startYear REAL,
        endYear REAL, runtimeMinutes REAL,
        PRIMARY KEY (titleId)
        )
        """
    )
    # commit changes
    conn.commit()
    # read the title data
    df = load_data("title.basics.tsv")
    # replace \N with nan
    df.replace("\\N", np.nan, inplace=True)
    # rename columns
    df.rename(columns={"tconst": "titleId"}, inplace=True)
    # drop the genres column
    title_df = df.drop("genres", axis=1)
    # convert the data types from str to numeric
    title_df["startYear"] = pd.to_numeric(title_df["startYear"], errors="coerce")
    title_df["endYear"] = pd.to_numeric(title_df["endYear"], errors="coerce")
    title_df["runtimeMinutes"] = pd.to_numeric(
        title_df["runtimeMinutes"], errors="coerce"
    )
    # insert the data into titles table
    title_df.to_sql("titles", conn, if_exists="replace", index=False)
    # commit changes
    conn.commit()
    # close the connection
    conn.close()
    print("Completed!")
    print()

def create_ratings_table():
    # connect to the database
    conn = sqlite3.connect("imdb.db")
    # create a cursor
    c = conn.cursor()
    print()
    print("Creating ratings table...")
    c.execute(
        """CREATE TABLE IF NOT EXISTS ratings
        (titleId TEXT NOT NULL, averageRating REAL, numVotes INTEGER,
        FOREIGN KEY (titleId) REFERENCES titles(titleId)
        )
        """
    )
    # commit changes
    conn.commit()
    # read the data
    df = load_data("title.ratings.tsv")
    df.rename(columns={"tconst": "titleId"}, inplace=True)
    # insert the data into the ratings table
    df.to_sql("ratings", conn, if_exists="replace", index=False)
    # commit changes
    conn.commit()
    # close the connection
    conn.close()
    print("Completed!")
    print()
Can anyone tell me where I am making the mistake?

How to upsert pandas DataFrame to PostgreSQL table?

I've scraped some data from web sources and stored it all in a pandas DataFrame. Now, in order to harness the powerful DB tools afforded by SQLAlchemy, I want to convert said DataFrame into a Table() object and eventually upsert all the data into a PostgreSQL table. If this is practical, what is a workable method of accomplishing this task?
Update: You can save yourself some typing by using this method.
If you are using PostgreSQL 9.5 or later you can perform the UPSERT using a temporary table and an INSERT ... ON CONFLICT statement:
import sqlalchemy as sa
# …
with engine.begin() as conn:
    # step 0.0 - create test environment
    conn.exec_driver_sql("DROP TABLE IF EXISTS main_table")
    conn.exec_driver_sql(
        "CREATE TABLE main_table (id int primary key, txt varchar(50))"
    )
    conn.exec_driver_sql(
        "INSERT INTO main_table (id, txt) VALUES (1, 'row 1 old text')"
    )
    # step 0.1 - create DataFrame to UPSERT
    df = pd.DataFrame(
        [(2, "new row 2 text"), (1, "row 1 new text")], columns=["id", "txt"]
    )
    # step 1 - create temporary table and upload DataFrame
    conn.exec_driver_sql(
        "CREATE TEMPORARY TABLE temp_table AS SELECT * FROM main_table WHERE false"
    )
    df.to_sql("temp_table", conn, index=False, if_exists="append")
    # step 2 - merge temp_table into main_table
    conn.exec_driver_sql(
        """\
        INSERT INTO main_table (id, txt)
        SELECT id, txt FROM temp_table
        ON CONFLICT (id) DO
            UPDATE SET txt = EXCLUDED.txt
        """
    )
    # step 3 - confirm results
    result = conn.exec_driver_sql("SELECT * FROM main_table ORDER BY id").all()
    print(result)  # [(1, 'row 1 new text'), (2, 'new row 2 text')]
I have needed this so many times that I ended up creating a gist for it.
The function is below; it will create the table if it is the first time persisting the dataframe, and will update the table if it already exists:
import pandas as pd
import sqlalchemy
import uuid
import os

def upsert_df(df: pd.DataFrame, table_name: str, engine: sqlalchemy.engine.Engine):
    """Implements the equivalent of pd.DataFrame.to_sql(..., if_exists='update')
    (which does not exist). Creates or updates the db records based on the
    dataframe records.
    Conflicts to determine update are based on the dataframe's index.
    This will set a unique key constraint on the table equal to the index names.
    1. Create a temp table from the dataframe
    2. Insert/update from temp table into table_name
    Returns: True if successful
    """
    # If the table does not exist, we should just use to_sql to create it
    if not engine.execute(
        f"""SELECT EXISTS (
            SELECT FROM information_schema.tables
            WHERE table_schema = 'public'
            AND table_name = '{table_name}');
            """
    ).first()[0]:
        df.to_sql(table_name, engine)
        return True

    # If it already exists...
    temp_table_name = f"temp_{uuid.uuid4().hex[:6]}"
    df.to_sql(temp_table_name, engine, index=True)

    index = list(df.index.names)
    index_sql_txt = ", ".join([f'"{i}"' for i in index])
    columns = list(df.columns)
    headers = index + columns
    headers_sql_txt = ", ".join(
        [f'"{i}"' for i in headers]
    )  # index1, index2, ..., column1, col2, ...

    # col1 = EXCLUDED.col1, col2 = EXCLUDED.col2
    update_column_stmt = ", ".join([f'"{col}" = EXCLUDED."{col}"' for col in columns])

    # For the ON CONFLICT clause, postgres requires that the columns have a unique constraint
    query_pk = f"""
    ALTER TABLE "{table_name}" DROP CONSTRAINT IF EXISTS unique_constraint_for_upsert;
    ALTER TABLE "{table_name}" ADD CONSTRAINT unique_constraint_for_upsert UNIQUE ({index_sql_txt});
    """
    engine.execute(query_pk)

    # Compose and execute upsert query
    query_upsert = f"""
    INSERT INTO "{table_name}" ({headers_sql_txt})
    SELECT {headers_sql_txt} FROM "{temp_table_name}"
    ON CONFLICT ({index_sql_txt}) DO UPDATE
    SET {update_column_stmt};
    """
    engine.execute(query_upsert)
    engine.execute(f"DROP TABLE {temp_table_name}")
    return True
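A minimal usage sketch, with a made-up example_table keyed by the DataFrame index and a placeholder connection URI. Note that the function above uses the SQLAlchemy 1.x engine.execute() API, so it needs adapting for SQLAlchemy 2.0:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/dbname")  # placeholder URI
df = pd.DataFrame({"id": [1, 2], "txt": ["first", "second"]}).set_index("id")  # made-up rows
upsert_df(df, "example_table", engine)  # creates the table on the first run, upserts on later runs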
Here is my code for a bulk insert & insert-on-conflict-update query for PostgreSQL from a pandas dataframe:
Let's say id is the unique key for both the PostgreSQL table and the pandas df, and you want to insert and update based on this id.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://username:pass@host:port/dbname')
query = text(f"""
    INSERT INTO schema.table(name, title, id)
    VALUES {','.join([str(i) for i in list(df.to_records(index=False))])}
    ON CONFLICT (id)
    DO UPDATE SET name = excluded.name,
                  title = excluded.title
""")
engine.execute(query)
Make sure that your df columns are in the same order as your table columns.
EDIT 1:
Thanks to Gord Thompson's comment, I realized that this query won't work if there is a single quote in a column value. Here is a fix for that case:
import pandas as pd
from sqlalchemy import create_engine, text

df.name = df.name.str.replace("'", "''")
df.title = df.title.str.replace("'", "''")
engine = create_engine('postgresql://username:pass@host:port/dbname')
query = text("""
    INSERT INTO author(name, title, id)
    VALUES %s
    ON CONFLICT (id)
    DO UPDATE SET name = excluded.name,
                  title = excluded.title
""" % ','.join([str(i) for i in list(df.to_records(index=False))]).replace('"', "'"))
engine.execute(query)
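As a side note, here is a minimal sketch of the same upsert written with bound parameters instead of string interpolation, which sidesteps the quoting problem entirely. It assumes the author table and the name/title/id columns from the edit above; the connection string is a placeholder:
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://username:pass@host:port/dbname')  # placeholder URI
stmt = text("""
    INSERT INTO author (name, title, id)
    VALUES (:name, :title, :id)
    ON CONFLICT (id)
    DO UPDATE SET name = EXCLUDED.name,
                  title = EXCLUDED.title
""")
with engine.begin() as conn:
    # executemany with bound parameters, so quotes inside values need no escaping
    conn.execute(stmt, df.to_dict("records"))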
Consider this function if your DataFrame and SQL table already contain the same column names and types.
Advantages:
Good if you have a long dataframe to insert (batching).
Avoids writing a long SQL statement in your code.
Fast.
from sqlalchemy import Table
from sqlalchemy.engine.base import Engine as sql_engine
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.ext.automap import automap_base
import pandas as pd

def upsert_database(list_input: pd.DataFrame, engine: sql_engine, table: str, schema: str) -> None:
    if len(list_input) == 0:
        return None
    flattened_input = list_input.to_dict('records')
    with engine.connect() as conn:
        base = automap_base()
        base.prepare(engine, reflect=True, schema=schema)
        target_table = Table(table, base.metadata,
                             autoload=True, autoload_with=engine, schema=schema)
        chunks = [flattened_input[i:i + 1000] for i in range(0, len(flattened_input), 1000)]
        for chunk in chunks:
            stmt = insert(target_table).values(chunk)
            update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
            conn.execute(stmt.on_conflict_do_update(
                constraint=f'{table}_pkey',
                set_=update_dict)
            )
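A minimal usage sketch, assuming a hypothetical orders table in the public schema whose primary-key constraint follows the default orders_pkey naming that the function relies on. Note that the function above uses SQLAlchemy 1.x-style reflection (autoload=True, prepare(reflect=True)) and implicit commits, so it needs adjustments for SQLAlchemy 2.0:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host:5432/dbname")  # placeholder URI
df = pd.DataFrame({"id": [1, 2], "status": ["new", "shipped"]})  # made-up rows; id is the primary key
upsert_database(df, engine, table="orders", schema="public")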
If you already have a pandas dataframe, you can use df.to_sql to push the data directly through SQLAlchemy:
from sqlalchemy import create_engine

# create a connection from a Postgres URI
cnxn = create_engine("postgresql+psycopg2://username:password@host:port/database")
# write the dataframe to the database
df.to_sql("my_table", con=cnxn, schema="myschema")

Inserting data into an SQL cell, MySQL DB

I have created a MySQL DB using XAMPP and created a table using Python:
self.sql = """CREATE TABLE Report_SV (
    Date VARCHAR(160),
    StartTime VARCHAR(160),
    EndTime VARCHAR(160),
    SystemName VARCHAR(160),
    TestBench VARCHAR(160),
    DomainName VARCHAR(100),
    SourceFile CHAR(200),
    Build_Info VARCHAR(200),
    HLF VARCHAR(200),
    TestcaseID VARCHAR(60),
    TestCaseName VARCHAR(100),
    Test_Step TEXT(1000),
    Step_Result VARCHAR(500))"""
While inserting data into one of its cells, the data is not shown properly and only truncated data is visible.
sql = "INSERT INTO report_sv (Test_Step) VALUES ('FES.statusFESVehicleDynamicSwitchHMI.requestSwitchDrivingExperienceMMIDisplay=1')"
What is the proper way to insert the complete data?
