How to upsert pandas DataFrame to MySQL with SQLAlchemy - python

I'm pushing data from a DataFrame into MySQL. Right now it only adds new data to the table (appending). This works perfectly, but I also want my code to check whether a record already exists and, if so, update it. So I need it to append + update. I really don't know where to start; has someone tried this before?
This is my code:
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user="root",
                               pw="*****",
                               db="my_db"))

my_df.to_sql('my_table', con=engine, if_exists='append')

You can use the following solution on the DB side:
First: create a table to receive the data inserted from Pandas (let's call it test):
CREATE TABLE `test` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(100) NOT NULL,
    `capacity` INT(11) NOT NULL,
    PRIMARY KEY (`id`)
);
Second: create a table for the resulting data (let's call it cumulative_test) with exactly the same structure as test:
CREATE TABLE `cumulative_test` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `name` VARCHAR(100) NOT NULL,
    `capacity` INT(11) NOT NULL,
    PRIMARY KEY (`id`)
);
Third: set up a trigger so that each insert into the test table inserts or updates a record in the second table, like:
DELIMITER $$

CREATE
    /*!50017 DEFINER = 'root'@'localhost' */
    TRIGGER `before_test_insert` BEFORE INSERT ON `test`
    FOR EACH ROW BEGIN
        DECLARE _id INT;

        SELECT id INTO _id
        FROM `cumulative_test` WHERE `cumulative_test`.`name` = NEW.name;

        IF _id IS NOT NULL THEN
            UPDATE cumulative_test
            SET `cumulative_test`.`capacity` = `cumulative_test`.`capacity` + NEW.capacity
            WHERE `cumulative_test`.`id` = _id;  -- restrict the update to the matched row
        ELSE
            INSERT INTO `cumulative_test` (`name`, `capacity`)
            VALUES (NEW.name, NEW.capacity);
        END IF;
    END;
$$

DELIMITER ;
So you keep inserting values into the test table and get the accumulated results in the second table. The logic inside the trigger can be adjusted to your needs.
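With this trigger-based approach nothing changes on the Python side: you keep appending to the test table with to_sql and let the trigger maintain cumulative_test. A minimal sketch, assuming the tables and trigger above already exist (the connection details are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# connection details are placeholders - use your own credentials
engine = create_engine("mysql+pymysql://root:*****@localhost/my_db")

# Appending to `test` fires the BEFORE INSERT trigger, which inserts
# or updates the matching row in `cumulative_test`.
df = pd.DataFrame({"name": ["tank A", "tank B"], "capacity": [10, 25]})
df.to_sql("test", con=engine, if_exists="append", index=False)

# The accumulated values are read back from `cumulative_test`.
print(pd.read_sql("SELECT * FROM cumulative_test", con=engine))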

Similar to the approach used for PostgreSQL here, you can use INSERT … ON DUPLICATE KEY UPDATE in MySQL:
import pandas as pd
import sqlalchemy as sa

# `engine` is assumed to be a SQLAlchemy Engine connected to MySQL

with engine.begin() as conn:
    # step 0.0 - create test environment
    conn.execute(sa.text("DROP TABLE IF EXISTS main_table"))
    conn.execute(
        sa.text(
            "CREATE TABLE main_table (id int primary key, txt varchar(50))"
        )
    )
    conn.execute(
        sa.text(
            "INSERT INTO main_table (id, txt) VALUES (1, 'row 1 old text')"
        )
    )

    # step 0.1 - create DataFrame to UPSERT
    df = pd.DataFrame(
        [(2, "new row 2 text"), (1, "row 1 new text")], columns=["id", "txt"]
    )

    # step 1 - create temporary table and upload DataFrame
    conn.execute(
        sa.text(
            "CREATE TEMPORARY TABLE temp_table (id int primary key, txt varchar(50))"
        )
    )
    df.to_sql("temp_table", conn, index=False, if_exists="append")

    # step 2 - merge temp_table into main_table
    conn.execute(
        sa.text(
            """\
            INSERT INTO main_table (id, txt)
            SELECT id, txt FROM temp_table
            ON DUPLICATE KEY UPDATE txt = VALUES(txt)
            """
        )
    )

    # step 3 - confirm results
    result = conn.execute(
        sa.text("SELECT * FROM main_table ORDER BY id")
    ).fetchall()
    print(result)  # [(1, 'row 1 new text'), (2, 'new row 2 text')]
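If you need this in more than one place, the staging-table-plus-merge steps can be wrapped in a small helper. The function below is my own sketch of the same pattern, not part of pandas or SQLAlchemy; it assumes the target table already exists with a primary or unique key and that the DataFrame columns match the table columns:

import pandas as pd
import sqlalchemy as sa

def upsert_df_mysql(df, table, engine, staging="tmp_upsert_stage"):
    """Upsert df into `table` by staging it and running INSERT ... ON DUPLICATE KEY UPDATE."""
    cols = ", ".join(f"`{c}`" for c in df.columns)
    updates = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in df.columns)
    with engine.begin() as conn:
        # Stage the DataFrame in a throwaway table (column types inferred by pandas).
        df.to_sql(staging, conn, index=False, if_exists="replace")
        # Merge staged rows into the target; rows that collide on the key are updated.
        conn.execute(
            sa.text(
                f"INSERT INTO `{table}` ({cols}) "
                f"SELECT {cols} FROM `{staging}` "
                f"ON DUPLICATE KEY UPDATE {updates}"
            )
        )
        conn.execute(sa.text(f"DROP TABLE `{staging}`"))

# e.g. upsert_df_mysql(my_df, "my_table", engine)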

Related

How to insert values into a postgresql database with serial id using sqlalchemy

I have a function that I use to update tables in PostgreSQL. It works great to avoid duplicate insertions by creating a temp table and dropping it upon completion. However, I have a few tables with serial ids and I have to pass the serial id in a column. Otherwise, I get an error that the keys are missing. How can I insert values in those tables and have the serial key get assigned automatically? I would prefer to modify the function below if possible.
def export_to_sql(df, table_name):
    from sqlalchemy import create_engine
    engine = create_engine(f'postgresql://{user}:{password}@{host}:5432/{user}')
    df.to_sql(con=engine,
              name='temporary_table',
              if_exists='append',
              index=False,
              method='multi')
    with engine.begin() as cnx:
        insert_sql = f'INSERT INTO {table_name} (SELECT * FROM temporary_table) ON CONFLICT DO NOTHING; DROP TABLE temporary_table'
        cnx.execute(insert_sql)
code used to create the tables
CREATE TABLE symbols
(
    symbol_id serial NOT NULL,
    symbol varchar(50) NOT NULL,
    CONSTRAINT PK_symbols PRIMARY KEY ( symbol_id )
);

CREATE TABLE tweet_symols(
    tweet_id varchar(50) REFERENCES tweets,
    symbol_id int REFERENCES symbols,
    PRIMARY KEY (tweet_id, symbol_id),
    UNIQUE (tweet_id, symbol_id)
);

CREATE TABLE hashtags
(
    hashtag_id serial NOT NULL,
    hashtag varchar(140) NOT NULL,
    CONSTRAINT PK_hashtags PRIMARY KEY ( hashtag_id )
);

CREATE TABLE tweet_hashtags
(
    tweet_id varchar(50) NOT NULL,
    hashtag_id integer NOT NULL,
    CONSTRAINT FK_344 FOREIGN KEY ( tweet_id ) REFERENCES tweets ( tweet_id )
);

CREATE INDEX fkIdx_345 ON tweet_hashtags
(
    tweet_id
);
The INSERT statement does not define the target columns, so PostgreSQL will attempt to insert values into a column that was defined as SERIAL.
We can work around this by providing a list of target columns, omitting the serial types. To do this we use SQLAlchemy to fetch the metadata of the table that we are inserting into from the database, then make a list of target columns. SQLAlchemy doesn't tell us if a column was created using SERIAL, but we will assume that it is if it is a primary key and is set to autoincrement. Primary key columns defined with GENERATED ... AS IDENTITY will also be filtered out - this is probably desirable as they behave in the same way as SERIAL columns.
import sqlalchemy as sa

def export_to_sql(df, table_name):
    engine = sa.create_engine(f'postgresql://{user}:{password}@{host}:5432/{user}')
    df.to_sql(con=engine,
              name='temporary_table',
              if_exists='append',
              index=False,
              method='multi')
    # Fetch table metadata from the database
    table = sa.Table(table_name, sa.MetaData(), autoload_with=engine)
    # Get the names of columns to be inserted,
    # assuming auto-incrementing PKs are serial types
    column_names = ','.join(
        [f'"{c.name}"' for c in table.columns
         if not (c.primary_key and c.autoincrement)]
    )
    with engine.begin() as cnx:
        insert_sql = sa.text(
            f'INSERT INTO {table_name} ({column_names}) (SELECT * FROM temporary_table) ON CONFLICT DO NOTHING; DROP TABLE temporary_table'
        )
        cnx.execute(insert_sql)
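As a hypothetical usage example (the symbol values are made up; the symbols table comes from the question's schema), the serial symbol_id is now left out of the INSERT and assigned by PostgreSQL:

import pandas as pd

# `symbol_id` is deliberately absent so the SERIAL column assigns itself.
df_symbols = pd.DataFrame({'symbol': ['AAPL', 'MSFT', 'GOOG']})
export_to_sql(df_symbols, 'symbols')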

Bulk Saving and Updating while returning IDs

So I'm using SQLAlchemy for a project I'm working on. I've got an issue where I will eventually have thousands of records that need to be saved every hour. These records may be inserted or updated. I've been using bulk_save_objects for this and it's worked great. However, now I have to introduce a history for these saved records, which means I need the IDs returned so I can link the entries to an entry in a history table. I know about using return_defaults, and that works. However, it makes bulk_save_objects insert and update one entry at a time instead of in bulk, which defeats the purpose. Is there another option where I can bulk insert and update at the same time, but retain the IDs?
The desired result can be achieved using a technique similar to the one described in the answer here by uploading the rows to a temporary table and then performing an UPDATE followed by an INSERT that returns the inserted ID values. For SQL Server, that would be an OUTPUT clause on the INSERT statement:
main_table = "team"
# <set up test environment>
with engine.begin() as conn:
conn.execute(sa.text(f"DROP TABLE IF EXISTS [{main_table}]"))
conn.execute(
sa.text(
f"""
CREATE TABLE [dbo].[{main_table}](
[id] [int] IDENTITY(1,1) NOT NULL,
[prov] [varchar](2) NOT NULL,
[city] [varchar](50) NOT NULL,
[name] [varchar](50) NOT NULL,
[comments] [varchar](max) NULL,
CONSTRAINT [PK_team] PRIMARY KEY CLUSTERED
(
[id] ASC
)
)
"""
)
)
conn.execute(
sa.text(
f"""
CREATE UNIQUE NONCLUSTERED INDEX [UX_team_prov_city] ON [dbo].[{main_table}]
(
[prov] ASC,
[city] ASC
)
"""
)
)
conn.execute(
sa.text(
f"""
INSERT INTO [{main_table}] ([prov], [city], [name])
VALUES ('AB', 'Calgary', 'Flames')
"""
)
)
# <data for upsert>
df = pd.DataFrame(
[
("AB", "Calgary", "Flames", "hard-working, handsome lads"),
("AB", "Edmonton", "Oilers", "ruffians and scalawags"),
],
columns=["prov", "city", "name", "comments"],
)
# <perform upsert, returning IDs>
temp_table = "#so65525098"
with engine.begin() as conn:
df.to_sql(temp_table, conn, index=False, if_exists="replace")
conn.execute(
sa.text(
f"""
UPDATE main SET main.name = temp.name,
main.comments = temp.comments
FROM [{main_table}] main INNER JOIN [{temp_table}] temp
ON main.prov = temp.prov AND main.city = temp.city
"""
)
)
inserted = conn.execute(
sa.text(
f"""
INSERT INTO [{main_table}] (prov, city, name, comments)
OUTPUT INSERTED.prov, INSERTED.city, INSERTED.id
SELECT prov, city, name, comments FROM [{temp_table}] temp
WHERE NOT EXISTS (
SELECT * FROM [{main_table}] main
WHERE main.prov = temp.prov AND main.city = temp.city
)
"""
)
).fetchall()
print(inserted)
"""console output:
[('AB', 'Edmonton', 2)]
"""
# <check results>
with engine.begin() as conn:
pprint(conn.execute(sa.text(f"SELECT * FROM {main_table}")).fetchall())
"""console output:
[(1, 'AB', 'Calgary', 'Flames', 'hard-working, handsome lads'),
(2, 'AB', 'Edmonton', 'Oilers', 'ruffians and scalawags')]
"""

I can't create table by getting column names from a list?(postgresql/psycopg2)

I have prepared two sample lists below. My goal is to create a table from these two lists in PostgreSQL. The id will be a bigserial primary key, but I keep getting errors. How do you think I can do that?
My example list and code:
my_column_name = ['id', 'first name', 'surname', 'age']
data = [{'Jimmy', 'wallece', 17}]

connection = psycopg2.connect(user="postgres",
                              password="Sabcanuy.1264",
                              host="127.0.0.1",
                              port="5432",
                              database="postgres")
cursor = connection.cursor()

create_table_query = '''CREATE TABLE unit_category_report (ID BIGSERIAL PRIMARY KEY ,
                        my_columne_name); '''
A plain SQL string can't reference Python variables by name; you have to build the column list into the string yourself.
I'm not 100% sure this will work, but you can try:

my_column_name = ['id', 'first_name', 'surname', 'age']
create_table_query = '''CREATE TABLE unit_category_report (ID BIGSERIAL PRIMARY KEY , %s); ''' % (', '.join(my_column_name))

Or...

create_table_query = '''CREATE TABLE unit_category_report (ID BIGSERIAL PRIMARY KEY , {0}); '''.format(', '.join(my_column_name))

You may have to switch from the triple single quotes to double quotes.
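Note that even with the names joined in, PostgreSQL still needs a data type for every column, so the statements above will not run as-is. A more robust sketch uses psycopg2.sql to quote the identifiers safely; the column types below are assumptions, since the question doesn't specify them:

import psycopg2
from psycopg2 import sql

# Assumed types for the sample columns; adjust them to the real data.
columns = [('first name', 'varchar(50)'), ('surname', 'varchar(50)'), ('age', 'int')]

connection = psycopg2.connect(user="postgres",
                              password="*****",
                              host="127.0.0.1",
                              port="5432",
                              database="postgres")
cursor = connection.cursor()

# Build "name type" pairs with properly quoted identifiers.
column_defs = sql.SQL(', ').join(
    sql.SQL('{} {}').format(sql.Identifier(name), sql.SQL(ctype))
    for name, ctype in columns
)
create_table_query = sql.SQL(
    'CREATE TABLE unit_category_report (id BIGSERIAL PRIMARY KEY, {})'
).format(column_defs)

cursor.execute(create_table_query)
connection.commit()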

How to import csv data with parent/child (category-subcategory) hierarchy to MySQL using Python

I am importing a csv file containing a parent/child (category-subcategory) hierarchy to MySQL, using Python's MySQLdb module. Here is an example csv file:
vendor,category,subcategory,product_name,product_model,product_price
First vendor,category1,subcategory1,product1,model1,100
First vendor,category1,subcategory2,product2,model2,110
First vendor,category2,subcategory3,product3,model3,130
First vendor,category2,subcategory4,product5,model7,190
In MySQL I want to use a category table with a hierarchical structure, like this:
CREATE TABLE IF NOT EXISTS `category` (
    `category_id` int(11) NOT NULL AUTO_INCREMENT,
    `parent_id` int(11) NOT NULL DEFAULT '0',
    `status` tinyint(1) NOT NULL,
    PRIMARY KEY (`category_id`),
    KEY `parent_id` (`parent_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
My question is: How do I determine the parent_id in this table?
Here is the Python script I have so far:
import MySQLdb
import csv

con = MySQLdb.connect('localhost', 'root', '', 'testdb', use_unicode=True, charset='utf8')

with con:
    cur = con.cursor()
    csv_data = csv.reader(file('test.csv'))
    csv_data.next()
    for row in csv_data:
        cur.execute("SELECT manufacturer_id FROM manufacturer WHERE name=%s", [row[0]])
        res = cur.fetchall()
        if res:
            vendor_id = res[0][0]
        else:
            cur.execute("INSERT INTO manufacturer (name) VALUES (%s)", (row[0],))
            vendor_id = cur.lastrowid

        cur.execute("SELECT category_id FROM category_description WHERE name=%s", [row[2]])
        res = cur.fetchall()
        if res:
            category_id = res[0][0]
        else:
            # What parent_id should be inserted here?
            cur.execute("INSERT INTO category (`status`, `parent_id`) VALUES (%s,%s)", (1,))
            category_id = cur.lastrowid
            cur.execute("INSERT INTO category_description (category_id, name) VALUES (%s,%s)", (category_id, row[2],))

        cur.execute("INSERT INTO product (model, manufacturer_id, price,) VALUES (%s, %s, %s)", (row[4], `vendor_id`, row[8],))
        product_id = cur.lastrowid
        cur.execute("INSERT INTO product_to_category (product_id, category_id) VALUES (%s, %s)", (product_id, category_id,))

    cur.commit()
Here are the definitions of the other tables used in my example:
CREATE TABLE IF NOT EXISTS `manufacturer` (
    `manufacturer_id` int(11) NOT NULL AUTO_INCREMENT,
    `name` varchar(64) NOT NULL,
    PRIMARY KEY (`manufacturer_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

CREATE TABLE IF NOT EXISTS `category_description` (
    `category_id` int(11) NOT NULL,
    `name` varchar(255) NOT NULL,
    PRIMARY KEY (`category_id`,`language_id`),
    KEY `name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

CREATE TABLE IF NOT EXISTS `product` (
    `product_id` int(11) NOT NULL AUTO_INCREMENT,
    `model` varchar(64) NOT NULL,
    `manufacturer_id` int(11) NOT NULL,
    `price` decimal(15,4) NOT NULL DEFAULT '0.0000',
    PRIMARY KEY (`product_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;

CREATE TABLE IF NOT EXISTS `product_to_category` (
    `product_id` int(11) NOT NULL,
    `category_id` int(11) NOT NULL,
    PRIMARY KEY (`product_id`,`category_id`),
    KEY `category_id` (`category_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
In a hierarchical table structure, any member at the top of its hierarchy has no parents. I would probably show this with a NULL parent ID but based on the way you've defined your category table, it looks like you want to show this by giving the value 0 for the parent ID.
Since you have fixed-depth hierarchies with only two levels (category and subcategory), the task is relatively simple. For each row of the CSV data, you need to:
Check whether the parent (row[1]) is in the table; if not, insert it with a parent ID of 0.
Get the category_id of the parent from step 1.
Check whether the child (row[2]) is in the table; if not, insert it with a parent ID equal to the category_id from step 2.
In your example code, you never access the parent (row[1]); you need to insert this into the table for it to have an ID that the child can refer to. If you've already inserted the parents before this point, you should probably still check to make sure it's there.
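A minimal sketch of that get-or-create logic for the category rows, reusing the tables from the question (the helper name is mine, and the cursor and row variables come from the loop in the question):

def get_or_create_category(cur, name, parent_id=0):
    """Return the category_id for `name` under `parent_id`, inserting it if missing."""
    cur.execute(
        "SELECT c.category_id FROM category c "
        "JOIN category_description cd ON cd.category_id = c.category_id "
        "WHERE cd.name = %s AND c.parent_id = %s",
        (name, parent_id),
    )
    res = cur.fetchone()
    if res:
        return res[0]
    # Not found: create the category row, then its description.
    cur.execute("INSERT INTO category (`status`, `parent_id`) VALUES (%s, %s)", (1, parent_id))
    category_id = cur.lastrowid
    cur.execute("INSERT INTO category_description (category_id, name) VALUES (%s, %s)",
                (category_id, name))
    return category_id

# Inside the CSV loop: resolve the parent first, then the child.
parent_id = get_or_create_category(cur, row[1])                # top-level category (parent_id = 0)
category_id = get_or_create_category(cur, row[2], parent_id)   # subcategory under that parent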
You have some other problems here:
The PK of your category_description table is defined on a column that you forgot to define in the table (language_id).
You should really be using InnoDB in this physical model so that you can enforce foreign key constraints in category_description, product and product_to_category.
In your example, cur.commit() is going to throw an exception – commit() is a method of the Connection object in MySQLdb, not the cursor. Of course, COMMIT isn't implemented for MyISAM tables anyway, so you could also avoid the exception by removing the line entirely.
Referencing row[8] is also going to throw an exception, according to the CSV data you've shown us. (This is a good example of why you should test your MCVE to make sure it works!)
If you do switch to InnoDB – and you probably should – you can use with con as cur: to get a cursor that commits itself when you exit the with block. This saves a couple lines of code and lets you manage transactions without micromanaging the connection object.

What's wrong with my python + sqlite3 code in creating tables?

I'm trying to create a database with several tables connecting to each other using foreign keys using sqlite3, and I'm writing in python.
Here is my code:
import sqlite3

db = sqlite3.connect("PHLC.db")
cur = db.cursor()

# ############################
# delete original table if exist
# drop from the end (foreign key issue)
cur.execute("drop table if exists measurement")
cur.execute("drop table if exists mouse")
cur.execute("drop table if exists drug")
cur.execute("drop table if exists batch")
cur.execute("drop table if exists phlc")

# ############################
# create table
# ############################
# 1. phlc
cur.execute(
    """
    CREATE TABLE phlc (
        phlc_id INTEGER NOT NULL PRIMARY KEY,
        cancer VARCHAR(30) NOT NULL,
        histology VARCHAR(60) NOT NULL
    )
    """
)

# 2. batch
cur.execute(
    """
    CREATE TABLE batch (
        batch_id INTEGER PRIMARY KEY AUTOINCREMENT,
        phlc_id INTEGER NOT NULL,
        FOREIGN KEY (phlc_id) REFERENCES phlc (phlc_id),
        batch_number INTEGER NOT NULL
    )
    """
)

# 3. drug
cur.execute(
    """
    CREATE TABLE drug (
        drug_id INTEGER PRIMARY KEY AUTOINCREMENT,
        drug_name VARCHAR(30) NOT NULL,
        batch_id INTEGER NOT NULL,
        FOREIGN KEY (batch_id) REFERENCES batch (batch_id)
    )
    """
)

# 4. mouse
cur.execute(
    """
    CREATE TABLE mouse (
        mouse_id INTEGER PRIMARY KEY AUTOINCREMENT,
        drug_id INTEGER NOT NULL,
        FOREIGN KEY (drug_id) REFERENCES drug (drug_id)
    )
    """
)

# 5. measurement
cur.execute(
    """
    CREATE TABLE measurement (
        measurement_index INTEGER PRIMARY KEY AUTOINCREMENT,
        mouse_id INTEGER NOT NULL,
        FOREIGN KEY (mouse_id) REFERENCES mouse (mouse_id),
        day INTEGER NOT NULL,
        tumor_volume FLOAT NOT NULL,
        comment VARCHAR(255) NULL
    )
    """
)

db.commit()
db.close()
The error I'm getting is at the batch table:
sqlite3.OperationalError: near "batch_number": syntax error
Can someone point out the problem with the code? (It worked fine with MySQL..)
According to the documentation, any table constraints must come after all column definitions:
CREATE TABLE batch (
    batch_id INTEGER PRIMARY KEY AUTOINCREMENT,
    phlc_id INTEGER NOT NULL,
    batch_number INTEGER NOT NULL,
    FOREIGN KEY (phlc_id) REFERENCES phlc (phlc_id)
)
Alternatively, make the foreign key declaration a column constraint:
CREATE TABLE batch (
    batch_id INTEGER PRIMARY KEY AUTOINCREMENT,
    phlc_id INTEGER NOT NULL REFERENCES phlc (phlc_id),
    batch_number INTEGER NOT NULL
)
