Python - Bulk Select then Insert from one DB to another

I'm looking for some help on how to do this in Python using sqlite3.
Basically I have a process which downloads a DB (temp) and then needs to insert its records into a 2nd, identical DB (the main DB), and at the same time ignore/bypass any possible duplicate-key errors.
I was thinking of two scenarios but am unsure how to best do this in Python
Option 1:
create 2 connections and cursor objects, 1 to each DB
select from DB 1 eg:
dbcur.execute('SELECT * FROM table1')
rows = dbcur.fetchall()
insert them into DB 2:
dbcur.executemany('INSERT INTO table1 VALUES (?, ?)', rows)
dbcon.commit()
This is roughly what I'm attempting, but I'm not sure it's the proper way to do it :)
Option 2 (which I would prefer, but not sure how to do):
SELECT and INSERT in 1 statement
Also, I have 4 tables within the DBs, each with varying columns. Can I skip naming the columns in the INSERT statement?
As far as the duplicate keys go, I have read I can use 'ON DUPLICATE KEY' to handle them,
eg.
INSERT INTO table1 VALUES (:column1, :column2) ON DUPLICATE KEY UPDATE set column1=column1

You can ATTACH two databases to the same connection with code like this:
import sqlite3
connection = sqlite3.connect('/path/to/temp.sqlite')
cursor = connection.cursor()
cursor.execute('ATTACH "/path/to/main.sqlite" AS master')
There is no ON DUPLICATE KEY syntax in sqlite as there is in MySQL. This SO question contains alternatives.
So to do the bulk insert in one sql statement, you could use something like
cursor.execute('INSERT OR REPLACE INTO master.table1 SELECT * FROM table1')
See this page for information about REPLACE and other ON CONFLICT options.
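Since the goal here is to skip duplicates rather than overwrite them, a minimal sketch of the whole attach-and-copy flow might look like this (the paths are placeholders, and only table1 is named in the question; table2 to table4 are assumed names for the other tables):
import sqlite3

# Open the downloaded (temp) database and attach the main one on the same connection.
connection = sqlite3.connect('/path/to/temp.sqlite')
cursor = connection.cursor()
cursor.execute('ATTACH "/path/to/main.sqlite" AS master')

# One statement per table; OR IGNORE silently skips rows whose
# primary/unique key already exists in the main database.
for table in ('table1', 'table2', 'table3', 'table4'):
    cursor.execute('INSERT OR IGNORE INTO master.{0} SELECT * FROM {0}'.format(table))

connection.commit()
connection.close()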

The code for option 1 looks correct.
If you need filtering to bypass duplicate keys, do the insert into a temporary table and then use SQL commands to eliminate duplicates and merge them into the target table.
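A minimal sketch of that staging-table idea, assuming a single two-column table named table1 (the paths, names, and example rows are placeholders):
import sqlite3

rows = [(1, 'a'), (2, 'b')]  # rows previously fetched from the downloaded DB

con = sqlite3.connect('/path/to/main.sqlite')
cur = con.cursor()

# Stage the incoming rows in a temporary table with the same layout as table1.
cur.execute('CREATE TEMPORARY TABLE staging AS SELECT * FROM table1 WHERE 0')
cur.executemany('INSERT INTO staging VALUES (?, ?)', rows)

# Merge: rows whose key already exists in table1 are skipped.
cur.execute('INSERT OR IGNORE INTO table1 SELECT * FROM staging')
con.commit()
con.close()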


Upsert / merge tables in SQLite

I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of going through the UPSERT command, why don't you create your own algorithm that finds values and replaces them if the date & time is found, and otherwise inserts a new row? Check out the code I wrote for you. Let me know if you are still confused. You can even do this for hundreds of tables just by replacing the table name in the algorithm with a variable and looping it over the whole list of your table names.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # your CSV data path

def manual_upsert():
    con = sqlite3.connect("my_database.db")  # path to your SQLite database
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")  # read the existing table
    data = cur.fetchall()

    old_data_list = []  # collection of all dates already in the database table
    for line in data:
        old_data_list.append(line[0])  # I suppose your date column is at index 0

    for new_data in csv_data.itertuples(index=False):
        if new_data[0] in old_data_list:
            # Date already exists: update the remaining columns of that row.
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # Date not found: insert a new row.
            cur.execute("INSERT INTO my_CSV_data VALUES (?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))

    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite, but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Wrap the statements in a transaction and they will be executed in bulk.
As the existence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
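For what it's worth, here is a minimal sketch of that UPSERT syntax driven from Python's sqlite3 in a single transaction; it assumes SQLite 3.24+ and uses placeholder names (my_table, event_time, col1, col2), with event_time as the primary key:
import sqlite3

# New data: (event_time, col1, col2); event_time is the primary key.
new_rows = [
    ('2020-01-01 00:00:00', 1.0, 2.0),
    ('2020-01-01 00:05:00', 3.0, 4.0),
]

con = sqlite3.connect('my_database.db')  # placeholder path
with con:  # one transaction around the whole executemany
    con.executemany(
        "INSERT INTO my_table (event_time, col1, col2) VALUES (?, ?, ?) "
        "ON CONFLICT(event_time) DO UPDATE SET col1 = excluded.col1, col2 = excluded.col2",
        new_rows)
con.close()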

Copy row from Cassandra database and then insert it using Python

I'm using the DataStax Python Driver for Apache Cassandra.
I want to read 100 rows from the database and then insert them again into the database after changing one value. I do not want to lose the previous records.
I know how to get my rows:
rows = session.execute('SELECT * FROM columnfamily LIMIT 100;')
for myrecord in rows:
    print(myrecord.timestamp)
I know how to insert new rows into database:
stmt = session.prepare('''
INSERT INTO columnfamily (rowkey, qualifier, info, act_date, log_time)
VALUES (?, ?, ?, ?, ?)
IF NOT EXISTS
''')
results = session.execute(stmt, [arg1, arg2, ...])
My problems are that:
I do not know how to change only one value in a row.
I don't know how to insert rows into the database without spelling out the CQL by hand. My column family has more than 150 columns, and writing all their names in the query does not seem like the best idea.
To conclude:
Is there a way to get the rows, modify one value in each of them, and then insert these rows into the database without hand-writing the full CQL?
First, you need to select only the needed columns from Cassandra - it will be faster to transfer the data. You need to include all the columns of the primary key plus the column that you want to change.
After you get the data, you can use the UPDATE command to update only the necessary column (example from the documentation):
UPDATE cycling.cyclist_name
SET comments = 'Rides hard, gets along with others, a real winner'
WHERE id = fb372533-eb95-4bb4-8685-6ef61e994caa;
You can also use a prepared statement to make it more performant.
But be careful: UPDATE and INSERT in CQL are really UPSERTs, so if you change columns that are part of the primary key, it will create a new entry.
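A rough sketch of that flow with the DataStax driver, assuming rowkey alone is the primary key and info is the column being changed (the contact point, keyspace, and names are placeholders):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('mykeyspace')   # placeholder keyspace

# Select only the primary key plus the column you intend to change.
rows = session.execute('SELECT rowkey, info FROM columnfamily LIMIT 100')

# Prepared UPDATE touching a single column; remember UPDATE is an upsert in CQL.
update_stmt = session.prepare('UPDATE columnfamily SET info = ? WHERE rowkey = ?')

for row in rows:
    new_info = row.info + '-changed'  # whatever modification you need
    session.execute(update_stmt, (new_info, row.rowkey))

cluster.shutdown()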

Bulk update Postgres column from python dataframe

I am using the Python code below to update a Postgres DB column value based on id. This loop has to run for thousands of records and it is taking a long time.
Is there a way I can pass an array of dataframe values instead of looping over each row?
for i in range(0, len(df)):
    QUERY = """ UPDATE "Table" SET "value"='%s' WHERE "Table"."id"='%s'
    """ % (df['value'][i], df['id'][i])
    cur.execute(QUERY)
conn.commit()
It depends on the library you use to communicate with PostgreSQL, but bulk loads are usually much faster via the COPY FROM command.
If you use psycopg2, it is as simple as the following:
cursor.copy_from(io.StringIO(string_variable), "destination_table", columns=('id', 'value'))
where string_variable is a tab- and newline-delimited dataset like 1\tvalue1\n2\tvalue2\n.
To achieve a performant bulk update I would do:
Create a temporary table with matching columns, e.g. CREATE TEMPORARY TABLE tmp_table (id integer, value text);
Insert records with copy_from;
Update the destination table with a single query: UPDATE destination_table SET value = t.value FROM tmp_table t WHERE destination_table.id = t.id, or any other preferred syntax. A rough end-to-end sketch follows below.
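Putting the three steps together, a rough psycopg2 sketch could look like this (the DSN is a placeholder, df is the dataframe from the question, and the temp-table column types are assumed):
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
cur = conn.cursor()

# 1. Temporary table mirroring the columns being updated.
cur.execute("CREATE TEMPORARY TABLE tmp_table (id integer, value text)")

# 2. Bulk-load the dataframe into it via COPY (tab/newline delimited).
buffer = io.StringIO()
for row_id, value in zip(df['id'], df['value']):
    buffer.write("%s\t%s\n" % (row_id, value))
buffer.seek(0)
cur.copy_from(buffer, 'tmp_table', columns=('id', 'value'))

# 3. One set-based UPDATE against the real table.
cur.execute('UPDATE "Table" SET "value" = t.value FROM tmp_table t WHERE "Table"."id" = t.id')
conn.commit()
conn.close()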

how to write sql to update some field given only one record in the target table

I have a table named test in a MySQL database.
There are some fields in the test table, say, name.
However, the table only ever contains 0 or 1 records.
When a new record, say name = 'fox', comes in, I'd like to update the targeted field of the test table.
I use Python to talk to MySQL, and my question is how to write the SQL.
PS: I tried to avoid using a WHERE expression, but failed.
Suppose I've got the connection to the db, like the following:
conn = MySQLdb.connect(host=myhost, ...)
What you need here is a query that performs a merge-type operation on your data. Algorithmically:
When record exists
    do Update
Else
    do Insert
You can go through this article to get a fair idea on doing things in this situation:
http://www.xaprb.com/blog/2006/06/17/3-ways-to-write-upsert-and-merge-queries-in-mysql/
What I personally recommend is INSERT ... ON DUPLICATE KEY UPDATE.
In your scenario, something like:
INSERT INTO test (name)
VALUES ('fox')
ON DUPLICATE KEY UPDATE
name = 'fox';
Using this kind of query you can handle the situation in a single shot.
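Driven from Python with MySQLdb, and assuming name carries a unique key so that ON DUPLICATE KEY can fire, it could look roughly like this:
import MySQLdb

conn = MySQLdb.connect(host='myhost', user='myuser', passwd='mypasswd', db='mydb')  # placeholders
cur = conn.cursor()

# One round trip: insert the record, or update the existing one if the key clashes.
cur.execute(
    "INSERT INTO test (name) VALUES (%s) ON DUPLICATE KEY UPDATE name = VALUES(name)",
    ('fox',))
conn.commit()
conn.close()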

Efficient insert of multiple rows with SQLAlchemy/SQLite3 when duplicate entries exist

I'm inserting multiple rows into an SQLite3 table using SQLAlchemy, and frequently the entries are already in the table. It is very slow to insert the rows one at a time, and catch the exception and continue if the row already exists. Is there an efficient way to do this? If the row already exists, I'd like to do nothing.
You can use an SQL statement
INSERT OR IGNORE INTO ... etc. ...
to simply ignore the insert if it is a duplicate. Learn about the IGNORE conflict clause here
Perhaps you can use OR IGNORE as a prefix in your SQLAlchemy Insert -- the documentation for how to place OR IGNORE between INSERT and INTO in your SQL statement is here
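A minimal sketch of that idea with SQLAlchemy Core's prefix_with(), which splices OR IGNORE into the generated INSERT (the table definition and rows here are placeholders):
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine('sqlite:///my_database.db')
metadata = MetaData()
my_table = Table('my_table', metadata,
                 Column('id', Integer, primary_key=True),
                 Column('name', String))
metadata.create_all(engine)

# INSERT OR IGNORE INTO my_table ...: duplicate keys are skipped in one executemany.
stmt = my_table.insert().prefix_with('OR IGNORE')
rows = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]

with engine.begin() as conn:
    conn.execute(stmt, rows)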
If you are happy to run 'native' sqlite SQL you can just do:
REPLACE INTO my_table(id, col2, ..) VALUES (1, 'value', ...);
REPLACE INTO my_table(...);
...
COMMIT
However, this won't be portable across all DBMSs, which is why it's not found in the generic SQLAlchemy dialect.
Another thing you could do is use the SQLAlchemy ORM: define a 'domain model', a Python class which maps to your database table. Then you can create many instances of your domain class, call session.save_or_update(domain_object) on each of the items you wish to insert (or ignore), and finally call session.commit() when you want to write them to your database table.
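In recent SQLAlchemy versions that method is session.merge(); note that merge() updates an existing row rather than leaving it untouched. A rough sketch of the ORM approach with a placeholder model:
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Item(Base):  # placeholder domain model
    __tablename__ = 'my_table'
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine('sqlite:///my_database.db')
Base.metadata.create_all(engine)

with Session(engine) as session:
    for pk, name in [(1, 'a'), (2, 'b')]:
        session.merge(Item(id=pk, name=name))  # insert-or-update keyed on the primary key
    session.commit()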
This question looks like a duplicate of SQLAlchemy - INSERT OR REPLACE equivalent
