Insert/delete operation for large data in python without merge


I am quite new to Python. I have a table that I want to update daily. I get a CSV file with a large amount of data, about 15000 entries. Each row from the CSV file has to be inserted into my table. But if a specific value from the file matches the primary key of any of the rows, then I want to delete that row from the table and insert the corresponding row from the CSV file instead. So, for example, if my CSV file is like this:
001|test1|test11|test111
002|test2|test22|test222
003|test3|test33|test333
And in my table I have a row with primary key column value=002, then delete that row and insert corresponding row from the file.
I don't have an idea about how many rows I could get in that csv every day, with values matching primary key. I know this can be done with a MERGE query but I am not really sure if it will take a longer time than any other method. And it would also require me to create a temp table and truncate it every time. Same if I use WHERE EXISTS, I would need a temp table.
What is the most efficient way to do this task?
I am using Python 2.7.5 and SQL Server 2017

I think using a MERGE statement is the optimal solution. Create a staging table matching your target table, truncate it, and insert the CSV data into it. If your SQL Server instance has access to the file, you can use BULK INSERT or OPENROWSET to load it; otherwise, use Python. To load the staged data into the target table, use a MERGE statement.
If your table has column names Id, Col1, Col2, Col3 then something like this:
MERGE INTO dbo.MyTable AS TargetTable
USING
(
    SELECT Id, Col1, Col2, Col3
    FROM dbo.stage_MyTable
) AS SourceTable
ON TargetTable.Id = SourceTable.Id
WHEN MATCHED THEN UPDATE SET
    Col1 = SourceTable.Col1,
    Col2 = SourceTable.Col2,
    Col3 = SourceTable.Col3
WHEN NOT MATCHED BY TARGET THEN INSERT
    (Id, Col1, Col2, Col3)
VALUES
    (SourceTable.Id, SourceTable.Col1, SourceTable.Col2, SourceTable.Col3)
;
The benefit of this approach is that the MERGE runs as a single statement and is therefore atomic: if it fails (for example, because of duplicate rows in the staging table), the target table is rolled back to its previous state.

Related

Upsert / merge tables in SQLite

I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).

Instead of using the UPSERT command, why not write your own algorithm that looks up each date and time, updates the row if it is found, and inserts a new row otherwise? Check out the code I wrote for you below, and let me know if you are still confused. You can do the same for hundreds of tables by replacing the table name in the algorithm with a variable and looping over the list of your table names.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # Path to your CSV data
connection_str = "my_database.db"  # Placeholder: path to your SQLite database file

def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    # Collect all dates already present in the database table.
    cur.execute("SELECT my_date_column FROM my_CSV_data")
    old_dates = {line[0] for line in cur.fetchall()}  # a set gives fast membership tests
    # Iterating over a DataFrame directly yields column names, so use itertuples() to get rows.
    # This assumes the date column is at index 0 of each row.
    for new_data in csv_data.itertuples(index=False, name=None):
        if new_data[0] in old_dates:
            # The date already exists: update the remaining columns of that row.
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # The date was not found: insert a new row.
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()

manual_upsert()

First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that documents how to use it but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction and the statements are going to be executed in bulk.
As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
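To make the UPSERT and transaction points above concrete, here is a minimal sketch using sqlite3 with an in-memory database. The table name events and the columns event_time, col1 and col2 are placeholders, not names from the question, and the ON CONFLICT ... DO UPDATE syntax assumes the underlying SQLite library is version 3.24 or newer.

```python
import sqlite3

def bulk_upsert(con, rows):
    """Insert rows; where event_time already exists, overwrite the other columns."""
    sql = (
        "INSERT INTO events (event_time, col1, col2) VALUES (?, ?, ?) "
        "ON CONFLICT(event_time) DO UPDATE SET "
        "col1 = excluded.col1, col2 = excluded.col2"
    )
    # 'with con' wraps all statements in one transaction:
    # it commits on success and rolls back if an exception is raised.
    with con:
        con.executemany(sql, rows)

# Hypothetical table; the primary key plays the role of the date/time column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_time TEXT PRIMARY KEY, col1 REAL, col2 REAL)")

bulk_upsert(con, [("2020-01-01 00:00:00", 1.0, 2.0),
                  ("2020-01-01 00:01:00", 3.0, 4.0)])
# Second batch: one existing key (updated) and one new key (inserted).
bulk_upsert(con, [("2020-01-01 00:01:00", 30.0, 40.0),
                  ("2020-01-01 00:02:00", 5.0, 6.0)])

result = con.execute("SELECT * FROM events ORDER BY event_time").fetchall()
```

If the new data is in a pandas DataFrame with the same column order, df.itertuples(index=False, name=None) should be usable as the rows argument.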

Update a row with a specific id

id is the first column of my Sqlite table.
row is a list or tuple with the updated content, with the columns in the same order as in the database.
How can I do an update command with:
c.execute('update mytable set * = ? where id = ?', row)
without hardcoding all the column names? (I'm in prototyping phase, and this is often subject to change, that's why I don't want to hardcode the column names now).
Obviously * = ? is probably incorrect, how to modify this?
Also, having where id = ? at the end of the query should expect having id as the last element of row, however, it's the first element of row (because, still, row elements use the same column order as the database itself, and id is first column).

You could extract the column names using the table_info PRAGMA, which returns the columns in table order. You could then build the statement in parts and finally combine them.
e.g. for a table defined with :-
CREATE TABLE "DATA" ("idx" TEXT,"status" INTEGER,"unit_val" TEXT DEFAULT (null) );
Then
PRAGMA table_info (data);
returns one row per column, with the fields cid, name, type, notnull, dflt_value and pk,
i.e. you want to extract the name field.
You may be interested in - PRAGMA Statements
An alternative approach would be to extract the create sql from sqlite_master. However that would require more complex code to extract the column names.
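A minimal sketch of the table_info approach in Python follows. The helper name update_row_by_id is hypothetical, and the sketch assumes, as in the question, that the id is the first column and that row holds the values in table order. The table name is interpolated into the SQL, so it must come from trusted code, not user input.

```python
import sqlite3

def update_row_by_id(con, table, row):
    """Update every non-id column of the row whose id equals row[0]."""
    # Field 1 of each table_info row is the column name, in table order.
    cols = [info[1] for info in con.execute(f'PRAGMA table_info("{table}")')]
    id_col, other_cols = cols[0], cols[1:]
    assignments = ", ".join(f'"{c}" = ?' for c in other_cols)
    sql = f'UPDATE "{table}" SET {assignments} WHERE "{id_col}" = ?'
    # Move the id from the front of the row to the end to match the placeholders.
    params = tuple(row[1:]) + (row[0],)
    with con:
        con.execute(sql, params)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
con.execute("INSERT INTO mytable VALUES (1, 'old', 10)")
con.execute("INSERT INTO mytable VALUES (2, 'other', 20)")

update_row_by_id(con, "mytable", (1, "new", 11))
updated = con.execute("SELECT * FROM mytable ORDER BY id").fetchall()
```

Because the column list is read at call time, the helper keeps working when columns are added or renamed during prototyping, as long as row still matches the table's column order.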

Copy row from Cassandra database and then insert it using Python

I'm using plugin DataStax Python Driver for Apache Cassandra.
I want to read 100 rows from database and then insert them again into database after changing one value. I do not want to miss previous records.
I know how to get my rows:
rows = session.execute('SELECT * FROM columnfamily LIMIT 100;')
for myrecord in rows:
print(myrecord.timestamp)
I know how to insert new rows into database:
stmt = session.prepare('''
    INSERT INTO columnfamily (rowkey, qualifier, info, act_date, log_time)
    VALUES (?, ?, ?, ?, ?)
    IF NOT EXISTS
''')
results = session.execute(stmt, [arg1, arg2, ...])
My problems are that:
I do not know how to change only one value in a row.
I don't know how to insert rows into the database without using CQL. My columnfamily has more than 150 columns, and writing all their names in a query does not seem like the best idea.
To conclude:
Is there a way to get rows, modify one value from every one of them and then insert this rows into database without using only CQL?

First, you need to select only needed columns from Cassandra - it will be faster to transfer the data. You need to include all columns of primary key + column that you want to change.
After you get the data, you can use UPDATE command to update only necessary column (example from documentation):
UPDATE cycling.cyclist_name
SET comments = 'Rides hard, gets along with others, a real winner'
WHERE id = fb372533-eb95-4bb4-8685-6ef61e994caa
You can also use prepared statement to make it more performant...
But be careful - the UPDATE & INSERT in CQL are really UPSERTs, so if you change columns that are part of primary key, then it will create new entry...
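The select-then-update idea could be sketched as below with the DataStax driver's session.prepare and session.execute. The sketch assumes, purely for illustration, that the primary key of columnfamily is (rowkey, qualifier) and that info is the column to change; check your actual schema before using it. The function takes an already connected session, so nothing here opens a cluster connection.

```python
def rewrite_info(session, new_value, limit=100):
    """Set the (assumed) 'info' column to new_value for up to `limit` rows."""
    # Fetch only the primary key columns (assumed to be rowkey and qualifier).
    rows = session.execute(
        "SELECT rowkey, qualifier FROM columnfamily LIMIT %d" % int(limit)
    )
    # Prepare once, execute many times.
    update = session.prepare(
        "UPDATE columnfamily SET info = ? WHERE rowkey = ? AND qualifier = ?"
    )
    updated = 0
    for row in rows:
        # Only the 'info' column is written; the other columns stay untouched.
        session.execute(update, (new_value, row.rowkey, row.qualifier))
        updated += 1
    return updated
```

Since only one non-key column is written, none of the other 150 columns needs to be named in the query.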

Using SELECT IN against 5 millions+ records

I have a list of around 6 million entries in a text file. I have to check them against a table and return ALL rows that are in the text file. For that purpose I want to use SELECT ... IN. I want to know whether it is OK to put all of them into a single query and run it.
I am using MySQL.

You can create a temporary table or variable in the database, insert the values into it, and then perform the IN operation against it, like this:
SELECT field
FROM table
WHERE value IN (SELECT somevalue FROM sometable)
Thanks
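As a rough illustration of the temporary-table pattern, here is a sketch that uses sqlite3 from the standard library so that it runs without a server. The table items and its columns field and value are placeholders. MySQL also supports CREATE TEMPORARY TABLE, but the client library and some syntax would differ there (for example INSERT IGNORE instead of INSERT OR IGNORE).

```python
import sqlite3

def rows_matching(con, values):
    """Return rows of 'items' whose value appears in the given iterable."""
    with con:  # load the lookup table in a single transaction
        con.execute("CREATE TEMP TABLE IF NOT EXISTS wanted (value TEXT PRIMARY KEY)")
        con.execute("DELETE FROM wanted")
        # executemany accepts a generator, so the values need not all sit in one list.
        con.executemany("INSERT OR IGNORE INTO wanted (value) VALUES (?)",
                        ((v,) for v in values))
    return con.execute(
        "SELECT field, value FROM items "
        "WHERE value IN (SELECT value FROM wanted) ORDER BY field"
    ).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (field TEXT, value TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [("f1", "a"), ("f2", "b"), ("f3", "c")])

# In practice the values would be read line by line from the text file.
matches = rows_matching(con, ["a", "c", "zzz", "a"])
```

Loading the values into an indexed table keeps the query text small, whatever the number of entries.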

Python: sqlite3 - how to speed up updating of the database

I have a database, which I store as a .db file on my disk. I implemented all the functions necessary for managing this database using sqlite3. However, I noticed that updating the rows in the table takes a large amount of time. My database currently has 608042 rows. The database has one table - let's call it Table1. This table consists of the following columns:
id | name | age | address | job | phone | income
(the id value is generated automatically when a row is inserted into the database).
After reading-in all the rows I perform some operations (ML algorithms for predicting the income) on the values from the rows, and next I have to update (for each row) the value of income (thus, for each one from 608042 rows I perform the SQL update operation).
In order to update, I'm using the following function (copied from my class):
def update_row(self, new_value, idkey):
    update_query = "UPDATE Table1 SET income = ? WHERE name = ?"
    self.cursor.execute(update_query, (new_value, idkey))
    self.db.commit()
And I call this function for each person registered in the database.
for each i out of 608042 rows:
update_row(new_income_i, i.name)
(values of new_income_i are different for each i).
This takes a huge amount of time, even though the dataset is not giant. Is there any way to speed up the updating of the database? Should I use something else than sqlite3? Or should I instead of storing the database as a .db file store it in memory (using sqlite3.connect(":memory:"))?

Each UPDATE statement must scan the entire table to find any row(s) that match the name.
An index on the name column would prevent this and make the search much faster. (See Query Planning and How does database indexing work?)
However, if the name column is not unique, then that value is not even suitable to find individual rows: each update with a duplicate name would modify all rows with the same name. So you should use the id column to identify the row to be updated; and as the primary key, this column already has an implicit index.
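A minimal sketch of updating by id follows, using a cut-down version of Table1. It also batches all updates into a single executemany call inside one transaction, which is an addition beyond the answer above: committing after every row, as update_row in the question does, is typically slow in its own right. The function name update_incomes is hypothetical.

```python
import sqlite3

def update_incomes(con, new_incomes):
    """new_incomes: iterable of (income, id) pairs."""
    # One transaction for the whole batch instead of one commit per row.
    with con:
        con.executemany("UPDATE Table1 SET income = ? WHERE id = ?", new_incomes)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Table1 (id INTEGER PRIMARY KEY, name TEXT, income REAL)")
# Two people share a name, so 'WHERE name = ?' could not tell them apart.
con.executemany("INSERT INTO Table1 (name, income) VALUES (?, ?)",
                [("Ann", 0.0), ("Ann", 0.0), ("Bob", 0.0)])

update_incomes(con, [(100.0, 1), (200.0, 2), (300.0, 3)])
incomes = con.execute("SELECT id, income FROM Table1 ORDER BY id").fetchall()
```

Each row gets its own value even where names collide, because the lookup goes through the primary key.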
