Deleting duplicate rows in "large" sqlite table takes too much time (Python) - python

I have a relatively small sqlite3 database (~2.6GB) with 820k rows and 26 columns (single table). I run an iterative process, and every time new data is generated, the data is placed in a pandas dataframe, and inserted into the sqlite database with the function insert_values_to_table. This process operates fine and is very fast.
After every data insert, the database is sanitized of duplicate rows (a row counts as a duplicate only if all 26 columns match) with the function sanitize_database. This operation connects to the database in a similar fashion, creates a cursor, and executes the following logic: create a new temporary table containing only the unique rows of the original table --> delete all rows from the original table --> insert all rows from the temporary table back into the emptied original table --> drop the temporary table.
It works, but the sanitize_database function is extremely slow and can easily take up to an hour even for this small dataset. I tried to set a certain column as primary key or to give it a unique constraint; however, pandas.DataFrame.to_sql cannot skip duplicates on insert: it either inserts the whole dataframe at once or nothing at all. That functionality can be reviewed here (append_skipdupes).
Is there a method to make this process more efficient?
# Function to insert pandas dataframes into the SQLite3 database
def insert_values_to_table(table_name, output):
    conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/table_name.db")
    # If a connection exists, perform the data insertion
    if conn is not None:
        c = conn.cursor()
        # Append the pandas data (output) to the SQL database
        output.to_sql(name=table_name, con=conn, if_exists='append', index=False)
        # Close connection
        conn.close()
        print('SQL insert process finished')
# To keep only unique rows in the SQLite3 database
def sanitize_database():
    conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/table_name.db")
    c = conn.cursor()
    c.executescript("""
        CREATE TABLE temp_table AS SELECT DISTINCT * FROM table_name;
        DELETE FROM table_name;
        INSERT INTO table_name SELECT * FROM temp_table;
        DROP TABLE temp_table;
    """)
    conn.close()
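One way to make this more efficient (a sketch, not from the original post) is to let SQLite skip duplicates at insert time instead of rewriting the whole table afterwards: declare a UNIQUE index over all 26 columns once, stage each new dataframe in a temporary table, and copy it over with INSERT OR IGNORE. The column names below are placeholders, the connect_to_db helper is the one used above, and the sketch assumes the table has already been deduplicated once and that the dataframe columns match the table's column order.

# Sketch only: placeholder column names, adapt to the real 26-column schema
def insert_values_skip_duplicates(table_name, output):
    conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/table_name.db")
    if conn is not None:
        # One-time: a unique index across all columns lets SQLite reject duplicates itself
        conn.execute(
            f"CREATE UNIQUE INDEX IF NOT EXISTS idx_unique_rows ON {table_name} (col1, col2, col3)"  # list all 26 columns here
        )
        # Stage the new rows, then copy them over, silently skipping duplicates
        output.to_sql(name="staging_table", con=conn, if_exists="replace", index=False)
        conn.execute(f"INSERT OR IGNORE INTO {table_name} SELECT * FROM staging_table")
        conn.execute("DROP TABLE staging_table")
        conn.commit()
        conn.close()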

Related

How to update a table (accessed in pandas) in a DuckDB database?

I'm working on a use case where a large volume of records is created in a DuckDB database table. These tables are read into a pandas dataframe, the data manipulations are done there, and the results are sent back to the DB table. I will explain my case below.
I have a DB called MY_DB in DuckDB and a table in it called ROLL_TABLE_A; it is queried and converted to a pandas dataframe _df.
The same table (ROLL_TABLE_A) can be accessed by multiple users, who make the required updates on the dataframe _df.
How do I upload the dataframe _df back into the same table ROLL_TABLE_A?
Steps to reproduce:
# Connection and cursor creations
dbas_db_con = duckdb.connect('MY_DB.db')
# list of DB TABLE
dbas_db_con.execute("SHOW TABLES").df()
# Query on DB table
dbas_db_con.execute("SELECT * FROM ROLL_TABLE_A").df()
# Convert the database table to a pandas dataframe
_df = dbas_db_con.execute("SELECT * FROM ROLL_TABLE_A").df()
Here the id field on _df is filled in by multiple users, and the pandas dataframe is updated accordingly.
This updated dataframe then needs to be written back to the ROLL_TABLE_A table in DuckDB, so that
dbas_db_con.execute("SELECT * FROM ROLL_TABLE_A").df()
produces the updated output when ROLL_TABLE_A is accessed again.
Here is a function that takes a dataframe, table name and database path as input and writes the dataframe to the table:
def df_to_duckdb(df: pd.DataFrame, table: str, db_path: str):
    con = duckdb.connect(database=db_path, read_only=False)
    # Register the df in the database so it can be queried
    con.register("df", df)
    query = f"create or replace table {table} as select * from df"
    con.execute(query)
    con.close()
The part that took me a while to figure out is registering the df as a relation (i.e. a table/view) in the database. Registering does not write the df to the database; it essentially creates a pointer within the database that references the df in memory.
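For completeness, a minimal usage sketch (the example dataframe here is illustrative, not from the original post):

import duckdb
import pandas as pd

# Hypothetical stand-in for the updated _df from the question
_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Overwrite ROLL_TABLE_A in MY_DB.db with the contents of _df
df_to_duckdb(_df, table="ROLL_TABLE_A", db_path="MY_DB.db")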

Upsert / merge tables in SQLite

I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of going through the UPSERT command, why don't you create your own algorithm that finds matching values and replaces them when the date & time is found, and otherwise inserts a new row? Check out the code I wrote for you. Let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name in the algorithm with a variable and looping over your whole list of table names.
import sqlite3
import pandas as pd

csv_data = pd.read_csv("my_CSV_file.csv")  # Your CSV data path
connection_str = "my_database.db"          # Placeholder: path to your SQLite database file

def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")  # Read the existing rows from the table
    data = cur.fetchall()
    old_data_list = []  # Collection of all dates already in the database table
    for line in data:
        old_data_list.append(line[0])  # I suppose your date column is at index 0
    for new_data in csv_data.itertuples(index=False):  # Iterate over the CSV rows
        if new_data[0] in old_data_list:
            # Update the remaining columns when the date already exists
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # Insert a new row when the date is not found
            cur.execute("INSERT INTO my_CSV_data VALUES (?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that explains how to use it, but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Wrap the statements in a transaction and they will effectively be executed in bulk.
As the existence of that library suggests, to_sql does not create UPSERT commands (only INSERT).
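For reference, a minimal sketch of the ON CONFLICT form of UPSERT (available in SQLite 3.24+), applied in bulk with executemany; the table and column names here are illustrative, not from the original question:

import sqlite3

con = sqlite3.connect("events.db")  # placeholder database path
# Hypothetical table: the datetime string is the primary key, as described in the question
con.execute("CREATE TABLE IF NOT EXISTS events (event_time TEXT PRIMARY KEY, value1 REAL, value2 REAL)")

new_rows = [("2021-01-01 00:00:00", 1.0, 2.0),
            ("2021-01-01 01:00:00", 3.0, 4.0)]

# Bulk upsert: insert new rows, or update the non-key columns when the key already exists
con.executemany(
    """
    INSERT INTO events (event_time, value1, value2)
    VALUES (?, ?, ?)
    ON CONFLICT(event_time) DO UPDATE SET
        value1 = excluded.value1,
        value2 = excluded.value2
    """,
    new_rows,
)
con.commit()
con.close()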

How to create/populate a SQLITE table from JOIN command?

I am trying to join two tables on a column and then populate a new table with the query results.
I know that the join command gives me the table data I want, but how do I insert this data into a new table without having to loop through the results, given that there are many unique column names? Is there a way to do this with a SQLite command? Doing it without a SQLite command would require nested for loops and become computationally expensive (if it even works).
Join command that works:
connection = sqlite3.connect("database1.db")
c = connection.cursor()
c.execute("ATTACH DATABASE 'database1.db' AS db_1")
c.execute("ATTACH DATABASE 'database2.db' AS db_2")
c.execute("SELECT * FROM db_1.Table1Name AS a JOIN db_2.Table2Name AS b WHERE a.Column1 = b.Column2")
An attempted join-and-insert command that runs without error but does not populate the table:
c.execute("INSERT INTO 'NewTableName' SELECT * FROM db_1.Table1Name AS a JOIN db_2.Table2Name AS b WHERE a.Column1 = b.Column2")
The SQL part is:
CREATE TABLE new_table AS
SELECT expressions
FROM existing_tables
[WHERE conditions];
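Applied to the attached databases from the question, that could look like the sketch below, which creates and populates NewTableName in one step; it assumes the joined tables do not share column names (otherwise list explicit expressions instead of SELECT *):

import sqlite3

connection = sqlite3.connect("database1.db")
c = connection.cursor()
c.execute("ATTACH DATABASE 'database1.db' AS db_1")
c.execute("ATTACH DATABASE 'database2.db' AS db_2")
# Create the new table directly from the join result
c.execute("""
    CREATE TABLE NewTableName AS
    SELECT *
    FROM db_1.Table1Name AS a
    JOIN db_2.Table2Name AS b ON a.Column1 = b.Column2
""")
connection.commit()
connection.close()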

Writing Python Dataframe to MSSQL Table

I currently have a Python dataframe that is 23 columns and 20,000 rows.
Using Python code, I want to write my data frame into a MSSQL server that I have the credentials for.
As a test I am able to successfully write some values into the table using the code below:
connection = pypyodbc.connect('Driver={SQL Server};'
                              'Server=XXX;'
                              'Database=XXX;'
                              'uid=XXX;'
                              'pwd=XXX')
cursor = connection.cursor()

# Row-by-row insert attempt (column names assumed to match the dataframe)
for index, row in df_EVENT5_15.iterrows():
    cursor.execute("INSERT INTO MODREPORT (rowid, OPCODE, LOCATION, TRACKNAME) VALUES (?, ?, ?, ?)",
                   (row['rowid'], row['OPCODE'], row['LOCATION'], row['TRACKNAME']))

# Simple test insert that works
cursor.execute("INSERT INTO MODREPORT (rowid, LOCATION) VALUES (?, ?)", (5, 'test'))
connection.commit()
But how do I write all the rows in my data frame table to the MSSQL server? In order to do so, I need to code up the following steps in my Python environment:
Delete all the rows in the MSSQL server table
Write my dataframe to the server
When you say Python data frame, I'm assuming you're using a Pandas dataframe. If that's the case, then you could use the to_sql function.
df.to_sql("MODREPORT", connection, if_exists="replace")
The if_exists argument set to replace will delete all the rows in the existing table before writing the records.
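Note that recent pandas versions expect a SQLAlchemy connectable for to_sql when talking to MSSQL, so the call is usually made through an engine rather than a raw pypyodbc connection. A sketch with placeholder credentials:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: adjust server, database, credentials and ODBC driver name
engine = create_engine("mssql+pyodbc://XXX:XXX@XXX/XXX?driver=ODBC+Driver+17+for+SQL+Server")

# Drop and recreate MODREPORT, then write every dataframe row
df_EVENT5_15.to_sql("MODREPORT", engine, if_exists="replace", index=False)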
I realise it's been a while since you asked but the easiest way to delete ALL the rows in the SQL server table (point 1 of the question) would be to send the command
TRUNCATE TABLE Tablename
This will drop all the data in the table but leave the table and indexes empty so you or the DBA would not need to recreate it. It also uses less of the transaction log when it runs.
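With the pypyodbc connection and cursor from the question, that could look like this sketch:

# Empty the table first, then append the dataframe rows (e.g. with to_sql as above)
cursor.execute("TRUNCATE TABLE MODREPORT")
connection.commit()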

Python 3 with SQL insert query results in error "column count doesn't match value count at row"

I am trying to upload data from a csv file (its on my local desktop) to my remote SQL database. This is my query
dsn = "dsnname";pwd="password"
import pyodbc
csv_data =open(r'C:\Users\folder\Desktop\filename.csv')
def func(dsn):
cnnctn=pyodbc.connect(dsn)
cnnctn.autocommit =True
cur=cnnctn.cursor()
for rows in csv_data:
cur.execute("insert into database.tablename (colname) value(?)", rows)
cur.commit()
cnnctn.commit()
cur.close()
cnnctn.close()
return()
c=func(dsn)
The problem is that all of my data gets uploaded into the one column that I specified. If I don't specify a column name it won't run. I have 9 columns in my database table and I want to upload this data into separate columns.
When you insert with SQL, you need to tell the database which columns you are inserting into. For example, when you execute:
INSERT INTO table (column_name) VALUES (val);
you are telling SQL to map column_name to val for that specific row. So you need to make sure that the number of columns in the first set of parentheses matches the number of values in the second set of parentheses.
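Applied to the question, that means splitting each CSV line into its fields and providing one placeholder per column; a sketch using hypothetical column names col1 through col9:

import csv
import pyodbc

cnnctn = pyodbc.connect(dsn)
cur = cnnctn.cursor()

with open(r'C:\Users\folder\Desktop\filename.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:  # row is a list with one value per CSV field
        cur.execute(
            "insert into database.tablename (col1, col2, col3, col4, col5, col6, col7, col8, col9) "
            "values (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            row)

cnnctn.commit()
cur.close()
cnnctn.close()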
