SQLite - unable to combine thousands of DELETE statements into a single transaction - Python

I'm facing an issue where deleting thousands of rows from a single table takes a long time (over 2 minutes for 14k records), while inserting the same records is nearly instant (200 ms). Both the insert and delete statements are handled identically: a loop generates the statements and appends them to a list, then the list is passed to a separate function that opens a transaction, executes all the statements and finishes with a commit. At least that was my impression before I started testing this with pseudocode - it looks like I misunderstood the need to open the transaction manually.
I've read about transactions (https://www.sqlite.org/faq.html#q19), but since the inserts are pretty much instant, I am not sure this is the cause here.
My understanding is that a transaction ends with a commit, and if this is correct then it looks like all the delete statements are in a single transaction: mid-processing I can still see all the rows to be deleted, and only after the final commit are they actually gone. I.e. the situation described in the FAQ link above should not apply here, since no per-statement commit takes place. But the slow speed suggests that something else is slowing things down, as if each delete statement were a separate transaction.
After running the pseudocode it appears that, while the changes are indeed not committed until an explicit commit is sent (via conn.commit()), the "BEGIN" or "BEGIN TRANSACTION" in front of the loop has no effect. I think this is because the sqlite3 module issues the "BEGIN" automatically in the background (see: Merge SQLite files into one db file, and 'begin/commit' question).
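For reference, a minimal sketch of taking transaction handling away from the module (my assumption of how explicit control would look, using the same test table as the pseudocode below): with isolation_level=None the module stops issuing BEGIN behind the scenes, so the explicit BEGIN/COMMIT are the only transaction boundaries.

import sqlite3

# isolation_level=None puts the connection in autocommit mode, so sqlite3 no longer
# issues an implicit BEGIN and the explicit statements below mark the transaction.
conn = sqlite3.connect('/data/test.db', isolation_level=None)

conn.execute('BEGIN')
for i in range(1000):
    # assumes test_table/column1 exist as in the pseudocode below
    conn.execute('DELETE FROM test_table WHERE column1 = ?', (str(i),))
conn.execute('COMMIT')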
Pseudocode to test this out:
import sqlite3
from datetime import datetime

insert_queries = []
delete_queries = []
rows = 30000
for i in range(rows):
    insert_queries.append(f'''INSERT INTO test_table ("column1") VALUES ("{i}");''')
for i in range(rows):
    delete_queries.append(f'''DELETE from test_table where column1 ="{i}";''')

conn = sqlite3.connect('/data/test.db', check_same_thread=False)
# assumed table definition so the pseudocode is runnable; the original schema was not shown
conn.execute('CREATE TABLE IF NOT EXISTS test_table (column1 TEXT)')

timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print('*'*50)
print(f'Starting inserts: {timestamp}')
# conn.execute('BEGIN TRANSACTION')
for query in insert_queries:
    conn.execute(query)
conn.commit()
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Finished inserts: {timestamp}')
print('*'*50)

timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Starting deletes: {timestamp}')
# conn.isolation_level = None
# conn.execute('BEGIN;')
# conn.execute('BEGIN TRANSACTION;')
for query in delete_queries:
    conn.execute(query)
conn.commit()
timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%S")
print(f'Finished deletes: {timestamp}')
One weird thing is that the delete time grows much faster than linearly with the row count (2 s to delete 10k rows, 7 s for 20k rows, 43 s for 50k rows), while the insert time stays near-instant regardless of row count.
EDIT:
The original question was: why do the delete statements take so much more time than the insert statements, and how can they be sped up so that inserting and deleting rows take a similar amount of time?
As per snakecharmerb's suggestion, one workaround is to do it like this:
rows = 100000
delete_ids = ''
for i in range(rows):
    if delete_ids:
        delete_ids += f',"{i}"'
    else:
        delete_ids += f'"{i}"'
delete_str = f'''DELETE from test_table where column1 IN ({delete_ids});'''
conn.execute(delete_str)
conn.commit()
While this most likely goes against best practices, it does seem to work - it takes around 2 seconds to delete 1 million rows.
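The same idea can be written with bound parameters instead of string formatting; this is a sketch added for illustration, not the code that was timed. Note that SQLite caps the number of ? parameters per statement (SQLITE_MAX_VARIABLE_NUMBER, 999 on older builds), so very large lists still need to be chunked, as in the batching example below.

ids = [str(i) for i in range(rows)]
placeholders = ','.join(['?'] * len(ids))
# one DELETE with a bound parameter per id, subject to the parameter limit mentioned above
conn.execute(f'DELETE FROM test_table WHERE column1 IN ({placeholders})', ids)
conn.commit()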

I tried batching the deletes in sets of 50:
...
batches = []
batch = []
for i in range(rows):
    batch.append(str(i))
    if len(batch) == 50:
        batches.append(batch)
        batch = []
if batch:
    batches.append(batch)
...
base = 'DELETE FROM test_table WHERE column1 IN ({})'
for batch in batches:
    placeholders = ','.join(['?'] * len(batch))
    sql = base.format(placeholders)
    conn.execute(sql, batch)
conn.commit()
...
and this reduced the duration to 1 - 2 seconds (from 6 - 8 originally).
Combining this approach with executemany resulted in a 1 second duration.
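One reading of the executemany variant, as a hedged sketch (the exact code used for that timing isn't shown): run the 50-placeholder statement over all full batches in a single executemany() call, and finish the last partial batch with a plain execute().

full_batches = [b for b in batches if len(b) == 50]
partial_batches = [b for b in batches if len(b) < 50]

# one statement, many parameter sets - sqlite3 executes it once per batch
sql_50 = 'DELETE FROM test_table WHERE column1 IN ({})'.format(','.join(['?'] * 50))
conn.executemany(sql_50, full_batches)

for batch in partial_batches:
    placeholders = ','.join(['?'] * len(batch))
    conn.execute('DELETE FROM test_table WHERE column1 IN ({})'.format(placeholders), batch)

conn.commit()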
Using a subquery to define the rows to delete was almost instant:
DELETE FROM test_table WHERE column1 IN (SELECT column1 FROM test_table)
but it's possible SQLite recognises that this query is equivalent to a bare DELETE FROM test_table and optimises it accordingly.
Switching off the secure_delete PRAGMA seemed to make performance even worse.
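For reference, this is how that pragma is toggled (a sketch; the setting applies per connection):

conn.execute('PRAGMA secure_delete = OFF')
# ... run the deletes ...
conn.execute('PRAGMA secure_delete = ON')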

Related

Efficient way to delete a large amount of records from a big table using python

I have a large table (about 10 million rows) from which I need to delete records that are "older" than 10 days (according to the created_at column). I have a Python script that I run to do this. created_at is a varchar(255) and has values like e.g. 1594267202000
import mysql.connector
import sys
from mysql.connector import Error

table = sys.argv[1]
deleteDays = sys.argv[2]

sql_select_query = """SELECT COUNT(*) FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))""".format(table)
sql_delete_query = """DELETE FROM {} WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY)) LIMIT 100""".format(table)

try:
    connection = mysql.connector.connect(host='localhost',
                                         database='myDatabase',
                                         user='admin123',
                                         password='password123')
    cursor = connection.cursor()
    # initial count of rows before deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]
    while records >= 1:
        # stuck at the following line and a timeout happens...
        cursor.execute(sql_delete_query, (deleteDays,))
        connection.commit()
        cursor.execute(sql_select_query, (deleteDays,))
        records = cursor.fetchone()[0]
    # final count of rows after deletion
    cursor.execute(sql_select_query, (deleteDays,))
    records = cursor.fetchone()[0]
    if records == 0:
        print("\nRows deleted")
    else:
        print("\nRows NOT deleted")
except mysql.connector.Error as error:
    print("Failed to delete: {}".format(error))
finally:
    if connection.is_connected():
        cursor.close()
        connection.close()
        print("MySQL connection is closed")
When I run this script and it reaches the DELETE query, however, it fails with:
Failed to delete: 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
I know that innodb_lock_wait_timeout is currently set to 50 seconds and I could increase it to overcome this problem, but I'd rather not touch the timeout. I basically want to delete in chunks instead. Does anyone know how I can do that, using my code as an example?
One approach here might be to use a delete limit query, to batch your deletes at a certain size. Assuming batches of 100 records:
DELETE
FROM yourTable
WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))
LIMIT 100;
Note that strictly speaking you should always have an ORDER BY clause when using LIMIT. What I wrote above might delete any 100 records matching the criteria for deletion.
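Putting that together with the question's loop, a sketch of the batched delete might look like this (ORDER BY added per the note above; table, deleteDays, cursor and connection are reused from the question's script, and cursor.rowcount is used to stop once a batch deletes fewer than 100 rows):

sql_delete_query = """DELETE FROM {}
    WHERE created_at / 1000 < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL %s DAY))
    ORDER BY created_at
    LIMIT 100""".format(table)

while True:
    cursor.execute(sql_delete_query, (deleteDays,))
    connection.commit()
    if cursor.rowcount < 100:   # fewer than a full batch deleted -> nothing left to delete
        break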
created_at has no index and is a varchar(255) – Saffik (comment)
There's your problem. Two of them.
It needs to be indexed to have any hope of being performant. Without an index, MySQL has to check every record in the table. With an index, it can skip straight to the ones which match.
While storing an integer as a varchar will work (MySQL will convert it for you), it's bad practice: it wastes storage, allows bad data, and is slow.
Change created_at to a bigint so that it's stored as a number, then index it.
alter table your_table modify column created_at bigint;
create index created_at_idx on your_table(created_at);
Now that created_at is an indexed bigint, your query should use the index and it should be very fast.
Note that created_at should be a datetime which stores the time at microsecond accuracy. Then you can use MySQL's date functions without having to convert.
But that's going to mess with your code which expects a millisecond epoch number, so you're stuck with it. Keep it in mind for future tables.
For this table, you can add a generated created_at_datetime column to make working with dates easier. And, of course, index it.
alter table your_table add column created_at_datetime datetime generated always as (from_unixtime(created_at/1000));
create index created_at_datetime on your_table(created_at_datetime);
Then your where clause becomes much simpler.
WHERE created_at_datetime < DATE_SUB(NOW(), INTERVAL %s DAY)

pd.read_sql method to count number of rows in a large Access database

I am trying to read the number of rows in a large access database and I am trying to find the most efficient method. Here is my code:
import pyodbc

driver = 'access driver as string'
DatabaseLink = 'access database link as string'
Name = 'access table name as string'

conn = pyodbc.connect(r'Driver={' + driver + '};DBQ=' + DatabaseLink + ';')
cursor = conn.cursor()
AccessSize = cursor.execute('SELECT COUNT(1) FROM ' + Name).fetchone()[0]
conn.close()
This works and AccessSize does give me an integer with the number of rows in the database; however, it takes far too long to compute (my database has over 2 million rows and 15 columns).
I have attempted to read the data through pd.read_sql and used the chunksize functionality to loop through and count the length of each chunk, but this also takes long. I have also attempted .fetchall in the cursor execute section, but the speed is similar to .fetchone.
I would have thought there would be a faster method to quickly calculate the length of the table, as I don't require the entire table to be read. My thought is to find the index value of the last row, as this essentially is the number of rows, but I am unsure how to do this.
Thanks
From comment to the question:
Unfortunately the database doesn't have a suitable keys or indexes in any of its columns.
Then you can't expect good performance from the database because every SELECT will be a table scan.
I have an Access database on a network share. It contains a single table with 1 million rows and absolutely no indexes. The Access database file itself is 42 MiB. When I do
t0 = time()
df = pd.read_sql_query("SELECT COUNT(*) AS n FROM Table1", cnxn)
print(f'{time() - t0} seconds')
it takes 75 seconds and generates 45 MiB of network traffic. Simply adding a primary key to the table increases the file size to 48 MiB, but the same code takes 10 seconds and generates 7 MiB of network traffic.
TL;DR: Add a primary key to the table or continue to suffer from poor performance.
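If it helps, the primary key can also be added from Python through the same pyodbc connection. This is only a sketch: the Access/Jet DDL syntax and the names (Table1, the new ID column, the constraint name) are assumptions, so verify them against your database before running it.

# assumes the cnxn from the timing example above and a table named Table1 with no ID column yet
crsr = cnxn.cursor()
crsr.execute("ALTER TABLE Table1 ADD COLUMN ID COUNTER CONSTRAINT PK_Table1 PRIMARY KEY")
cnxn.commit()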
2 million should not take that long. I use pd.read_sql(sql, con) like this:
con = connection
sql = """ my sql statement
here"""
table = pd.read_sql(sql=sql, con=con)
Are you doing something different?
In my case I am using a DB2 database; maybe that is why it is faster.

Creating tables in MySQL based on the names of the columns in another table

I have a table with ~133M rows and 16 columns. I want to create 14 tables in another database on the same server, one for each of columns 3-16 (columns 1 and 2 are `id` and `timestamp`, which will be in the final 14 tables as well but won't have their own tables), where each table has the name of the original column. Is this possible to do exclusively with an SQL script? It seems logical to me that this would be the preferred, and fastest, way to do it.
Currently, I have a Python script that "works" by parsing the CSV dump of the original table (testing with 50 rows), creating new tables, and adding the associated values, but it is very slow (I estimated almost 1 year to transfer all 133M rows, which is obviously not acceptable). This is my first time using SQL in any capacity, and I'm certain that my code can be sped up, but I'm not sure how because of my unfamiliarity with SQL. The big SQL string command in the middle was copied from some other code in our codebase. I've tried using transactions as seen below, but it didn't seem to have any significant effect on the speed.
import re
import mysql.connector
import time

# option flags
debug = False   # prints out information during runtime
timing = True   # times the execution time of the program

# save start time for timing. won't be used later if timing is false
start_time = time.time()

# open file for reading
path = 'test_vaisala_sql.csv'
file = open(path, 'r')

# read in column values
column_str = file.readline().strip()
columns = re.split(',vaisala_|,', column_str)  # parse columns with regex to remove commas and vaisala_
if debug:
    print(columns)

# open connection to MySQL server
cnx = mysql.connector.connect(user='root', password='<redacted>',
                              host='127.0.0.1',
                              database='measurements')
cursor = cnx.cursor()

# create the table in the MySQL database if it doesn't already exist
for i in range(2, len(columns)):
    table_name = 'vaisala2_' + columns[i]
    sql_command = "CREATE TABLE IF NOT EXISTS " + \
                  table_name + "(`id` BIGINT(20) NOT NULL AUTO_INCREMENT, " \
                  "`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP, " \
                  "`milliseconds` BIGINT(20) NOT NULL DEFAULT '0', " \
                  "`value` varchar(255) DEFAULT NULL, " \
                  "PRIMARY KEY (`id`), " \
                  "UNIQUE KEY `milliseconds` (`milliseconds`) " \
                  "COMMENT 'Eliminates duplicate millisecond values', " \
                  "KEY `timestamp` (`timestamp`)) " \
                  "ENGINE=InnoDB DEFAULT CHARSET=utf8;"
    if debug:
        print("Creating table", table_name, "in database")
    cursor.execute(sql_command)

# read in rest of lines in CSV file
for line in file.readlines():
    cursor.execute("START TRANSACTION;")
    line = line.strip()
    values = re.split(',"|",|,', line)  # regex split along commas, or commas and quotes
    if debug:
        print(values)
    # iterate over each data column. Starts at 2 to eliminate `id` and `timestamp`
    for i in range(2, len(columns)):
        table_name = "vaisala2_" + columns[i]
        timestamp = values[1]
        # translate timestamp back to epoch time
        try:
            pattern = '%Y-%m-%d %H:%M:%S'
            epoch = int(time.mktime(time.strptime(timestamp, pattern)))
            milliseconds = epoch * 1000  # convert seconds to ms
        except ValueError:  # errors default to 0
            milliseconds = 0
        value = values[i]
        # generate SQL command to insert data into destination table
        sql_command = "INSERT IGNORE INTO {} VALUES (NULL,'{}',{},'{}');".format(table_name, timestamp,
                                                                                 milliseconds, value)
        if debug:
            print(sql_command)
        cursor.execute(sql_command)
    cnx.commit()  # commits changes in destination MySQL server

# print total execution time
if timing:
    print("Completed in %s seconds" % (time.time() - start_time))
This doesn't need to be incredibly optimized; it's perfectly acceptable if the machine has to run for a few days in order to do it. But 1 year is far too long.
You can create a table from a SELECT like:
CREATE TABLE <other database name>.<column name>
AS
SELECT <column name>
FROM <original database name>.<table name>;
(Replace the <...> with your actual object names or extend it with other columns or a WHERE clause or ...)
That will also insert the data from the query into the new table. And it's probably the fastest way.
You could use dynamic SQL and information from the catalog (namely information_schema.columns) to create the CREATE statements or create them manually, which is annoying but acceptable for 14 columns I guess.
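For example, a sketch of the dynamic-SQL route, reusing the cursor from the script above (the source schema/table 'measurements'/'original_table' and the target schema other_db are placeholders for illustration): query information_schema.columns for the data columns, then issue one CREATE TABLE ... AS SELECT per column.

# find the data columns of the source table, skipping the shared id/timestamp columns
cursor.execute("""
    SELECT column_name
    FROM information_schema.columns
    WHERE table_schema = %s
      AND table_name = %s
      AND column_name NOT IN ('id', 'timestamp')
""", ('measurements', 'original_table'))

# create one table per column in the target schema, copying the data in the same statement
for (col,) in cursor.fetchall():
    cursor.execute(
        "CREATE TABLE other_db.`{0}` AS "
        "SELECT `id`, `timestamp`, `{0}` FROM measurements.`original_table`".format(col))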
When using scripts to talk to databases you want to minimise the number of round trips, since each message adds its own latency to your execution time. Currently it looks as if you are sending (by your approximation) around 133 million individual messages, and paying that latency 133 million times. A simple optimisation would be to parse your CSV, split the data into the per-column tables (either in memory or saved to disk), and only then send the data to the new DB in bulk.
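A sketch of that idea, reusing file, columns, cursor and cnx from the script above and replacing its per-row insert loop (the in-memory grouping and the one-executemany-per-table approach are assumptions about how "split, then send" could look):

import re
import time

def to_milliseconds(ts, pattern='%Y-%m-%d %H:%M:%S'):
    # convert the CSV timestamp to epoch milliseconds, defaulting to 0 on parse errors
    try:
        return int(time.mktime(time.strptime(ts, pattern))) * 1000
    except ValueError:
        return 0

# group the rows per destination table in memory first...
rows_by_table = {}
for line in file.readlines():
    values = re.split(',"|",|,', line.strip())
    ms = to_milliseconds(values[1])
    for i in range(2, len(columns)):
        rows_by_table.setdefault('vaisala2_' + columns[i], []).append((values[1], ms, values[i]))

# ...then send each table's rows with a single executemany() call
for table_name, batch in rows_by_table.items():
    cursor.executemany(
        "INSERT IGNORE INTO {} VALUES (NULL, %s, %s, %s)".format(table_name), batch)
cnx.commit()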
As you hinted, it's much quicker to write an SQL script to redistribute the data.

cleaning a Postgres table of bad rows

I have inherited a Postgres database and am currently in the process of cleaning it. I have created an algorithm to find the rows where the data is bad. The algorithm is encoded in the function checkProblem(). Using this, I am able to find the rows that contain bad data, as shown below ...
schema = findTables(dbName)

conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'" % dbName)
cur = conn.cursor()

results = []
for t in tqdm(sorted(schema.keys())):
    n = 0
    cur.execute('select * from %s' % t)
    for i, cs in enumerate(tqdm(cur)):
        if checkProblem(cs):
            n += 1
    results.append({
        'tableName': t,
        'totalRows': i+1,
        'badRows' : n,
    })
cur.close()
conn.close()

print pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
Now I need to delete the rows that are bad. I have two different ways of doing it. First, I could write the clean rows to a temporary table and rename the table, but I think that option is too memory-intensive. It would be much better if I could just delete the specific record at the cursor. Is this even an option?
Otherwise, what is the best way of deleting a record under such circumstances? I am guessing that this should be a relatively common thing that database administrators do ...
Deleting the specific records as the cursor finds them is indeed the better option. You can do something like:
del_cur = conn.cursor()   # second cursor, so the SELECT being iterated over is not disturbed
for i, cs in enumerate(tqdm(cur)):
    if checkProblem(cs):
        # if cs is a tuple with cs[0] being the record id
        del_cur.execute('delete from %s where id=%d' % (t, cs[0]))
conn.commit()
Or you can store the ids of the bad records and then do something like
DELETE FROM table WHERE id IN (id1,id2,id3,id4)
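A sketch of that second option with psycopg2 (assuming, as above, that cs[0] is an integer id column): collect the bad ids first, then delete them in one statement by passing the list as an array parameter.

bad_ids = []
cur.execute('select * from %s' % t)
for cs in cur:
    if checkProblem(cs):
        bad_ids.append(cs[0])

if bad_ids:
    # psycopg2 adapts a Python list to a Postgres array, so ANY(%s) matches all collected ids
    cur.execute('DELETE FROM %s WHERE id = ANY(%%s)' % t, (bad_ids,))
    conn.commit()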

MySQL: among 20000 rows of records only 5 show, why?

I want to test my MySQL database and how it handles my future data. It is a table with just two columns: one column holds a single word and the other holds a text of about 30,000 characters. I copied and re-inserted the same rows into the table 20,000 times, which brings the size to 2.0 GB. Now when I try to browse the rows through phpMyAdmin it shows nothing, and every attempt to display that table fails. When I read the table from Python it only shows the 5 rows that were inserted before this copying. I used a script to delete rows with IDs between 5000 and 10000 and it works, so the data is there but doesn't come out. Any explanation?
import MySQLdb as mdb

con = mdb.connect('127.0.0.1', 'root', 'password', 'database')
title = []
entry = []
y = 0

with con:
    conn = con.cursor()
    conn.execute("SELECT * FROM mydatabase WHERE id='2' AND myword = 'jungleboy'")
    rows = conn.fetchall()
    for i in rows:
        title.append(i[1])
        entry.append(i[2])
    for x in range(20000):
        conn.execute("INSERT INTO mydatabase(myword,explanation) VALUES (%s,%s)", (str(title[0]), str(entry[0])))
        if x > y + 50:
            print str(x)
            y = x
I'm not sure I understand your question, but here are some tips with the code you have pasted.
After any INSERT or other query that adds, removes or changes data in a table, you need to commit the transaction with con.commit().
Note that fetchall() returns the entire result set at once, which can be heavy for large tables. The cursor's arraysize attribute is the default number of rows that fetchmany() returns per call; you can inspect it like this:
print 'fetchmany() returns {0.arraysize} rows at a time by default.'.format(cur)
To guarantee that you are fetching every row, loop through the results, like this:
q = "SELECT .... " # some query that returns a lot of results
conn.execute(q)
rows = con.fetchone() # or fetchall()
while rows:
print rows
rows = con.fetchone() # again, or fetchall()
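For completeness, a minimal sketch of the fetchmany() variant (the batch size of 1000 is arbitrary; without an argument it defaults to cursor.arraysize):

cur.execute(q)
while True:
    batch = cur.fetchmany(1000)   # fetch up to 1000 rows per call
    if not batch:
        break
    for row in batch:
        print row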
