I have one huge table that I would like to make smaller. It has ~230 Million rows.
Both columns are indexed. The structure is:
+--------------+------------+
| id_my_value | id_ref |
+--------------+------------+
| YYYY | XXXX |
+--------------+------------+
I need to remove the rows that have a particular "id_ref" value. I have tried the following:
sql = "SELECT id_ref FROM REFS"
cursor.execute(sql)
refs = cursor.fetchall()

limit = 1000
for current in refs:
    id_ref = current["id_ref"]
    sql = f"DELETE FROM MY_VALUES WHERE id_ref = {id_ref} LIMIT {limit}"
    while True:
        cursor.execute(sql)
        mydb.commit()
        if cursor.rowcount == 0:
            break
Regardless of the value I set for "limit", the query is tremendously slow:
DELETE FROM MY_VALUES WHERE id_ref = XXXX LIMIT 10;
I have also tried the other way around: select the id_my_value entries associated with a particular id_ref, then delete by primary key:
SELECT id_my_value FROM MY_VALUES WHERE id_ref = XXXX LIMIT 10
DELETE FROM MY_VALUES WHERE id_my_value = YYYY
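In Python, that second approach would look something like this (a sketch; cursor, mydb and the id_ref value are assumed from the code above):
# Sketch: delete in batches by primary key; id_ref as in the loop above.
while True:
    cursor.execute(
        "SELECT id_my_value FROM MY_VALUES WHERE id_ref = %s LIMIT 1000", (id_ref,))
    ids = [row["id_my_value"] for row in cursor.fetchall()]
    if not ids:
        break
    placeholders = ", ".join(["%s"] * len(ids))
    cursor.execute(
        f"DELETE FROM MY_VALUES WHERE id_my_value IN ({placeholders})", ids)
    mydb.commit()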
Here is my EXPLAIN.
EXPLAIN DELETE FROM MY_VALUES WHERE id_ref = YYYY LIMIT 1000;
+----+-------------+-----------+------------+-------+---------------+--------+---------+-------+----------+----------+-------------+
| id | select_type | table     | partitions | type  | possible_keys | key    | key_len | ref   | rows     | filtered | Extra       |
+----+-------------+-----------+------------+-------+---------------+--------+---------+-------+----------+----------+-------------+
|  1 | DELETE      | MY_VALUES | NULL       | range | id_ref        | id_ref | 5       | const | 20647922 | 100.00   | Using where |
+----+-------------+-----------+------------+-------+---------------+--------+---------+-------+----------+----------+-------------+
It does use the right INDEX.
I would have no problem with this operation running for days on the server.
What is the right way to approach this "cleaning"?
EDIT
Here is the output from SHOW CREATE TABLE MY_VALUES
MY_VALUES | CREATE TABLE `MY_VALUES` (
`id_my_value` int NOT NULL AUTO_INCREMENT,
`id_document` int NOT NULL,
`id_ref` int DEFAULT NULL,
`value` mediumtext CHARACTER SET utf8 COLLATE utf8_spanish_ci,
`weigth` int DEFAULT NULL,
`id_analysis` int DEFAULT NULL,
`url` text CHARACTER SET utf8 COLLATE utf8_spanish_ci,
`domain` varchar(64) CHARACTER SET utf8 COLLATE utf8_spanish_ci DEFAULT NULL,
`filetype` varchar(16) CHARACTER SET utf8 COLLATE utf8_spanish_ci DEFAULT NULL,
`id_domain` int DEFAULT NULL,
`id_city` int DEFAULT NULL,
`city_name` varchar(32) CHARACTER SET utf8 COLLATE utf8_spanish_ci DEFAULT NULL,
`is_hidden` tinyint NOT NULL DEFAULT '0',
`id_company` int DEFAULT NULL,
`is_hidden_by_user` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id_my_value`),
KEY `id_ref` (`id_ref`),
KEY `id_document` (`id_document`),
KEY `id_analysis` (`id_analysis`),
KEY `weigth` (`weigth`),
KEY `id_domain` (`id_domain`),
KEY `id_city` (`id_city`),
KEY `id_company` (`id_company`),
KEY `value` (`value`(15))
UPDATE
I just tried to remove a single row:
DELETE FROM MY_VALUES WHERE id_my_value = 8
That operation takes "forever". To prevent a timeout, I followed this SO question, so I have set:
show variables like 'innodb_lock_wait_timeout';
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| innodb_lock_wait_timeout | 100000 |
+--------------------------+--------+
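For reference, that variable can be raised with something like:
SET GLOBAL innodb_lock_wait_timeout = 100000;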
Then I tried deleting by primary-key ranges instead:
a = 0
limit = 1000
while True:
    b = a + limit
    # note: this stops at the first empty 1000-id range,
    # even if rows with higher ids remain
    sql = f"DELETE FROM MY_VALUES WHERE id_my_value > {a} AND id_my_value <= {b}"
    cursor.execute(sql)
    mydb.commit()
    if cursor.rowcount == 0:
        break
    a = b
First thing to try. Put this right after your second cursor.execute().
cnx.commit()
In Connector/Python, autocommit is turned off by default. If you don't commit, your MySQL server buffers up all your changes (DELETEs, in your case) so it can roll them back if you choose to, or if your program crashes.
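For example (a sketch; connection parameters are placeholders):
import mysql.connector

# Sketch: either commit per batch, or enable autocommit so each
# statement is committed as soon as it runs.
cnx = mysql.connector.connect(user="app", password="...", database="mydb")
cnx.autocommit = True  # alternative to calling cnx.commit() after each batch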
I guess your slow query is
DELETE FROM `VALUES` WHERE id_ref=constant LIMIT 1000;
Try doing this. EXPLAIN shows you the query plan.
EXPLAIN DELETE FROM `VALUES` WHERE id_ref=constant LIMIT 1000;
It should employ the index on your id_ref column. It's possible your index isn't selective enough, so the query planner falls back to a table scan. In that case you might consider raising the LIMIT so each query does more work per run.
You could try this. If my guess about the table scan is correct, it might help.
DELETE FROM `VALUES` FORCE INDEX (your_index_on_id_ref) WHERE id_ref=constant LIMIT 1000;
(Usually FORCE INDEX is a terrible idea. But this might be the exception.)
You could also try this: create a cleaned up temporary table, then rename tables to put it into service.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
CREATE TABLE purged_values AS
SELECT *
FROM `VALUES`
WHERE id_ref NOT IN (SELECT id_ref FROM `REFS`);
This will take a while. Run it at zero-dark-thirty. The transaction isolation level helps prevent contention with other sessions using the table while this is in progress.
Then you'll have a new, purged, table. You can index it, then do these renames to put it into service.
ALTER TABLE `VALUES` RENAME TO old_values;
ALTER TABLE purged_values RENAME TO `VALUES`;
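MySQL can also do both renames atomically in a single statement, so there is no window where `VALUES` doesn't exist:
RENAME TABLE `VALUES` TO old_values, purged_values TO `VALUES`;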
Finally I did a bit more experimentation and found a way.
First step
The Python loop deleting the entries from the DB had been running for ~12 h. I added a couple of lines to measure the execution time:
start = time.time()
cursor.execute(sql)
mydb.commit()
end = time.time()
Here is a sample of the first measurements:
1 > 900 > 0.4072246551513672
2 > 900 > 1.7270898818969727
3 > 900 > 1.8365845680236816
4 > 900 > 1.124634027481079
5 > 900 > 1.8552422523498535
6 > 900 > 13.80513596534729
7 > 900 > 8.379877090454102
8 > 900 > 10.675175428390503
9 > 900 > 6.14388370513916
10 > 900 > 11.806004762649536
11 > 900 > 12.884040117263794
12 > 900 > 23.604055881500244
13 > 900 > 19.162535905838013
14 > 900 > 24.980825662612915
....
It settled at an average of ~30 s per execution after 900 iterations.
At that rate, this implementation would have taken ~80 days to remove all the rows.
Final solution
I created a temporary table with the appropriate values, indexes, etc.:
CREATE TABLE ZZ_MY_VALUES AS
SELECT * FROM MY_VALUES WHERE MY_VALUES.id_ref IN
(
    SELECT id_ref FROM MY_REFS WHERE id_ref = 3 OR id_ref = 4 OR id_ref = 5
)
It took ~3h and went from 230M rows to 21M rows.
A bit quicker than the original estimation of 3 months. :)
Thanks all for your tips.
Related
So I know how to use %s, but it doesn't appear to work for a column name.
My aim here is to take a column name (a roll number here) and use it to find information (how many days they attended):
roll=input("enter roll no.: ")
c1.execute("select sum(%s) from attendance", ("" + roll + "",))
a=c1.fetchall()
the table looks like:
date | 11b1 | 11b2 | 11b3 |......| 11b45 |
2020-12-01 | 1 | 0 | 1 |......| 1 |
2020-12-02 | 1 | 1 | 1 |......| 0 |
2020-12-03 | 0 | 1 | 1 |......| 1 |
This doesn't work and seems to give me a random value.
So how do I write that middle piece of code? Also, why does the original code not raise errors, yet still return an arbitrary-seeming number?
I will assume you mean the column names.
With an f-string (Python 3.6+):
roll = input("enter roll no.: ")
c1.execute(f"select sum(`{roll}`) from attendance")
a = c1.fetchall()
The names of MySQL schema objects - tables, columns etc - can be interpolated using string formatting, by surrounding the placeholder with backticks ('`', ASCII 0x60). See MySQL Identifiers.
Using backticks prevents errors if the column name contains a space, or matches a keyword or reserved word.
However backticks do not protect against SQL injection. As the programmer, it's your responsibility to make sure that any column names coming from outside your program (for example, from user input) are verified as matching the column names in the table.
colname = 'roll'
sql = f"""SELECT sum(`{colname}`) FROM attendance"""
mycursor.execute(sql)
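One way to do that verification (a sketch; assumes the attendance table from the question):
# Sketch: whitelist the user-supplied name against the table's real columns.
mycursor.execute("SHOW COLUMNS FROM attendance")
valid_columns = {row[0] for row in mycursor.fetchall()}

roll = input("enter roll no.: ")
if roll not in valid_columns:
    raise ValueError(f"unknown column: {roll}")

sql = f"SELECT sum(`{roll}`) FROM attendance"
mycursor.execute(sql)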
For values in INSERT or UPDATE statements, or WHERE clauses, DB-API parameter substitution should always be used.
colname = 'roll'
colvalue = 42
sql = f"""SELECT sum(`{colname}`) FROM attendance WHERE colvalue = %s"""
mycursor.execute(sql, (colvalue,))
Our IDs look something like "CS0000001", which stands for customer with ID 1. Is this possible to do with SQL and AUTO_INCREMENT, or do I need to do it in my GUI?
I need the leading zeroes, but also auto-increment to prevent duplicates; otherwise I'd have to construct the IDs in Python and insert them into the DB.
Is that possible?
You have a few choices:
Construct the CustomerID in the code that inserts the data into the Customer table (application side, requires a change in your code)
Create a view on top of the Customer table that contains the logic, and use that when you need the CustomerID (database side, requires a change in your code); see the sketch below
Use a procedure to do the inserts and construct the CustomerID in the procedure (database side, requires a change in your code)
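For example, the view in the second option might look like this (a sketch; assumes a Customer table with an AUTO_INCREMENT column id):
CREATE VIEW customer_v AS
SELECT CONCAT('CS', LPAD(c.id, 7, '0')) AS CustomerID, c.*
FROM Customer c;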
A possible realization:
Create data table
CREATE TABLE data (id CHAR(9) NOT NULL DEFAULT '',
val TEXT,
PRIMARY KEY (id));
Create service table
CREATE TABLE ids (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY);
Create trigger which generates id value
CREATE TRIGGER tr_bi_data
BEFORE INSERT
ON data
FOR EACH ROW
BEGIN
INSERT INTO ids () VALUES ();
SET NEW.id = CONCAT('CS', LPAD(LAST_INSERT_ID(), 7, '0'));
DELETE FROM ids;
END
Create trigger which prohibits id value change
CREATE TRIGGER tr_bu_data
BEFORE UPDATE
ON data
FOR EACH ROW
BEGIN
SET NEW.id = OLD.id;
END
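Note: if you create these triggers from the mysql command-line client, wrap each CREATE TRIGGER in a DELIMITER switch so the semicolons inside the body don't end the statement early:
DELIMITER $$
CREATE TRIGGER tr_bi_data
BEFORE INSERT
ON data
FOR EACH ROW
BEGIN
INSERT INTO ids () VALUES ();
SET NEW.id = CONCAT('CS', LPAD(LAST_INSERT_ID(), 7, '0'));
DELETE FROM ids;
END$$
DELIMITER ;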
Insert some data, check result
INSERT INTO data (val) VALUES ('data-1'), ('data-2');
SELECT * FROM data;
id | val
:-------- | :-----
CS0000001 | data-1
CS0000002 | data-2
Try to update, ensure id change prohibited
UPDATE data SET id = 'CS0000100' WHERE val = 'data-1';
SELECT * FROM data;
id | val
:-------- | :-----
CS0000001 | data-1
CS0000002 | data-2
Insert one more data, ensure enumeration continues
INSERT INTO data (val) VALUES ('data-3'), ('data-4');
SELECT * FROM data;
id | val
:-------- | :-----
CS0000001 | data-1
CS0000002 | data-2
CS0000003 | data-3
CS0000004 | data-4
Check service table is successfully cleared
SELECT COUNT(*) FROM ids;
| COUNT(*) |
| -------: |
| 0 |
Disadvantages:
Additional table needed.
Generated id values cannot be edited (to change one you must copy the row and delete the old one; a custom value cannot be set).
I'm running something like this:
cursor.execute(
'''
UPDATE my_table
SET name=%s, about=%s, title=%s
WHERE id=%s
''',
(name, about, title, id_)
)
Which is guaranteed to only update one row, since it's doing the update based on the id primary key.
However, most of the time only one of the fields actually changes, i.e. about and title are "updated" to the values they already had, and only name has actually changed.
How can I get which of the columns actually changed? This is needed to log every individual change.
You can select the values before the update and, using RETURNING *, compare the old and new values in the final query, like here:
t=# create table m1 (i int, e int);
CREATE TABLE
Time: 1.855 ms
t=# insert into m1 select 1,2;
INSERT 0 1
Time: 1.037 ms
t=# begin;
BEGIN
t=# with o as (select * from m1 where i=1)
,u as (update m1 set e=3 where i=1 returning *)
select * from o
join u on o.i = u.i
;
i | e | i | e
---+---+---+---
1 | 2 | 1 | 3
(1 row)
So you can put logic against u.e <> o.e or similar.
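To list only the columns that actually changed (NULL-safe), the final query can compare with IS DISTINCT FROM, e.g.:
with o as (select * from m1 where i=1)
,u as (update m1 set e=3 where i=1 returning *)
select o.i, o.e as old_e, u.e as new_e
from o join u on o.i = u.i
where o.e is distinct from u.e;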
I'm using mysql connector 1.0.9 and Python 3.2.
This query fails due to a syntax error (mysql.connector throws ProgrammingError; the specific MySQL error is just: there is a syntax error to the right of "%(IP)s AND DATE_SUB(NOW(), INTERVAL 1 HOUR) < accessed"):
SELECT COUNT(*) FROM bad_ip_logins WHERE IP = %(IP)s AND DATE_SUB(NOW(), INTERVAL 1 HOUR) < accessed
But if I quote the variable IP, it works:
SELECT COUNT(*) FROM bad_ip_logins WHERE IP = '%(IP)s' AND DATE_SUB(NOW(), INTERVAL 1 HOUR) < accessed
In context:
IP = 1249764151 # IP converted to an int
conn = mysql.connector.connect(db_params)
curs = conn.cursor()
query = "SELECT COUNT(*) FROM bad_ip_logins WHERE IP = %(IP)s AND DATE_SUB(NOW(), INTERVAL 1 HOUR) < accessed"
params = {'IP', IP}
curs.execute(query, params)
My understanding is that you never have to quote variables for a prepared statement (and this is true for every other query in my code, even ones that access the IP variable on this table). Why do I need to quote it in this single instance, and nowhere else?
If this isn't doing a prepared statement I'd be interested in hearing about that as well. I wasn't able to inject anything with this - was it just quoting it in such a way as to prevent that?
If it matters, this is the table description:
+----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| IP | int(10) unsigned | YES | | NULL | |
| user_id | int(11) | YES | | NULL | |
| accessed | datetime | YES | | NULL | |
+----------+------------------+------+-----+---------+-------+
Do not use string interpolation. Leave the SQL parameter to the database adapter:
cursor.execute('''\
SELECT COUNT(*) FROM bad_ip_logins WHERE IP = %s AND DATE_SUB(NOW(), INTERVAL 1 HOUR) < accessed''', (ip,))
Here, we pass the parameter ip in to the execute() call as a separate parameter (in a tuple, to make it a sequence), and the database adapter will take care of proper quoting, filling in the %s placeholder.
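Also note that in the code from the question, params = {'IP', IP} builds a set, not a dict; named placeholders like %(IP)s require a mapping:
params = {'IP': IP}  # dict, not set
curs.execute(query, params)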
I'm working with python2.6 and MySQLdb. I have a table with this data
+----+--------+
| id | status |
+----+--------+
| 1 | A |
| 2 | B |
| 3 | B |
+----+--------+
I want to do a MySQL update like this example:
UPDATE my_table SET status = "A" where id in (1,2,3,10001);
Query OK, 2 rows affected (0.03 sec)
Rows matched: 3 Changed: 2 Warnings: 0
And I need to know if all the ids in the update exist in the database. My idea was to compare the number of items I tried to update vs. the number of matched rows. In the example the numbers are 4 vs. 3.
The problem is that I don't know how to get the "Rows matched" count from the cursor. I only see this information in cursor._info = 'Rows matched: 3 Changed: 2 Warnings: 0'.
And cursor.rowcount is the number of changed rows, so =(
Thanks!
If cursor._info contains that string, then you can just extract the 3 with a regex: re.search(r'Rows matched: (\d+)', cursor._info).group(1)
Alternatively, if you are using InnoDB tables (which support transactions), you can execute two queries: first just SELECT id FROM my_table WHERE id in (1,2,3,10001) and then get cursor.rowcount, which will return the number of matching rows. Then execute your update. All queries run on the same cursor are part of the same transaction, so you are guaranteed that no other process will write to the database between the queries.
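A sketch of that two-step approach (Python 2 / MySQLdb style, hypothetical ids):
ids = (1, 2, 3, 10001)
placeholders = ", ".join(["%s"] * len(ids))
cursor.execute("SELECT id FROM my_table WHERE id IN (%s)" % placeholders, ids)
matched = cursor.rowcount  # how many of the ids actually exist
cursor.execute("UPDATE my_table SET status = 'A' WHERE id IN (%s)" % placeholders, ids)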
Sources: see http://zetcode.com/databases/mysqlpythontutorial/
The FOUND_ROWS option makes cursor.rowcount return the number of matched rows instead:
db_connection = MySQLdb.connect(
host = settings['dbHost'],
user = settings['dbUser'],
passwd = settings['dbPass'],
db = settings['dbName'],
client_flag = MySQLdb.constants.CLIENT.FOUND_ROWS
)
Docs:
http://mysql-python.sourceforge.net/MySQLdb-1.2.2/public/MySQLdb.constants.CLIENT-module.html
http://dev.mysql.com/doc/refman/5.6/en/mysql-real-connect.html
(There's a typo in the MySQLdb docs. "client_flags" should be "client_flag")
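With the flag set, rowcount after the UPDATE from the question reports matched rows:
cursor = db_connection.cursor()
cursor.execute("UPDATE my_table SET status = 'A' WHERE id IN (1, 2, 3, 10001)")
print cursor.rowcount  # 3 -- rows matched, not just the 2 actually changed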