purging a huge data mysql table using python

purging a huge data mysql table using python - python

I have a 1000M data table where i need to have a automated script just keeping last 7 days and delete the before days. I want to do it using python and chunks concept. Want to delete chunk wise.
do we have any library with this chunk concept related to mysql on python?
If no, can anyone suggest me a best method of how to use chunk or apply this with mysql

I'm unaware of a Python package that has an API for "chunking" deletes from a MySQL table. SqlAlchemy provides a fluent interface that can do this but it's not much different than the SQL. I suggest using PyMySql.
import datetime
import pymysql.cursors
connection = pymysql.connect(
host='host',
user='user',
password='password',
database='database'
)
seven_days_before_now = datetime.datetime.now() - datetime.timedelta(days=7)
chunksize = 1000
with connection.cursor() as cursor:
sql = 'DELETE FROM `mytable` WHERE `timestamp` < %s ORDER BY `id` LIMIT %s;'
num_deleted = None
while num_deleted != 0:
num_deleted = cursor.execute(sql, (seven_days_before_now, chunksize))
connection.commit()
The LIMIT just limits the number of deleted rows to the chunksize. The ORDER BY ensures that the DELETE is deterministic and it sorts by the primary key because the primary key is guaranteed to be indexed; so even though it sorts for each chunk, at least it's sorting on an indexed column. Remove the ORDER BY if deterministic behavior is unnecessary, it will result in faster execution time.
You'll need to replace the connection details, table name, column name and chunksize. Also, this solution assumes that the table has a column named id which is the primary key and an auto-incrementing integer. You'll need to make some changes if your schema differs.
As Bernd Buffen commented: the correct way to get the behavior you desire is to partition the table. Please consider a migration to do so.
And, of course: stop using Python 2, it's been unsupported for almost two years as of the first version of this answer.

Related

Sql Select statement Optimization

I have made an test table in sql with the following information schema as shown:
Now I extract this information using the python script the code of which is as shown:
import pandas as pd
import mysql.connector
db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef")
pointer = db.cursor()
pointer.execute("use holdings")
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
pointer.execute(x)
rows = pointer.fetchall()
rows = pd.DataFrame(rows)
stock = rows[1]
The production table contains 200 unique trading symbols and has the schema similar to the test table.
My doubt is that for the following statement:
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
I will have to replace value of tradingsymbols 200 times which is ineffective.
Is there an effective way to do this?

If I understand you correctly, your problem is that you want to avoid sending multiple queries for each trading symbol, correct? In this case the following MySQL IN might be of help. You could then simply send one query to the database containing all tradingsymbols you want. If you want to do different things with the various trading symbols, you could select the subsets within pandas.
Another performance improvement could be pandas.read_sql since this speeds up the creation of the dataframe somewhat
Two more things to add for efficiency:
Ensure that tradingsymbols is indexed in MySQL for faster lookup processes
Make tradingsymbols an ENUM to ensure that no typos or alike are accepted. Otherwise the above-mentioned "IN" method also does not work since it does a full-text comparison.

Optimizing an Update statement with many records in SQLAlchemy

I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
def update_records(table_name, column, pairs):
table = Table(table_name, db.MetaData, autoload=True,
autoload_with=db.engine)
conn = db.engine.connect()
values = []
for id, value in pairs:
values.append({'row_id': id, 'match_value': str(value)})
stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
conn.execute(stmt, values)

Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming temp table already created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update().values({column: temp_table.c.match_value})
.where(table.c.id == temp_table.c.row_id))
Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table, but SQLAlchemy does not support that natively, so you'll need to use a custom compiled construct for that (e.g. this gist).

According to document fast-execution-helpers, batch update statements can be issued as one statement. In my experiments, this trick reduce update or deletion time from 30 mins to 1 mins.
engine = create_engine(
"postgresql+psycopg2://scott:tiger#host/dbname",
executemany_mode='values_plus_batch',
executemany_values_page_size=5000, executemany_batch_page_size=5000)

Bulk upsert (insert-update) a csv in postgres [duplicate]

A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE UPDATE and the standard supports as part of the MERGE operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".

9.5 and newer:
PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.
Comparison with ON DUPLICATE KEY UPDATE.
Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.
This article discusses the problem in useful detail.
In general you must choose between two options:
Individual insert/update operations in a retry loop; or
Locking the table and doing batch merge
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to rollback or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.
Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
BEGIN;
UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;
-- Remember, this is WRONG. Do NOT COPY IT.
INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.
This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, you typically follow the following process:
CREATE a TEMPORARY table
COPY or bulk-insert the new data into the temp table
LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.
Do an UPDATE ... FROM of existing records using the values in the temp table;
Do an INSERT of rows that don't already exist in the target table;
COMMIT, releasing the lock.
For example, for the example given in the question, using multi-valued INSERT to populate the temp table:
BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;
INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;
COMMIT;
Related reading
UPSERT wiki page
UPSERTisms in Postgres
Insert, on duplicate update in PostgreSQL?
http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
Upsert with a transaction
Is SELECT or INSERT in a function prone to race conditions?
SQL MERGE on the PostgreSQL wiki
Most idiomatic way to implement UPSERT in Postgresql nowadays
What about MERGE?
SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.
It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.
Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE in MySQL
MERGE from MS SQL Server (but see above about MERGE problems)
MERGE from Oracle (but see above about MERGE problems)

Here are some examples for insert ... on conflict ... (pg 9.5+) :
Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict do nothing;`
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict(id)
do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict on constraint dummy_pkey
do update set name = 'new_name', size = 4;

I am trying to contribute with another solution for the single insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform first the insertion, and in case the record is already present, to update it:
do $$
begin
insert into testtable(id, somedata) values(2,'Joe');
exception when unique_violation then
update testtable set somedata = 'Joe' where id = 2;
end $$;
Note that this solution can be applied only if there are no deletions of rows of the table.
I do not know about the efficiency of this solution, but it seems to me reasonable enough.

SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many different SQL approaches for Postgres versions (not only non-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these, if possible. Not only because of convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert
stmt = insert(my_table).values(user_email='a#b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
index_elements=[my_table.c.user_email],
index_where=my_table.c.user_email.like('%#gmail.com'),
set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert

MERGE in PostgreSQL v. 15
Since PostgreSQL v. 15, is possible to use MERGE command. It actually has been presented as the first of the main improvements of this new version.
It uses a WHEN MATCHED / WHEN NOT MATCHED conditional in order to choose the behaviour when there is an existing row with same criteria.
It is even better than standard UPSERT, as the new feature gives full control to INSERT, UPDATE or DELETE rows in bulk.
MERGE INTO customer_account ca
USING recent_transactions t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
UPDATE SET balance = balance + transaction_value
WHEN NOT MATCHED THEN
INSERT (customer_id, balance)
VALUES (t.customer_id, t.transaction_value)

WITH UPD AS (UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2
RETURNING ID),
INS AS (SELECT '2', 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD))
INSERT INTO TEST_TABLE(ID, SOME_DATA) SELECT * FROM INS
Tested on Postgresql 9.3

Since this question was closed, I'm posting here for how you do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
import itertools as it
from functools import partial
from operator import itemgetter
from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts
Now a couple helper functions
def chunk(content, chunksize=None):
"""Groups data into chunks each with (at most) `chunksize` items.
https://stackoverflow.com/a/22919323/408556
"""
if chunksize:
i = iter(content)
generator = (list(it.islice(i, chunksize)) for _ in it.count())
else:
generator = iter([content])
return it.takewhile(bool, generator)
def gen_resources(records):
"""Yields a dictionary if the record's id already exists, a row object
otherwise.
"""
ids = {item[0] for item in session.query(Posts.id)}
for record in records:
is_row = hasattr(record, 'to_dict')
if is_row and record.id in ids:
# It's a row but the id already exists, so we need to convert it
# to a dict that updates the existing record. Since it is duplicate,
# also yield True
yield record.to_dict(), True
elif is_row:
# It's a row and the id doesn't exist, so no conversion needed.
# Since it's not a duplicate, also yield False
yield record, False
elif record['id'] in ids:
# It's a dict and the id already exists, so no conversion needed.
# Since it is duplicate, also yield True
yield record, True
else:
# It's a dict and the id doesn't exist, so we need to convert it.
# Since it's not a duplicate, also yield False
yield Posts(**record), False
And finally the upsert function
def upsert(data, chunksize=None):
for records in chunk(data, chunksize):
resources = gen_resources(records)
sorted_resources = sorted(resources, key=itemgetter(1))
for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
items = [g[0] for g in group]
if dupe:
_upsert = partial(session.bulk_update_mappings, Posts)
else:
_upsert = session.add_all
try:
_upsert(items)
session.commit()
except IntegrityError:
# A record was added or deleted after we checked, so retry
#
# modify accordingly by adding additional exceptions, e.g.,
# except (IntegrityError, ValidationError, ValueError)
db.session.rollback()
upsert(items)
except Exception as e:
# Some other error occurred so reduce chunksize to isolate the
# offending row(s)
db.session.rollback()
num_items = len(items)
if num_items > 1:
upsert(items, num_items // 2)
else:
print('Error adding record {}'.format(items[0]))
Here's how you use it
>>> data = [
... {'id': 1, 'text': 'updated post1'},
... {'id': 5, 'text': 'updated post5'},
... {'id': 1000, 'text': 'new post1000'}]
...
>>> upsert(data)
The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc on insert (unlike bulk operations).

Memory efficient way of fetching postgresql uniqueue dates?

I have a database with roughly 30 million entries, which is a lot and i don't expect anything but trouble working with larger database entries.
But using py-postgresql and the .prepare() statement i would hope i could fetch entries on a "yield" basis and thus avoiding filling up my memory with only the results from the database, which i aparently can't?
This is what i've got so far:
import postgresql
user = 'test'
passwd = 'test
db = postgresql.open('pq://'+user+':'+passwd+'#192.168.1.1/mydb')
results = db.prepare("SELECT time time FROM mytable")
uniqueue_days = []
with db.xact():
for row in result():
if not row['time'] in uniqueue_days:
uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: i run out of memory, which isn't so strange considering result() probably fetches all results befor looping through them?
Is there a way to get the library postgresql to "page" or batch down the results in say a 60k per round or perhaps even rework the query to do more of the work?
Thanks in advance!
Edit: Should mention the dates in the database is Unix timestamps, and i intend to convert them into %Y-%m-%d format prior to adding them into the uniqueue_days list.

If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing it will take a few more changes. I still recommend it.

You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating unique_dates why not try something like this
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce sort order on unique_dates returned then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Usefull references for functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read data in chunks you could use the dates you get from above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable mytable where time between' +UNIQUE_DATES[i] +'and'+ UNIQUE_DATES[j] ;
Where UNIQUE_DATES[i]& [j] will be parameters you would pass from Python.
I will leave it for you to figure how to convert date into unix timestamps.

Efficient way to run select query for millions of data

I want to run various select query 100 million times and I have aprox. 1 million rows in a table. Therefore, I am looking for the fastest method to run all these select queries.
So far I have tried three different methods, and the results were similar.
The following three methods are, of course, not doing anything useful, but are purely for comparing performance.
first Method:
for i in range (100000000):
cur.execute("select id from testTable where name = 'aaa';")
second method:
cur.execute("""PREPARE selectPlan AS
SELECT id FROM testTable WHERE name = 'aaa' ;""")
for i in range (10000000):
cur.execute("""EXECUTE selectPlan ;""")
third method:
def _data(n):
cur = conn.cursor()
for i in range (n):
yield (i, 'test')
sql = """SELECT id FROM testTable WHERE name = 'aaa' ;"""
cur.executemany(sql, _data(10000000))
And the table is created like this:
cur.execute("""CREATE TABLE testTable ( id int, name varchar(1000) );""")
cur.execute("""CREATE INDEX indx_testTable ON testTable(name)""")
I thought that using the prepared statement functionality would really speed up the queries, but as it seems like this will not happen, I thought you could give me a hint on other ways of doing this.

This sort of benchmark is unlikely to produce any useful data, but the second method should be fastest, as once the statement is prepared it is stored in memory by the database server. Further calls to repeat the query do not require the text of the query to be transmitted, so saving a small about of time.
This is likely to be moot as the query is very small (likely the same quantity of packets over the wire as repeating sending the query text), and the query cache will serve the same data for every request.

What's the purpose of retrieving such amount of data at once? I don't know your situation, but I'd definitely page the results using limit and offset. Take a look at:
7.6. LIMIT and OFFSET

If you just want to benchmark SQL all on it's own and not mix Python into the equation try pgbench.
http://developer.postgresql.org/pgdocs/postgres/pgbench.html
Also what is your goal here?

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

purging a huge data mysql table using python - python

Related

Sql Select statement Optimization

Optimizing an Update statement with many records in SQLAlchemy

Bulk upsert (insert-update) a csv in postgres [duplicate]

Memory efficient way of fetching postgresql uniqueue dates?

Efficient way to run select query for millions of data

Categories

Resources