MySQLdb is extremely slow with large result sets - python

I executed the following query both in phpMyAdmin & MySQLdb (python).
SELECT *, (SELECT CONCAT(`id`, '|', `name`, '|', `image_code`)
FROM `model_artist` WHERE `id` = `artist_id`) as artist_data,
FIND_IN_SET("metallica", `searchable_words`) as find_0
FROM `model_song` HAVING find_0
phpMyAdmin said the query took 2ms.
My Python code reported that the same query through MySQLdb took 848 ms (without even fetching the results).
The Python code:
import time
import MySQLdb

self.db = MySQLdb.connect(host="localhost", user="root", passwd="", db="ibeat")
self.cur = self.db.cursor()

millis = lambda: time.time() * 1000
start_time = millis()
self.cur.execute("""SELECT *, (SELECT CONCAT(`id`, '|', `name`, '|', `image_code`)
    FROM `model_artist` WHERE `id` = `artist_id`) as artist_data,
    FIND_IN_SET("metallica", `searchable_words`) as find_0
    FROM `model_song` HAVING find_0""")
print millis() - start_time

If you expect an SQL query to have a large result set which you then plan to iterate over record-by-record, then you may want to consider using the MySQLdb SSCursor instead of the default cursor. The default cursor stores the result set in the client, whereas the SSCursor stores the result set in the server. Unlike the default cursor, the SSCursor will not incur a large initial delay if all you need to do is iterate over the records one-by-one.
You can find a bit of example code on how to use the SSCursor here.
For example, try:
import MySQLdb.cursors
self.db = MySQLdb.connect(host="localhost", user="root", passwd="", db="ibeat",
                          cursorclass=MySQLdb.cursors.SSCursor)
(The rest of the code can remain the same.)
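With that connection in place, a minimal sketch of streaming the result row by row might look like this (handle_row is a hypothetical placeholder; the query is the one from the question):
cur = self.db.cursor()   # an SSCursor, because of the cursorclass above
cur.execute("""SELECT *, (SELECT CONCAT(`id`, '|', `name`, '|', `image_code`)
    FROM `model_artist` WHERE `id` = `artist_id`) as artist_data,
    FIND_IN_SET("metallica", `searchable_words`) as find_0
    FROM `model_song` HAVING find_0""")
for row in cur:      # rows are streamed from the server, not buffered client-side
    handle_row(row)  # hypothetical per-row handler
cur.close()          # drain/close the cursor before running another query on this connection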

phpMyAdmin places a limit on every query so that the interface doesn't return large result sets. So if your query normally returns 1,000,000 rows and phpMyAdmin trims that down to 1,000 (or whatever the default is), you should expect much longer processing times when Python queries, let alone fetches, the entire result set.
Try placing a LIMIT in the Python query that matches phpMyAdmin's limit to compare times, as in the sketch below.
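For example (a sketch reusing the query from the question; substitute whatever row limit your phpMyAdmin installation actually applies):
self.cur.execute("""SELECT *, (SELECT CONCAT(`id`, '|', `name`, '|', `image_code`)
    FROM `model_artist` WHERE `id` = `artist_id`) as artist_data,
    FIND_IN_SET("metallica", `searchable_words`) as find_0
    FROM `model_song` HAVING find_0
    LIMIT 1000""")  # match phpMyAdmin's configured row limit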

Related

Ending a SELECT transaction psycopg2 and postgres

I am executing a number of SELECT queries on a Postgres database using psycopg2, but I am getting ERROR: Out of shared memory. It suggests that I should increase max_locks_per_transaction, but this confuses me because each SELECT query operates on only one table, and max_locks_per_transaction is already set to 512, eight times the default.
I am using TimescaleDB, which could be the cause of a larger than normal number of locks (one for each chunk rather than one for each table, maybe), but that still can't explain running out when so many are allowed. I'm assuming what is happening here is that all the queries are being run as part of one transaction.
The code I am using looks something as follows.
db = DatabaseConnector(**connection_params)
tables = db.get_table_list()
for table in tables:
    result = db.query(f"""SELECT a, b, COUNT(c) FROM {table} GROUP BY a, b""")
    print(result)
Where db.query is defined as:
def query(self, sql):
    with self._connection.cursor() as cur:
        cur.execute(sql)
        return_value = cur.fetchall()
    return return_value
and self._connection is:
self._connection = psycopg2.connect(**connection_params)
Do I need to explicitly end the transaction in some way to free up the locks? And how can I go about doing this in psycopg2? I would have assumed there was an implicit end to the transaction when the cursor is closed on __exit__. I know that if I were inserting or deleting rows I would use COMMIT at the end, but it seems strange to use it when I am not changing the table.
UPDATE: When I explicitly open and close the connection in the loop, the error does not show. However, I assume there is a better way to end the transaction after each SELECT than this.
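(Not part of the original post, but a minimal sketch of what that "better way" could look like, using the query method shown above: committing after each SELECT ends the transaction that psycopg2 opened implicitly on the first execute, which releases its locks without reconnecting.)
def query(self, sql):
    with self._connection.cursor() as cur:
        cur.execute(sql)
        return_value = cur.fetchall()
    self._connection.commit()  # end the implicit transaction so its locks are released
    return return_value
Alternatively, setting self._connection.autocommit = True right after connecting makes each SELECT run in its own transaction, so locks never accumulate across the loop.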

Load existing sqlite db into memory/ execute and close

I have a function in Python which connects to an SQLite DB with 20k rows and just executes a simple select query, as below:
import sqlite3

def viewdata(mul):
    conn = sqlite3.connect("mynew.db")
    cursor = conn.cursor()
    cursor.execute("SELECT ad, abd, acd, ard FROM allrds WHERE mul <= ? ORDER BY mul DESC LIMIT 1", (mul,))
    data = [i for i in cursor.fetchall()]
    conn.close()
    return data
It's kind of slow, so I want to move this into an in-memory SQLite database. How can I copy this existing DB into an in-memory DB, make a connection, fetch the data, and close it once the operations are over? Is there anything different I need to do when connecting to in-memory databases? Are SELECT queries executed the same way as for an on-disk DB? Can someone please give me an example function?
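(Not from the original post, but a minimal sketch of one approach, assuming Python 3.7+ for sqlite3.Connection.backup: copy the on-disk file into a :memory: connection once, then run the same SELECTs against that connection. Queries work exactly the same; the in-memory copy disappears when its connection is closed, so keep it open for as long as you need it.)
import sqlite3

def load_into_memory(path="mynew.db"):
    # Copy the on-disk database into a fresh in-memory database (Python 3.7+).
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)
    disk.close()
    return mem

mem_conn = load_into_memory()

def viewdata(mul):
    # Same query as before; SELECTs run identically against the in-memory copy.
    cursor = mem_conn.cursor()
    cursor.execute("SELECT ad, abd, acd, ard FROM allrds WHERE mul <= ? ORDER BY mul DESC LIMIT 1", (mul,))
    return cursor.fetchall()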

Why is 'executemany' so slow compared to just doing an 'IN' query?

My MySQL table schema is:
CREATE DATABASE test_db;
USE test_db;
CREATE TABLE test_table (
    id INT AUTO_INCREMENT,
    last_modified DATETIME NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
When I run the following benchmark script, I get:
b1: 20.5559301376
b2: 0.504406929016
from timeit import timeit
import MySQLdb

ids = range(1000)

query_1 = "update test_table set last_modified=UTC_TIMESTAMP() where id=%(id)s"
query_2 = "update test_table set last_modified=UTC_TIMESTAMP() where id in (%s)" % ", ".join(('%s',) * len(ids))

db = MySQLdb.connect(host="localhost", user="some_user", passwd="some_pwd", db="test_db")

def b1():
    # one statement per id via executemany (the named placeholder needs mappings)
    curs = db.cursor()
    curs.executemany(query_1, [{"id": i} for i in ids])

def b2():
    # a single UPDATE ... WHERE id IN (...) statement
    curs = db.cursor()
    curs.execute(query_2, ids)

print "b1: %s" % str(timeit(lambda: b1(), number=30))
print "b2: %s" % str(timeit(lambda: b2(), number=30))

db.close()
Why is there such a large difference between executemany and the IN clause?
I'm using Python 2.6.6 and MySQL-python 1.2.3.
The only relevant question I could find was - Why is executemany slow in Python MySQLdb?, but it isn't really what I'm after.
executemany repeatedly goes back and forth to the MySQL server, which then needs to parse the query, perform it, and return results. This is perhaps 10 times as slow as doing everything in a single SQL statement, even if it is more complex.
However, for INSERT, this says that it will do the smart thing and construct a multi-row INSERT for you, thereby being efficient.
Hence, IN(1,2,3,...) is much more efficient than UPDATE;UPDATE;UPDATE...
If you have a sequence of ids, then even better would be to say WHERE id BETWEEN 1 and 1000. This is because it can simply scan the rows rather than looking up each one from scratch. (I am assuming id is indexed, probably as the PRIMARY KEY.)
Also, you are probably running with the settings that make each insert/update/delete into its own "transaction". This adds a lot of overhead to each UPDATE. And it is probably not desirable in this case. I suspect you want the entire 1000-row update to be atomic.
Bottom line: Use executemany only for (a) INSERTs or (b) statements that must be run individually.
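(An aside of mine, not from the answer above:) if you do stay with executemany, wrapping the whole batch in one explicit transaction addresses the per-statement commit overhead mentioned above, assuming autocommit is enabled on the connection:
curs = db.cursor()
try:
    curs.executemany(query_1, [{"id": i} for i in ids])
    db.commit()    # one commit for the whole 1000-row batch instead of one per UPDATE
except MySQLdb.Error:
    db.rollback()
    raise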

How to use server side cursors with psycopg2

I have a table with 4 million rows and I use psycopg2 to execute a:
SELECT * FROM .. WHERE query
I hadn't heard of server-side cursors before, and I am reading that they are good practice when you expect lots of results.
I find the documentation a bit limited and I have some basic questions.
First I declare the server-side cursor as:
cur = conn.cursor('cursor-name')
then I execute the query as:
cur.itersize = 10000
sqlstr = "SELECT clmn1, clmn2 FROM public.table WHERE clmn1 LIKE 'At%'"
cur.execute(sqlstr)
My question is: What do I do now? How do I get the results?
Do I iterate through the rows as:
row = cur.fetchone()
while row:
    row = cur.fetchone()
or do I use fetchmany() and do this:
row = cur.fetchmany(10)
But in the second case how can I "scroll" the results?
Also what is the point of itersize?
Psycopg2 has a nice interface for working with server side cursors. This is a possible template to use:
with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000
        query = "SELECT * FROM ..."
        cursor.execute(query)
        for row in cursor:
            pass  # process row
The code above creates the connection and automatically places the query result into a server side cursor. The value itersize sets the number of rows that the client will pull down at a time from the server side cursor. The value you use should balance number of network calls versus memory usage on the client. For example, if your result count is three million, an itersize value of 2000 (the default value) will result in 1500 network calls. If the memory consumed by 2000 rows is light, increase that number.
When using for row in cursor you are of course working with one row at a time, but Psycopg2 will prefetch itersize rows at a time for you.
If you want to use fetchmany for some reason, you could do something like this:
while True:
    rows = cursor.fetchmany(100)
    if not rows:
        break
    for row in rows:
        pass  # process row
This usage of fetchmany will not trigger a network call to the server for more rows until the prefetched batch has been exhausted. (This is a convoluted example that provides nothing over the code above, but demonstrates how to use fetchmany should there be a need.)
I tend to do something like this when I don't want to load millions of rows at once. You can turn a program into quite a memory hog if you load millions of rows into memory. Especially if you're making python domain objects out of those rows or something like that. I'm not sure if the uuid4 in the name is necessary, but my thought is that I want individual server side cursors that don't overlap if two processes make the same query.
from typing import Iterable
from uuid import uuid4

import psycopg2

def fetch_things() -> Iterable[MyDomainObject]:
    with psycopg2.connect(database_connection_string) as conn:
        with conn.cursor(name=f"my_name_{uuid4()}") as cursor:
            cursor.itersize = 500_000
            query = "SELECT * FROM ..."
            cursor.execute(query)
            for row in cursor:
                yield MyDomainObject(row)
I'm interested if anyone knows if this creates a storage problem on the SQL server or anything like that.
In addition to cur.fetchmany(n), you can use PostgreSQL cursors directly:
cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# ...
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# and so on
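(A note not in the original answer:) the DECLAREd cursor only lives inside the transaction that psycopg2 opens implicitly, so close it and end the transaction once you are done with it:
cur.execute("close foo")
conn.commit()  # ends the transaction the DECLAREd cursor lived in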

Python MySQLdb doesn't wait for the result

I am trying to run some queries that need to create some temporary tables and then return a result set, but I am unable to do that with the MySQLdb API.
I have already dug into this issue (for example here), but without success.
My query is like this:
create temporary table tmp1
select * from table1;
alter table tmp1 add index(somefield);
create temporary table tmp2
select * from table2;
select * from tmp1 inner join tmp2 using(somefield);
This immediately returns an empty result set. If I go to the mysql client and do a SHOW FULL PROCESSLIST I can see my queries executing; they take some minutes to complete.
Why does the cursor return immediately instead of waiting for the query to run?
If I try to run another query I get a "Commands out of sync; you can't run this command now" error.
I have already tried setting my connection to autocommit=True:
db = MySQLdb.connect(host='ip',
                     user='root',
                     passwd='pass',
                     db='mydb',
                     use_unicode=True)
db.autocommit(True)
I have also tried putting every statement in its own cursor.execute() with db.commit() between them, but without success either.
Can you help me figure out what the problem is? I know MySQL doesn't support transactions for some operations like ALTER TABLE, but why doesn't the API wait until everything is finished, like it does with a SELECT?
By the way, I'm trying to do this in an IPython notebook.
I suspect that you're passing your multi-statement SQL string directly to the cursor.execute function. The thing is, each of the statements is a query in its own right so it's unclear what the result set should contain.
Here's an example to show what I mean. The first case passes a semicolon-joined set of statements to execute, which is what I presume you have currently.
def query_single_sql(cursor):
    print 'query_single_sql'
    sql = []
    sql.append("""CREATE TEMPORARY TABLE tmp1 (id int)""")
    sql.append("""INSERT INTO tmp1 VALUES (1)""")
    sql.append("""SELECT * from tmp1""")
    cursor.execute(';'.join(sql))
    print list(cursor.fetchall())
Output:
query_single_sql
[]
You can see that nothing is returned, even though there is clearly data in the table and a SELECT is used.
The second case is where each statement is executed as an independent query, and the results printed for each query.
def query_separate_sql(cursor):
    print 'query_separate_sql'
    sql = []
    sql.append("""CREATE TEMPORARY TABLE tmp3 (id int)""")
    sql.append("""INSERT INTO tmp3 VALUES (1)""")
    sql.append("""SELECT * from tmp3""")
    for query in sql:
        cursor.execute(query)
        print list(cursor.fetchall())
Output:
query_separate_sql
[]
[]
[(1L,)]
As you can see, we consumed the results of the cursor for each query and the final query has the results we expect.
I suspect that even though you've issued multiple queries, the API only has a handle to the first query executed and so immediately returns when the CREATE TABLE is done. I'd suggest serializing your queries as described in the second example above.
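Applied to the temporary-table workflow from the question, that might look like this (a sketch; the table and column names come from the question):
statements = [
    "CREATE TEMPORARY TABLE tmp1 SELECT * FROM table1",
    "ALTER TABLE tmp1 ADD INDEX (somefield)",
    "CREATE TEMPORARY TABLE tmp2 SELECT * FROM table2",
]
cur = db.cursor()
for stmt in statements:
    cur.execute(stmt)           # these produce no result set to fetch
cur.execute("SELECT * FROM tmp1 INNER JOIN tmp2 USING (somefield)")
rows = cur.fetchall()           # only the final SELECT returns rows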
