In my example, I have a list of 3000 words to search for in a database. With this Python code (I am using Django) the operation takes a long time: it finishes in 3.5 minutes, averaging 120 ms per query. Is there a way to speed up the execution of these queries from a Python script, using threads or something like that?
def my_custom_sql(ent):
    for e in ent:
        with connection.cursor() as cursor:
            entity = e.replace("'", "''")
            cursor.execute("SELECT object FROM yagofacts WHERE subject='{}' AND object LIKE '<wordnet_%>';".format(entity))
            row = cursor.fetchall()
            print(row)
Modify your method as shown below. It will reuse the same cursor for all the queries instead of opening a new one per word, and hence will be faster.
with connection.cursor() as cursor:
    for e in ent:
        entity = e.replace("'", "''")
        cursor.execute("SELECT object FROM yagofacts WHERE subject='{}' AND object LIKE '<wordnet_%>';".format(entity))
        row = cursor.fetchall()
        print(row)
If this does not help, try connection pooling. It may help in your case.
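Separately, note that building the SQL with format() after a hand-rolled replace("'", "''") is fragile and open to SQL injection. Here is a minimal sketch of the same loop with a parameterized query (Django's cursor.execute takes %s placeholders regardless of backend, and a literal % in the LIKE pattern must be doubled to %% when parameters are passed):

from django.db import connection

def my_custom_sql(ent):
    with connection.cursor() as cursor:
        for e in ent:
            # The driver quotes the value itself; no manual escaping needed.
            cursor.execute(
                "SELECT object FROM yagofacts "
                "WHERE subject = %s AND object LIKE '<wordnet_%%>';",
                [e],
            )
            print(cursor.fetchall())

As for pooling: Django's built-in persistent connections (the CONN_MAX_AGE database setting) avoid reopening a connection for every request, which is often the easiest first step.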
Now, I am studying the Python sqlite3 module. I think this is a very simple problem, but it is blocking my next step. Could you help me?
The print output looks OK in the VS Code terminal, but the change is not written to the DB file. I have searched several times but I cannot fix it.
If I execute the code, the rows are not sorted in the DB file.
import sqlite3

conn = sqlite3.connect('sqliteDB1.db')
cursor = conn.cursor()

cursor.execute("SELECT * FROM member")
temp123 = cursor.fetchall()
print(temp123)

cursor.execute("SELECT * FROM member ORDER BY -code")
temp321 = cursor.fetchall()
conn.commit
print(temp321)
conn.close()
A SELECT statement just returns data from a database; it will not modify it. Moreover, tables in SQL databases are inherently unordered sets of rows. They have no intrinsic order, and you should never rely on the order in which rows happen to be returned unless you explicitly sort them with an ORDER BY clause.
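To make this concrete, here is a minimal sketch (table and column names taken from the question): the sorted rows exist only in the result set you fetch, so use that result directly; the table on disk keeps its original order.

import sqlite3

conn = sqlite3.connect('sqliteDB1.db')
cursor = conn.cursor()

# ORDER BY sorts this result set only; the member table itself is unchanged.
cursor.execute("SELECT * FROM member ORDER BY code DESC")
sorted_rows = cursor.fetchall()
for row in sorted_rows:
    print(row)

conn.close()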
I am trying to query a large table (10 million rows) while trying to prevent running out of memory, but I am not familiar with Python and I am confused by the differing opinions regarding execute(), the cursor iterator, and fetchone().
Am I right to assume that cursor.execute() does not load all the data into memory, and that only when I call fetchone() will it load one row of data?
from mysql.connector import MySQLConnection

def query():
    conn = MySQLConnection(host=conf['host'],
                           port=conf['port'],
                           user=conf['user'],
                           password=conf['password'],
                           database=conf['database'])
    cursor = conn.cursor(buffered=True)
    cursor.execute('SELECT * FROM TABLE')  # 10 million rows
Does this cursor iterator do the same thing as fetchone()?
for row in cursor:
    print(row)
Is my code snippet safe for handling 10 million rows of data? If not, how can I iterate over the data safely without running out of memory?
My first suggestion is to use from mysql.connector import connect, which by default will use the C extension (CMySQLConnection), instead of from mysql.connector import MySQLConnection (pure Python).
If you for some reason want to use the pure Python version, you can pass use_pure=True to connect().
The second suggestion is to paginate the results. If you use a buffered cursor, it fetches the entire result set from the server, which you probably do not want with 10 million rows; a sketch follows the references below.
Here are some references:
https://dev.mysql.com/doc/refman/8.0/en/limit-optimization.html
https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursorbuffered.html
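As a rough sketch of the pagination idea (the connection parameters, table name, and id column here are placeholders; it assumes an indexed numeric id so keyset pagination can be used instead of ever-deeper OFFSETs):

from mysql.connector import connect

conn = connect(host='localhost', user='user', password='...', database='db')
cursor = conn.cursor()

page_size = 10000
last_id = 0
while True:
    # Keyset pagination: WHERE id > ? stays fast where large OFFSETs get slow.
    cursor.execute(
        'SELECT * FROM my_table WHERE id > %s ORDER BY id LIMIT %s',
        (last_id, page_size),
    )
    rows = cursor.fetchall()
    if not rows:
        break
    for row in rows:
        print(row)  # process the row
    last_id = rows[-1][0]  # assumes id is the first selected column

cursor.close()
conn.close()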
Taken from the MySQL documentation:
The fetchone() method is used by fetchall() and fetchmany(). It is also used when a cursor is used as an iterator.
The following example shows two equivalent ways to process a query result. The first uses fetchone() in a while loop, the second uses the cursor as an iterator:
# Using a while loop
cursor.execute("SELECT * FROM employees")
row = cursor.fetchone()
while row is not None:
    print(row)
    row = cursor.fetchone()

# Using the cursor as iterator
cursor.execute("SELECT * FROM employees")
for row in cursor:
    print(row)
It also states that:
You must fetch all rows for the current query before executing new statements using the same connection.
If you are worried about performance, fetch the rows in batches with fetchmany(n) in a while loop until you have fetched all of the results, for example by wrapping it in a small generator:

def iter_rows(cursor, arraysize=1000):
    'An iterator that uses fetchmany to keep memory usage down'
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
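With that generator (iter_rows is just a name picked for the wrapper above), iteration stays flat and memory-bounded:

cursor.execute('SELECT * FROM TABLE')
for row in iter_rows(cursor, arraysize=10000):
    print(row)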
This behavior adheres to PEP 249, which describes how and which methods DB connectors should implement. A partial answer is given in this thread.
Basically, the implementation of fetchall vs. fetchmany vs. fetchone is up to the developers of the library, depending on the database's capabilities, but it would make sense, in the case of fetchmany and fetchone, for the unfetched/remaining results to be kept server-side until they are requested by another call or the cursor object is destroyed.
So, in conclusion, I think it is safe to assume that calling the execute method does not, in this case (MySQLdb), dump all the data from the query into memory.
I have a table with 4 million rows, and I use psycopg2 to execute a:
SELECT * FROM ..WHERE query
I had not heard of server-side cursors before, and I am reading that they are good practice when you expect lots of results.
I find the documentation a bit limited and I have some basic questions.
First I declare the server-side cursor as:
cur = conn.cursor('cursor-name')
then I execute the query as:
cur.itersize = 10000
sqlstr = "SELECT clmn1, clmn2 FROM public.table WHERE clmn1 LIKE 'At%'"
cur.execute(sqlstr)
My question is: What do I do now? How do I get the results?
Do I iterate through the rows as:
row = cur.fetchone()
while row:
    row = cur.fetchone()
or do I use fetchmany() and do this:
row = cur.fetchmany(10)
But in the second case, how can I "scroll" through the results?
Also, what is the point of itersize?
Psycopg2 has a nice interface for working with server-side cursors. This is a possible template to use:
import psycopg2

with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000
        query = "SELECT * FROM ..."
        cursor.execute(query)
        for row in cursor:
            print(row)  # process row
The code above creates the connection and automatically places the query result into a server-side cursor. The value itersize sets the number of rows that the client will pull down at a time from the server-side cursor. The value you use should balance the number of network calls against memory usage on the client. For example, if your result count is three million, an itersize value of 2000 (the default) will result in 1500 network calls. If the memory consumed by 2000 rows is light, increase that number.
When using for row in cursor you are of course working with one row at a time, but Psycopg2 will prefetch itersize rows at a time for you.
If you want to use fetchmany for some reason, you could do something like this:
while True:
    rows = cursor.fetchmany(100)
    if len(rows) > 0:
        for row in rows:
            print(row)  # process row
    else:
        break
This usage of fetchmany will not trigger a network call to the server for more rows until the prefetched batch has been exhausted. (This is a convoluted example that provides nothing over the code above, but demonstrates how to use fetchmany should there be a need.)
I tend to do something like this when I don't want to load millions of rows at once. You can turn a program into quite a memory hog if you load millions of rows into memory, especially if you're making Python domain objects out of those rows or something like that. I'm not sure the uuid4 in the name is necessary, but my thought is that I want individual server-side cursors that don't overlap if two processes make the same query.
from typing import Iterable
from uuid import uuid4

import psycopg2

def fetch_things() -> Iterable[MyDomainObject]:
    with psycopg2.connect(database_connection_string) as conn:
        with conn.cursor(name=f"my_name_{uuid4()}") as cursor:
            cursor.itersize = 500_000
            query = "SELECT * FROM ..."
            cursor.execute(query)
            for row in cursor:
                yield MyDomainObject(row)
I'm interested if anyone knows if this creates a storage problem on the SQL server or anything like that.
In addition to cur.fetchmany(n), you can use PostgreSQL cursors directly:
cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# ...
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# and so on
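One thing worth knowing: a cursor declared this way lives only for the duration of the enclosing transaction unless it is declared WITH HOLD. psycopg2 implicitly opens a transaction on the first statement, so the snippet works as shown, and the cursor is discarded on commit, rollback, or disconnect.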
I am trying to write a simple Python script that gets data from an API, stores it in a MySQL database, and performs some calculations on that data. I try to fetch all the data from a table into which I just inserted some rows, but the query keeps returning nothing.
The part that doesn't work:
import MySQLdb

db = MySQLdb.connect("localhost", "stijn", "password", "GW2")
curs = db.cursor()
curs.execute("select gw2_id, naam from PrijzenMats")
for record in curs.fetchall():
    curs2 = db.cursor()
    curs2.execute("insert into MaterialPrijzenLogs(mat,prijs,tijd) values(%s, %s, %s)", (record[1], prijs, tijd))
    db.commit()
curs2.execute("select prijs from MaterialPrijzenLogs")
top10 = len(curs2.fetchall())/10
print(str(len(curs2.fetchall())))
That last print keeps giving 0, even when I populate the table before running the script.
I solved the problem. Apparently fetchall() doesn't just read the data from the cursor the way a normal getter in Java would; it consumes the result set, so the rows are gone from the cursor afterwards. In my code I called fetchall() first to initialize a variable, and after that I tried to print the length of curs2.fetchall(), which had become empty at that point. This is easily solved by adding something like myList = curs2.fetchall() directly after curs2.execute("select prijs from MaterialPrijzenLogs") and using the myList variable in the rest of the code instead of calling curs2.fetchall() again. I did not include the declaration of the top10 variable in the code example in my original question because I thought it had nothing to do with the problem; I have edited the question so future readers can easily understand it.
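In code, the fix looks roughly like this (continuing the variable names from the snippet above):

curs2.execute("select prijs from MaterialPrijzenLogs")
myList = curs2.fetchall()   # fetch once and keep the rows
top10 = len(myList) / 10
print(str(len(myList)))     # now reports the real row count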
I have a very simple MySQL query, as follows:
db = getDB()
cursor = db.cursor()
cursor.execute('select * from users')
results = cursor.fetchall()
for row in results:
    process(row)
Suppose the users table has 1 billion records and the process method takes 10 ms per record.
The code above finishes fetching all of the data to the client side before the process method ever starts, which wastes time. Should I run the query and the processing in parallel?
I'd like to change fetchall() to fetchmany() and start a new thread that processes each retrieved batch while the cursor is fetching the next one.
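A minimal sketch of that idea, assuming the getDB and process functions from the question: one worker thread consumes batches from a bounded queue while the main thread keeps fetching. Note that with drivers like MySQLdb the default cursor buffers the whole result on execute(), so a server-side cursor (e.g. MySQLdb.cursors.SSCursor) is needed for fetchmany to actually stream rows.

import queue
import threading

BATCH_SIZE = 1000
SENTINEL = object()  # marks the end of the stream

def consume(q):
    # Worker: process batches as they arrive.
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        for row in batch:
            process(row)

db = getDB()
cursor = db.cursor()
cursor.execute('select * from users')

q = queue.Queue(maxsize=4)  # bounded, so fetching cannot run far ahead of processing
worker = threading.Thread(target=consume, args=(q,))
worker.start()

while True:
    rows = cursor.fetchmany(BATCH_SIZE)
    if not rows:
        q.put(SENTINEL)
        break
    q.put(rows)

worker.join()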