Having a strange issue with pymysql and python. I have a table where date_rec is one of 3 columns composing a primary key. If I do this select, it takes forever to get the result
query = f"SELECT * FROM string WHERE date_rec BETWEEN {date_before} AND {date_after} ORDER BY date_rec"
with connection.cursor() as cursor:
    cursor.execute(query)
    result = cursor.fetchall()  # fetchall() returns every row; fetchone() returns only one
    for row in result:
        print(row)
However if I add a limit of 5000, it works superfast, even though there are only 1290 records to be found. The 5000 number doesn't matter... 50,000 fixes the problem exactly the same way (just as fast). As long as it's more than 1290, I get all the records.
query = f"SELECT * FROM string WHERE date_rec BETWEEN {date_before} AND {date_after} ORDER BY date_rec LIMIT 5000"
with connection.cursor() as cursor:
    cursor.execute(query)
    result = cursor.fetchall()  # fetchall() returns every row; fetchone() returns only one
    for row in result:
        print(row)
Can someone explain what's happening here and how to make the first case work as fast as the second? Thanks.
EDIT:
3 columns compose primary key:
date_rec
customer_number
order_number
So I ran EXPLAIN in SQL workbench and got this:
[Screenshot: EXPLAIN output for the limit-less query]
[Screenshot: EXPLAIN output for the query with the 5000 limit]
So MySQL wasn't using the index for whatever reason. Adding "USE INDEX(PRIMARY)" to the query fixed the problem.
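For reference, the hinted query might look like the sketch below. The build_query helper is a hypothetical name, and the dates are passed as %s parameters (instead of being interpolated with an f-string) on the assumption that date_before and date_after are plain date values that pymysql can fill in at execute time.

```python
def build_query(use_primary_index: bool = True) -> str:
    """Return the SELECT from the question, optionally with the index hint.

    build_query is a hypothetical helper; the table and column names are
    the ones used in the question.
    """
    hint = " USE INDEX (PRIMARY)" if use_primary_index else ""
    return (
        f"SELECT * FROM `string`{hint} "
        "WHERE date_rec BETWEEN %s AND %s "
        "ORDER BY date_rec"
    )


query = build_query()
print(query)

# With an open pymysql connection (assumed to exist as `connection`):
# with connection.cursor() as cursor:
#     cursor.execute(query, (date_before, date_after))
#     for row in cursor.fetchall():
#         print(row)
```

USE INDEX is only a hint to the optimizer, so it is worth re-running EXPLAIN on the hinted query to confirm that the primary key is actually chosen.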
I can't explain why adding a LIMIT clause speeds up the query, but if you want to tune the query, then consider adding the following index:
CREATE INDEX idx ON string (date_rec);
This index will let MySQL quickly filter off records not inside the date range, and it also provides the ordering needed in the ORDER BY clause.
I have a table with 4 million rows and I use psycopg2 to execute a:
SELECT * FROM ..WHERE query
I hadn't heard of server-side cursors before, and I'm reading that they are good practice when you expect lots of results.
I find the documentation a bit limited and I have some basic questions.
First I declare the server-side cursor as:
cur = conn.cursor('cursor-name')
then I execute the query as:
cur.itersize = 10000
sqlstr = "SELECT clmn1, clmn2 FROM public.table WHERE clmn1 LIKE 'At%'"
cur.execute(sqlstr)
My question is: What do I do now? How do I get the results?
Do I iterate through the rows as:
row = cur.fetchone()
while row:
    row = cur.fetchone()
or I use fetchmany() and I do this:
row = cur.fetchmany(10)
But in the second case how can I "scroll" the results?
Also what is the point of itersize?
Psycopg2 has a nice interface for working with server side cursors. This is a possible template to use:
import psycopg2

with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000
        query = "SELECT * FROM ..."
        cursor.execute(query)
        for row in cursor:
            # process row
            pass
The code above creates the connection and automatically places the query result into a server side cursor. The value itersize sets the number of rows that the client will pull down at a time from the server side cursor. The value you use should balance number of network calls versus memory usage on the client. For example, if your result count is three million, an itersize value of 2000 (the default value) will result in 1500 network calls. If the memory consumed by 2000 rows is light, increase that number.
When using for row in cursor you are of course working with one row at a time, but Psycopg2 will prefetch itersize rows at a time for you.
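The trade-off described above is simple arithmetic. The sketch below uses a hypothetical round_trips helper to show how many batches a three-million-row result needs at a few itersize values; it ignores any final empty fetch that signals the end of the result set.

```python
import math


def round_trips(total_rows: int, itersize: int) -> int:
    """Number of batches needed to pull total_rows rows, itersize at a time."""
    return math.ceil(total_rows / itersize)


for itersize in (2_000, 20_000, 500_000):
    print(itersize, round_trips(3_000_000, itersize))
```

Larger batches mean fewer round trips, but each batch has to fit comfortably in client memory.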
If you want to use fetchmany for some reason, you could do something like this:
while True:
    rows = cursor.fetchmany(100)
    if not rows:
        break
    for row in rows:
        # process row
        pass
With a named (server-side) cursor, each fetchmany(100) call asks the server for the next 100 rows, so the batch size you pass, rather than itersize, determines the number of network round trips; itersize only applies when iterating with for row in cursor. (This is a convoluted example that provides nothing over the code above, but demonstrates how to use fetchmany should there be a need.)
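The same fetchmany loop can be wrapped in a generator so the calling code still sees one row at a time. A minimal sketch follows; iter_rows is a hypothetical helper name, and the standard-library sqlite3 module stands in for psycopg2 only so the example runs without a PostgreSQL server (both expose the DB-API fetchmany method).

```python
import sqlite3
from typing import Any, Iterator


def iter_rows(cursor: Any, batch_size: int = 100) -> Iterator[tuple]:
    """Yield rows one at a time while fetching them in batches."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield from rows


# sqlite3 is only a stand-in here; with psycopg2 you would pass a named cursor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t (n) VALUES (?)", [(i,) for i in range(250)])

cur = conn.cursor()
cur.execute("SELECT n FROM t ORDER BY n")
total = sum(1 for _ in iter_rows(cur, batch_size=100))
print(total)
```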
I tend to do something like this when I don't want to load millions of rows at once. You can turn a program into quite a memory hog if you load millions of rows into memory, especially if you're making Python domain objects out of those rows or something like that. I'm not sure if the uuid4 in the name is necessary, but my thought is that I want individual server-side cursors that don't overlap if two processes make the same query.
from typing import Iterable
from uuid import uuid4

import psycopg2


def fetch_things() -> Iterable[MyDomainObject]:
    with psycopg2.connect(database_connection_string) as conn:
        with conn.cursor(name=f"my_name_{uuid4()}") as cursor:
            cursor.itersize = 500_000
            query = "SELECT * FROM ..."
            cursor.execute(query)
            for row in cursor:
                yield MyDomainObject(row)
I'm interested if anyone knows if this creates a storage problem on the SQL server or anything like that.
In addition to cur.fetchmany(n), you can use PostgreSQL cursors directly:
cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# ...
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# and so on
My Python code produces a table with weeks as columns and accessed URLs as rows. To get the data for each cell, a query on a MySQL database is executed. The code runs very slowly. I've added indexes to the MySQL tables and this has not really helped. I thought it was because I was building the HTML table code with concatenation, but even using a list and join has not fixed the speed. The code runs slowly in both Django (using an additional database connection) and standalone Python. Any help speeding this up would be appreciated.
Example query that is called from a loop:
def get_postcounts(week):
    pageviews = 0
    cursor = connections['olap'].cursor()
    # Pass week as a query parameter instead of formatting it into the SQL string.
    sql = "SELECT SUM(F.pageview) AS pageviews FROM fact_coursevisits F INNER JOIN dim_dates D ON F.Date_Id = D.Id WHERE D.date_week = %s;"
    cursor.execute(sql, [week])
    result = cursor.fetchall()
    for row in result:
        if row[0] is not None:
            pageviews = int(row[0])
    cursor.close()
    return pageviews
It could be because of the number of queries you are executing (if you have to call this method a lot).
I would suggest fetching the view counts for all the weeks in a given period with one single query, then reading the per-week values off the results.
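One way to do that is to group by week and build a lookup table once. The sketch below is an assumption-laden illustration: get_postcounts_by_week is a hypothetical function, the sample rows are made up, and an in-memory sqlite3 database stands in for the MySQL "olap" connection so the example can run anywhere (with MySQL or Django the placeholders would be %s instead of ?).

```python
import sqlite3

# Stand-in data; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_dates (Id INTEGER PRIMARY KEY, date_week INTEGER);
    CREATE TABLE fact_coursevisits (Date_Id INTEGER, pageview INTEGER);
    INSERT INTO dim_dates VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO fact_coursevisits VALUES (1, 10), (2, 5), (3, 7);
    """
)


def get_postcounts_by_week(conn, first_week, last_week):
    """Return {week: total pageviews} for the whole range in a single query."""
    sql = """
        SELECT D.date_week, SUM(F.pageview) AS pageviews
        FROM fact_coursevisits F
        INNER JOIN dim_dates D ON F.Date_Id = D.Id
        WHERE D.date_week BETWEEN ? AND ?
        GROUP BY D.date_week
    """
    rows = conn.execute(sql, (first_week, last_week))
    return {week: int(total or 0) for week, total in rows}


pageviews_by_week = get_postcounts_by_week(conn, 1, 2)
print(pageviews_by_week)
```

Weeks with no visits do not appear in the result, so the table-building loop can use pageviews_by_week.get(week, 0) for each cell.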
I have an issue executing a query using python's pgdb using multiple case sensitive columns. The results of most queries return a python list, but if I issue a query against a table specifying multiple case sensitive columns the result is a string.
For example, I have a table in a PostgreSQL database with 3 case-sensitive boolean columns named:
(colA, colB, debug)
If I'm interested in selecting more than one column I receive a raw string result from the query:
query = 'SELECT ("colA", debug) FROM my_table;'
or
query = 'SELECT ("colA", "colB") FROM my_table;'
the query will return:
cursor.execute(query)
cursor.fetchone()
['(f,f)']
Issuing the following query:
query = "SELECT * FROM my_table;"
cursor.execute(query)
cursor.fetchone()
results in the expected python list:
[False, False, False]
and if I specify one column in quotes the result is expected:
query = 'SELECT ("colA") FROM my_table;'
cursor.execute(query)
cursor.fetchone()
[False]
I'm hoping someone can point me in the right direction to understand why I receive a raw string when selecting multiple case-sensitive columns. I could issue multiple queries to solve my problem, or just SELECT *, but to maintain robust code and protect myself against future changes to the table, I'd prefer to specify my columns.
If you enclose multiple columns in parentheses, you form an ad-hoc row type, resulting in a single value being returned.
SELECT ("colA", "colB") FROM my_table;
Drop the parens to get individual columns:
SELECT "colA", "colB" FROM my_table;
If in doubt about double-quoting, read the chapter about identifiers in the manual. My standing advice is to use legal, lower-case identifiers only in PostgreSQL.
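If the column list is assembled in Python, a small helper can double-quote each identifier and join them without the surrounding parentheses. This is only a sketch: quote_ident and build_select are hypothetical names, not part of pgdb, and the quoting rule used is the standard SQL one (wrap in double quotes, double any embedded double quote).

```python
from typing import Sequence


def quote_ident(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(columns: Sequence[str], table: str) -> str:
    """Build a SELECT with a plain column list (no parentheses around it)."""
    column_list = ", ".join(quote_ident(c) for c in columns)
    return f"SELECT {column_list} FROM {quote_ident(table)};"


query = build_select(["colA", "colB", "debug"], "my_table")
print(query)
```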
I just started out with programming and wrote a few lines of code in PyScripter using sqlite3.
The table "gather" is created beforehand. I then select certain rows from "gather" to put them into another table. I try to sort this table by a specific column, 'date', but it doesn't seem to work. It doesn't give me an error message or anything like that; it's just not sorted. If I try the same command (SELECT * FROM matches ORDER BY date) in sqlitemanager, it works fine on the exact same table! What is the problem here? I googled for quite some time, but I can't find a solution. It's probably something stupid I'm missing.
As I said, I'm a total newbie. I guess you'll all break out in tears looking at the code, so if you have any tips on how I can shorten the code or make it faster or whatever, you're very welcome :) (But everything works fine except the above-mentioned part.)
import sqlite3
connection = sqlite3.connect("gather.sqlite")
cursor1 = connection.cursor()
cursor1.execute('Drop table IF EXISTS matches')
cursor1.execute('CREATE TABLE matches(date TEXT, team1 TEXT, team2 TEXT)')
cursor1.execute('INSERT INTO matches (date, team1, team2) SELECT * FROM gather WHERE team1=? or team2=?', (a, a))
cursor1.execute("SELECT * FROM matches ORDER BY date")
connection.commit()
OK, I think I understand your problem. First of all: I'm not sure if that commit call is necessary at all. However, if it is, you'll definitely want that to be before your select statement. 'connection.commit()' is essentially saying, commit the changes I just made to the database.
Your second issue is that you are executing the select query but never actually doing anything with the results of the query.
try this:
import sqlite3
connection = sqlite3.connect("gather.sqlite")
cursor1 = connection.cursor()
cursor1.execute('Drop table IF EXISTS matches')
cursor1.execute('CREATE TABLE matches(date TEXT, team1 TEXT, team2 TEXT)')
cursor1.execute('INSERT INTO matches (date, team1, team2) SELECT * FROM gather WHERE team1=? or team2=?', (a, a))
connection.commit()
# directly iterate over the results of the query:
for row in cursor1.execute("SELECT * FROM matches ORDER BY date"):
    print(row)
You are executing the query, but never actually retrieving the results. There are two ways to do this with sqlite3. One is the way I showed above, where you use the return value of execute directly as an iterable object.
The other way is as follows:
import sqlite3
connection = sqlite3.connect("gather.sqlite")
cursor1 = connection.cursor()
cursor1.execute('Drop table IF EXISTS matches')
cursor1.execute('CREATE TABLE matches(date TEXT, team1 TEXT, team2 TEXT)')
cursor1.execute('INSERT INTO matches (date, team1, team2) SELECT * FROM gather WHERE team1=? or team2=?', (a, a))
connection.commit()
cursor1.execute("SELECT * FROM matches ORDER BY date")
# fetch all means fetch all rows from the last query. here you put the rows
# into their own result object instead of directly iterating over them.
db_result = cursor1.fetchall()
for row in db_result:
    print(row)
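For a version that can be run as-is, here is a minimal sketch using an in-memory database. The sample rows and the value of a are made up for illustration, since the real contents of "gather" are not shown in the question.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # in-memory stand-in for gather.sqlite
cursor1 = connection.cursor()

# Hypothetical sample data standing in for the pre-existing "gather" table.
cursor1.execute("CREATE TABLE gather (date TEXT, team1 TEXT, team2 TEXT)")
cursor1.executemany(
    "INSERT INTO gather (date, team1, team2) VALUES (?, ?, ?)",
    [
        ("2012-03-01", "red", "blue"),
        ("2012-01-15", "green", "red"),
        ("2012-02-10", "blue", "green"),
    ],
)

a = "red"  # assumed team name to filter on
cursor1.execute("DROP TABLE IF EXISTS matches")
cursor1.execute("CREATE TABLE matches (date TEXT, team1 TEXT, team2 TEXT)")
cursor1.execute(
    "INSERT INTO matches (date, team1, team2) "
    "SELECT date, team1, team2 FROM gather WHERE team1 = ? OR team2 = ?",
    (a, a),
)
connection.commit()

# The sorted rows only exist in the result set, so they have to be fetched.
cursor1.execute("SELECT * FROM matches ORDER BY date")
db_result = cursor1.fetchall()
for row in db_result:
    print(row)
```

Because the date column is TEXT, ORDER BY compares strings; that matches chronological order only when the dates are stored in a sortable format such as YYYY-MM-DD.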
Try moving the commit before the SELECT * (I'm not 100% sure that this is an issue). You then just need to fetch the results of the query :-) Add a line like res = cursor1.fetchall() after you've executed the SELECT. If you want to display them like in sqlitemanager, add
for hit in res:
    print('|'.join(hit))
at the bottom.
Edit: To address your issue of storing the sort order to the table:
I think what you're looking for is something like a clustered index. (Which doesn't actually sort the values in the table, but comes close; see here).
SQLite doesn't have such indexes, but you can simulate them by actually ordering the table. You can only do this once, as you're inserting the data. You would need an SQL command like the following:
INSERT INTO matches (date, team1, team2)
SELECT * FROM gather
WHERE team1=? or team2=?
ORDER BY date;
instead of the one you currently use.
See point 4 here, which is where I got the idea.
I want to run various select queries 100 million times, and I have approx. 1 million rows in a table. Therefore, I am looking for the fastest method to run all these select queries.
So far I have tried three different methods, and the results were similar.
The following three methods are, of course, not doing anything useful, but are purely for comparing performance.
first Method:
for i in range(100000000):
    cur.execute("select id from testTable where name = 'aaa';")
second method:
cur.execute("""PREPARE selectPlan AS
    SELECT id FROM testTable WHERE name = 'aaa' ;""")
for i in range(10000000):
    cur.execute("""EXECUTE selectPlan ;""")
third method:
def _data(n):
    cur = conn.cursor()
    for i in range(n):
        yield (i, 'test')

sql = """SELECT id FROM testTable WHERE name = 'aaa' ;"""
cur.executemany(sql, _data(10000000))
And the table is created like this:
cur.execute("""CREATE TABLE testTable ( id int, name varchar(1000) );""")
cur.execute("""CREATE INDEX indx_testTable ON testTable(name)""")
I thought that using the prepared statement functionality would really speed up the queries, but as it seems like this will not happen, I thought you could give me a hint on other ways of doing this.
This sort of benchmark is unlikely to produce any useful data, but the second method should be fastest: once the statement is prepared, it is stored in memory by the database server. Further calls to repeat the query do not require the text of the query to be transmitted, which saves a small amount of time.
This is likely to be moot, as the query is very small (sending the query text each time probably takes the same number of packets over the wire), and the same data will be served from cache for every request.
What's the purpose of retrieving such an amount of data at once? I don't know your situation, but I'd definitely page the results using LIMIT and OFFSET. Take a look at:
7.6. LIMIT and OFFSET
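A minimal sketch of that kind of paging is shown below. fetch_pages is a hypothetical helper, the 25 sample rows are made up, and sqlite3 stands in for PostgreSQL so the example runs without a server; the LIMIT ... OFFSET clause itself has the same form in both.

```python
import sqlite3
from typing import Iterator, List


def fetch_pages(conn: sqlite3.Connection, page_size: int) -> Iterator[List[tuple]]:
    """Yield the table one page at a time using LIMIT and OFFSET."""
    offset = 0
    while True:
        page = conn.execute(
            "SELECT id, name FROM testTable ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not page:
            return
        yield page
        offset += page_size


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testTable (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO testTable VALUES (?, ?)", [(i, "aaa") for i in range(25)]
)

page_sizes = [len(page) for page in fetch_pages(conn, page_size=10)]
print(page_sizes)
```

The ORDER BY matters here: without a stable ordering, consecutive pages are not guaranteed to be disjoint.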
If you just want to benchmark SQL all on its own and not mix Python into the equation, try pgbench.
http://developer.postgresql.org/pgdocs/postgres/pgbench.html
Also what is your goal here?