My Python code produces a table with weeks as columns and URLs accessed as rows. The data for each cell comes from a separate query against a MySQL database, and the code runs very slowly. I've added indexes to the MySQL tables, but that hasn't really helped. I thought the slowdown was because I was building the HTML table with string concatenation, but switching to a list and join() hasn't fixed the speed either. The code runs slowly both in Django (using an additional database connection) and in standalone Python. Any help with speeding this up would be appreciated.
Example query that is called from a loop:
def get_postcounts(week):
    pageviews = 0
    cursor = connections['olap'].cursor()
    sql = "SELECT SUM(F.pageview) AS pageviews FROM fact_coursevisits F INNER JOIN dim_dates D ON F.Date_Id = D.Id WHERE D.date_week = %s"
    cursor.execute(sql, (week,))
    result = cursor.fetchall()
    for row in result:
        if row[0] is not None:
            pageviews = int(row[0])
    cursor.close()
    return pageviews
It could be because of the number of queries you are executing (if you have to call this method many times).
I would suggest querying the view count and the week over the whole period in one single query and reading the results off that.
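For example, a single grouped query along these lines returns every week's total in one round trip (table and column names are taken from the question; the week-range parameters and the helper name are illustrative):

```python
def get_postcounts_by_week(cursor, first_week, last_week):
    """Fetch pageview totals for every week in one query.

    Returns a dict mapping week number -> pageview count, with
    weeks that had no visits filled in as 0 so every table cell
    has a value.
    """
    sql = (
        "SELECT D.date_week, SUM(F.pageview) "
        "FROM fact_coursevisits F "
        "INNER JOIN dim_dates D ON F.Date_Id = D.Id "
        "WHERE D.date_week BETWEEN %s AND %s "
        "GROUP BY D.date_week"
    )
    cursor.execute(sql, (first_week, last_week))
    counts = {week: 0 for week in range(first_week, last_week + 1)}
    for week, pageviews in cursor.fetchall():
        counts[week] = int(pageviews)
    return counts
```

The page-building loop then becomes one dict lookup per cell instead of one query per cell.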
I'm having a strange issue with pymysql and Python. I have a table where date_rec is one of the 3 columns composing the primary key. If I do this select, it takes forever to get the result:
query = f"SELECT * FROM string WHERE date_rec BETWEEN {date_before} AND {date_after} ORDER BY date_rec"
with connection.cursor() as cursor:
    cursor.execute(query)
    result = cursor.fetchall()
    for row in result:
        print(row)
However if I add a limit of 5000, it works superfast, even though there are only 1290 records to be found. The 5000 number doesn't matter... 50,000 fixes the problem exactly the same way (just as fast). As long as it's more than 1290, I get all the records.
query = f"SELECT * FROM string WHERE date_rec BETWEEN {date_before} AND {date_after} ORDER BY date_rec LIMIT 5000"
with connection.cursor() as cursor:
    cursor.execute(query)
    result = cursor.fetchall()
    for row in result:
        print(row)
Can someone explain what's happening here and how to make the first case work as fast as the second? Thanks.
EDIT:
3 columns compose primary key:
date_rec
customer_number
order_number
So I ran EXPLAIN in MySQL Workbench and got this:
[EXPLAIN plan screenshot: limit-less query]
[EXPLAIN plan screenshot: query with LIMIT 5000]
So MySQL wasn't using the index for whatever reason. Putting "USE INDEX(PRIMARY)" inside the query fixed the problem.
I don't have an explanation as to why adding a LIMIT clause speeds up the query, but if you want to tune the query, then consider adding the following index:
CREATE INDEX idx ON string (date_rec);
This index will let MySQL quickly filter off records not inside the date range, and it also provides the ordering needed in the ORDER BY clause.
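With that index in place, it also helps to let the driver bind the dates rather than interpolating them with an f-string, which sidesteps date-quoting mistakes and SQL injection. A minimal sketch (the fetch_range wrapper and the date bounds are illustrative, not from the question):

```python
def fetch_range(connection, start, end):
    # With an index on date_rec, MySQL can satisfy both the range
    # filter and the ORDER BY from the index. Placeholders (%s) let
    # the driver quote the dates, which the original f-string did not.
    query = (
        "SELECT * FROM string "
        "WHERE date_rec BETWEEN %s AND %s "
        "ORDER BY date_rec"
    )
    with connection.cursor() as cursor:
        cursor.execute(query, (start, end))
        return cursor.fetchall()
```

If you want to compare plans in Workbench, pymysql's cursor.mogrify(query, (start, end)) shows the exact SQL that will be sent.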
I'm using Python to loop through a number of SQL Server databases, creating tables using SELECT ... INTO, but when I run the script nothing happens, i.e. no error messages, and the tables are not created. Below is an example extract of what I am doing. Can anyone advise?
df = []  # dataframe of database names as example
for i, x in df.iterrows():
    SQL = """
    Drop table if exists {x}..table
    Select
        Name
    Into
        {y}..table
    From
        MainDatabase..Details
    """.format(x=x['Database'], y=x['Database'])
    cursor.execute(SQL)
    conn.commit()
It looks like your DB driver doesn't support multiple statements in a single execute call. Try splitting your query into two single statements, one with the DROP and the other with the SELECT ... INTO:
for i, x in df.iterrows():
    drop_sql = """
    Drop table if exists {x}..table
    """.format(x=x['Database'])
    select_sql = """
    Select
        Name
    Into
        {y}..table
    From
        MainDatabase..Details
    """.format(x=x['Database'], y=x['Database'])
    cursor.execute(drop_sql)
    cursor.execute(select_sql)
    cursor.commit()
And a second tip: your x=x['Database'] and y=x['Database'] are the same value; is this correct?
I have a table with 4 million rows, and I use psycopg2 to execute a:
SELECT * FROM ..WHERE query
I hadn't heard of server-side cursors before, and I've read that they are good practice when you expect lots of results.
I find the documentation a bit limited and I have some basic questions.
First I declare the server-side cursor as:
cur = conn.cursor('cursor-name')
then I execute the query as:
cur.itersize = 10000
sqlstr = "SELECT clmn1, clmn2 FROM public.table WHERE clmn1 LIKE 'At%'"
cur.execute(sqlstr)
My question is: What do I do now? How do I get the results?
Do I iterate through the rows as:
row = cur.fetchone()
while row:
    row = cur.fetchone()
Or do I use fetchmany() and do this:
row = cur.fetchmany(10)
But in the second case how can I "scroll" the results?
Also what is the point of itersize?
Psycopg2 has a nice interface for working with server side cursors. This is a possible template to use:
with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000
        query = "SELECT * FROM ..."
        cursor.execute(query)
        for row in cursor:
            pass  # process row
The code above creates the connection and automatically places the query result into a server side cursor. The value itersize sets the number of rows that the client will pull down at a time from the server side cursor. The value you use should balance number of network calls versus memory usage on the client. For example, if your result count is three million, an itersize value of 2000 (the default value) will result in 1500 network calls. If the memory consumed by 2000 rows is light, increase that number.
When using for row in cursor you are of course working with one row at a time, but Psycopg2 will prefetch itersize rows at a time for you.
If you want to use fetchmany for some reason, you could do something like this:
while True:
    rows = cursor.fetchmany(100)
    if len(rows) > 0:
        for row in rows:
            pass  # process row
    else:
        break
This usage of fetchmany will not trigger a network call to the server for more rows until the prefetched batch has been exhausted. (This is a convoluted example that provides nothing over the code above, but demonstrates how to use fetchmany should there be a need.)
I tend to do something like this when I don't want to load millions of rows at once. You can turn a program into quite a memory hog if you load millions of rows into memory. Especially if you're making python domain objects out of those rows or something like that. I'm not sure if the uuid4 in the name is necessary, but my thought is that I want individual server side cursors that don't overlap if two processes make the same query.
from typing import Iterable
from uuid import uuid4

import psycopg2

def fetch_things() -> Iterable[MyDomainObject]:
    with psycopg2.connect(database_connection_string) as conn:
        with conn.cursor(name=f"my_name_{uuid4()}") as cursor:
            cursor.itersize = 500_000
            query = "SELECT * FROM ..."
            cursor.execute(query)
            for row in cursor:
                yield MyDomainObject(row)
I'm interested if anyone knows if this creates a storage problem on the SQL server or anything like that.
In addition to cur.fetchmany(n), you can use PostgreSQL cursors directly:
cur.execute("declare foo cursor for select * from generate_series(1,1000000)")
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# ...
cur.execute("fetch forward 100 from foo")
rows = cur.fetchall()
# and so on
I am working on a program to clone rows in my database from one user to another. It works by selecting the rows, editing a few values, and then inserting them back.
I also need to store the newly inserted rowIDs with their existing counterparts so I can clone some other link tables later on.
My code looks like the following:
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')

selector = con.cursor(prepared=True)
insertor = con.cursor(prepared=True)

user_map = {}

selector.execute('SELECT * FROM users WHERE companyID = ?', (56,))
Row = namedtuple('users', selector.column_names)
for row in selector:
    curr_row = Row._make(row)
    new_row = curr_row._replace(userID=None, companyID=95)
    insertor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = insertor.lastrowid

selector.close()
insertor.close()
When running this code, I get the following error:
mysql.connector.errors.InternalError: Unread result found
I'm assuming this is because I am trying to run an INSERT while I am still looping over the SELECT, but I thought using two cursors would fix that. Why do I still get this error with multiple cursors?
I found a solution using fetchall(), but I was afraid that would use too much memory as there could be thousands of results returned from the SELECT.
import mysql.connector
from collections import namedtuple

con = mysql.connector.connect(host='127.0.0.1')
cursor = con.cursor(prepared=True)

user_map = {}

cursor.execute('SELECT * FROM users WHERE companyID = ?', (56,))
Row = namedtuple('users', cursor.column_names)
for curr_row in map(Row._make, cursor.fetchall()):
    new_row = curr_row._replace(userID=None, companyID=95)
    cursor.execute('INSERT INTO users VALUES(?,?,?,?)', tuple(new_row))
    user_map[curr_row.userID] = cursor.lastrowid

cursor.close()
This works, but it's not very fast. I was thinking that not using fetchall() would be quicker, but it seems if I do not fetch the full result set then MySQL yells at me.
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Is there a way to insert rows while looping over a result set without fetching the entire result set?
Yes. Use two MySQL connections: one for reading and the other for writing.
The performance impact isn't too bad, as long as you don't have thousands of instances of the program trying to connect to the same MySQL server.
One connection is reading a result set, and the other is inserting rows to the end of the same table, so you shouldn't have a deadlock. It would be helpful if the WHERE condition you use to read the table could explicitly exclude the rows you're inserting, if there's a way to tell the new rows apart from the old rows.
At some level, the performance impact of two connections doesn't matter because you don't have much choice. The only other way to do what you want to do is slurp the whole result set into RAM in your program, close your reading cursor, and then write.
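A sketch of the two-connection approach, adapted from the question's own code (the 4-column users schema and the company IDs are carried over from the question; the clone_users wrapper is illustrative):

```python
from collections import namedtuple

def clone_users(read_con, write_con, old_company_id, new_company_id):
    """Clone one company's users to another using two connections:
    one cursor streams the SELECT while the other performs INSERTs,
    avoiding mysql.connector's "Unread result found" error.

    Returns a dict mapping old userID -> new userID.
    """
    selector = read_con.cursor(prepared=True)
    insertor = write_con.cursor(prepared=True)
    user_map = {}

    selector.execute('SELECT * FROM users WHERE companyID = ?',
                     (old_company_id,))
    Row = namedtuple('users', selector.column_names)
    for row in selector:
        curr_row = Row._make(row)
        new_row = curr_row._replace(userID=None, companyID=new_company_id)
        insertor.execute('INSERT INTO users VALUES(?,?,?,?)',
                         tuple(new_row))
        user_map[curr_row.userID] = insertor.lastrowid

    write_con.commit()
    selector.close()
    insertor.close()
    return user_map
```

The two connections would be created with two separate mysql.connector.connect() calls; because the reads and writes happen on different connections, the SELECT result set can stay open while the INSERTs run.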
I have a table called "unprocessed" where I want to read 2000 rows, send them over HTTP to another server and then insert the rows into a "processed" table and remove them from the "unprocessed" table.
My python code roughly looks like this:
db = MySQLdb.connect("localhost", "username", "password", "database")
# prepare a cursor object using cursor() method
cursor = db.cursor()
# Select all the records not yet sent
sql = "SELECT * from unprocessed where SupplierIDToUse = 'supplier1' limit 0, 2000"
cursor.execute(sql)
results = cursor.fetchall()
for row in results:
    id = row[0]
    # <code is here for sending to other server - it takes about 1/2 a second>
    if sentcorrectly == "1":
        sql = "INSERT into processed (id, dateprocessed) VALUES ('%s', NOW())" % (id)
        try:
            inserted = cursor.execute(sql)
        except:
            print "Failed to insert"
        if inserted:
            print "Inserted"
            sql = "DELETE from unprocessed where id = '%s'" % (id)
            try:
                deleted = cursor.execute(sql)
            except:
                print "Failed to delete id from the unprocessed table, even though it was saved in the processed table."
db.close()
sys.exit(0)
I want to be able to run this code concurrently so that I can increase the speed of sending these records to the other server over HTTP.
At the moment, if I try to run the code concurrently, I get multiple copies of the same data sent to the other server and saved into the "processed" table, as the SELECT query picks up the same ids in multiple instances of the code.
How can I lock the records when I select them and then process each record as a row before moving them to the "processed" table?
The table was MyISAM, but I've converted it to InnoDB today, as I realise there's probably a better way of locking the records with InnoDB.
Based on your comment reply:
One of two solutions would be a client-side Python master process that collects the record IDs for all 2000 records and then splits them into chunks to be processed by sub-workers.
Short version: your choices are to delegate the work or to rely on a possibly tricky row-locking mechanism. I would recommend the former approach, as it can scale up with the aid of a message queue.
The delegation logic would use multiprocessing:
import multiprocessing

records = get_all_unprocessed_ids()
pool = multiprocessing.Pool(5)  # create 5 workers
pool.map(process_records, records)
That would create 2000 tasks and run 5 at a time, or you could split records into chunks using the solution outlined here:
How do you split a list into evenly sized chunks?
pool.map(process_records, chunks(records, 100))
would create 20 lists of 100 records that would be processed in batches of 5
Edit:
syntax error - the signature is map(func, iterable[, chunksize]) and I had left out the argument for func.
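Putting the pieces together, a runnable sketch of the chunked version (chunks is a local helper, and process_records here is only a stand-in for the question's HTTP-send-then-move logic):

```python
import multiprocessing

def chunks(lst, n):
    """Split lst into successive n-sized chunks (the last may be shorter)."""
    return [lst[i:i + n] for i in range(0, len(lst), n)]

def process_records(id_chunk):
    """Worker: in the real program this would send each record over HTTP,
    insert it into the processed table, and delete it from unprocessed.
    Here it just reports how many ids it handled."""
    return len(id_chunk)

if __name__ == '__main__':
    record_ids = list(range(2000))  # stand-in for get_all_unprocessed_ids()
    pool = multiprocessing.Pool(5)  # 5 workers, as in the answer
    handled = pool.map(process_records, chunks(record_ids, 100))
    pool.close()
    pool.join()
    # 20 chunks of 100 records each, processed 5 at a time
```

Because each worker receives a disjoint chunk of ids, no two processes ever touch the same row, which removes the need for row locking entirely.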