We are trying to populate a database column with random numbers using Python and Django, but we have a lot of rows to go through, and the task takes about 20 minutes.
This is our code. We have 210,000 rows to go through:
import random

def populate(request):
    all_accounts = Account.objects.all()
    count = 0
    for account in all_accounts:
        # note: randint already returns an int, so round(..., 2) is a no-op here
        account.avg_deal_size = round(random.randint(10, 200000), 2)
        account.save()
        print(f"Counter of accounts: {count}")
        count += 1
Thank you!
Assuming you don't need any logic in .save(), or signals to be executed, or the like, just use SQL and have your RDBMS do the heavy lifting. This should execute in seconds, if that.
from django.db import connection

def populate(request):
    with connection.cursor() as cur:
        # random() is PostgreSQL; on MySQL use RAND(). The cast to numeric is
        # needed because PostgreSQL's two-argument round() takes a numeric.
        cur.execute(
            "UPDATE myapp_account "
            "SET avg_deal_size = round((10 + random() * 199990)::numeric, 2)"
        )
        # an UPDATE returns no result set; rowcount holds the affected-row count
        print(f"{cur.rowcount} rows affected.")
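If you do need the values generated in Python (say, to reuse them elsewhere), a middle ground is Django's bulk_update, which batches the writes instead of issuing one UPDATE per row. A minimal sketch, assuming the model lives in myapp and that a two-decimal value (random.uniform rather than randint) is what you actually wanted:

import random
from myapp.models import Account  # app path assumed from the question

def populate(request):
    accounts = list(Account.objects.only("id"))
    for account in accounts:
        account.avg_deal_size = round(random.uniform(10, 200000), 2)
    # one UPDATE per batch of 1000 rows instead of one per row
    Account.objects.bulk_update(accounts, ["avg_deal_size"], batch_size=1000)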
Related
I currently have a list of IDs, approximately 10,000 of them. I need to update all rows in the MySQL table whose id is in the inactive_ids list that you see below, changing their active status to 'No', which is a column in the table.
I am using the mysql.connector Python library.
When I run the code below, each iteration of the for loop takes about 0.7 seconds to execute. That's about a 2-hour run time to change all 10,000 IDs. Is there a more optimal/quicker way to do this?
# inactive_ids are unique strings something like shown below
# inactive_ids = ['a9okeoko', 'sdfhreaa', 'xsdfasy', ..., 'asdfad']
import mysql.connector

# initialize connection
mydb = mysql.connector.connect(
    user="REMOVED",
    password="REMOVED",
    host="REMOVED",
    database="REMOVED"
)

# initialize cursor
mycursor = mydb.cursor(buffered=True)

# Function to execute multiple lines
def alter(state, msg, count):
    result = mycursor.execute(state, multi=True)
    result.send(None)
    print(str(count), ': ', msg, result)
    count += 1
    return count

# Try to execute, throw exception if it fails
try:
    count = 0
    for Id in inactive_ids:
        # SAVE THE QUERY AS STRING
        sql_update = "UPDATE test_table SET Active = 'No' WHERE NoticeId = '" + Id + "'"
        # ALTER
        count = alter(sql_update, "done", count)
    # commits all changes to the database
    mydb.commit()
except Exception as e:
    mydb.rollback()
    raise e
Do it with a single query that uses IN (...) instead of multiple queries.
placeholders = ','.join(['%s'] * len(inactive_ids))
sql_update = f"""
UPDATE test_table
SET Active = 'No'
WHERE NoticeId IN ({placeholders})
"""
mycursor.execute(sql_update, inactive_ids)
mydb.commit()

Passing the IDs as query parameters also removes the SQL injection risk of the string concatenation in your loop.
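If the ID list ever grows much larger, a hedged refinement is to send the update in batches so no single statement gets unwieldy (the chunk size of 1000 is an arbitrary choice, not a mysql.connector limit):

CHUNK = 1000
for start in range(0, len(inactive_ids), CHUNK):
    batch = inactive_ids[start:start + CHUNK]
    placeholders = ','.join(['%s'] * len(batch))
    # one parameterized UPDATE per batch of up to CHUNK ids
    mycursor.execute(
        f"UPDATE test_table SET Active = 'No' WHERE NoticeId IN ({placeholders})",
        batch,
    )
mydb.commit()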
I have a function in my code which is as follows:
async def register():
    db = connector.connect(host='localhost', user='root', password='root', database='testing')
    cursor = db.cursor()
    cursor.execute('LOCK TABLES Data WRITE;')
    cursor.execute('SELECT Total_Reg FROM Data;')
    data = cursor.fetchall()
    reg = data[0][0]
    print(reg)
    if reg >= 30:
        print("CLOSED!")
        return
    await asyncio.sleep(1)
    cursor.execute('UPDATE Data SET Total_Reg = Total_Reg + 1 WHERE Id = 1')
    cursor.execute('COMMIT;')
    print("REGISTERED!")
    db.close()
In the case of multiple instances of this register function running at the same time, an unexpected infinite loop occurs, blocking my entire code. Why is that? Also, if it's a deadlock [I assume], why does my program not raise any error? Please tell me why this is happening, and what can be done to prevent this issue?
A much simpler construct:
db = connector.connect(...)
cursor = db.cursor()
cursor.execute('UPDATE Data SET Total_Reg = Total_Reg + 1 WHERE Id = 1 AND Total_Reg < 30')
db.commit()
# mysql.connector's execute() returns None; check rowcount for affected rows
if cursor.rowcount:
    print("REGISTERED!")
else:
    print("CLOSED!")
db.close()
So:
Don't use LOCK TABLES, not until you understand transactions, and then rarely
Use SQL to enforce the constraints you want
Use the affected row count (cursor.rowcount) to see whether any rows were changed; a fuller sketch follows this list
Don't use sleep statements
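Putting those points together, a minimal sketch of register() built around the guarded UPDATE (same connector and asyncio setup as the question; cursor.rowcount tells you whether the cap had already been reached):

async def register():
    db = connector.connect(host='localhost', user='root',
                           password='root', database='testing')
    cursor = db.cursor()
    # the WHERE clause enforces the cap atomically: no lock, re-read, or sleep
    cursor.execute('UPDATE Data SET Total_Reg = Total_Reg + 1 '
                   'WHERE Id = 1 AND Total_Reg < 30')
    db.commit()
    print("REGISTERED!" if cursor.rowcount else "CLOSED!")
    db.close()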
I have to get the most recently added data from a database. To solve this, I save the last-read row number into a Python shelve. The following code works for a simple query like select * from rows. My code is:
from pyodbc import connect
from peewee import *
import random
import shelve
import connection

d = shelve.open("data.shelve")
db = SqliteDatabase("data.db")

class Rows(Model):
    valueone = IntegerField()
    valuetwo = IntegerField()

    class Meta:
        database = db

def CreateAndPopulate():
    db.connect()
    db.create_tables([Rows], safe=True)
    with db.atomic():
        for i in range(100):
            row = Rows(valueone=random.randrange(0, 100),
                       valuetwo=random.randrange(0, 100))
            row.save()
    db.close()

def get_last_primary_key():
    return d.get('max_row', 0)

def doWork():
    query = "select * from rows"  # could be anything
    conn = connection.Connection("localhost", "", "SQLite3 ODBC Driver", "data.db", "", "")
    max_key_query = "SELECT MAX(%s) FROM %s" % ("id", "rows")
    max_primary_key = conn.fetch_one(max_key_query)[0]
    print("max_primary_key " + str(max_primary_key))
    last_primary_key = get_last_primary_key()
    print("last_primary_key " + str(last_primary_key))
    if max_primary_key == last_primary_key:
        print("no new records")
    elif max_primary_key > last_primary_key:
        print("there are new records")
        optimizedQuery = query + " where id>" + str(last_primary_key)
        print(optimizedQuery)
        for data in conn.fetch_all(optimizedQuery):
            print(data)
        d['max_row'] = max_primary_key
        # print(d['max_row'])

# CreateAndPopulate()  # to populate data
doWork()
While the code works for a simple query without a where clause, the query can be anything from simple to complex, with joins and multiple where clauses. In that case, the part where I append the where clause will fail. How can I get only the most recently added data from the database, whatever the query is?
PS: I cannot modify the database. I just have to fetch from it.
Use an OFFSET clause. For example:
SELECT * FROM [....] WHERE [....] LIMIT -1 OFFSET 1000
In your query, replace 1000 with a parameter bound to your shelve variable; in SQLite, LIMIT -1 means no limit, and a LIMIT clause is required before OFFSET can be used. That will skip the first "shelve" number of rows and grab only the newer ones. You may want to consider a more robust refactor eventually, but good luck.
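A sketch of how that wiring might look with the question's objects (conn and the shelve d come from the code above; since the custom Connection wrapper's parameter binding is unknown, the offset is appended as a plain string):

skipped = d.get('max_row', 0)  # rows already read, saved in the shelve
# OFFSET applies to the whole result set, however complex the SELECT is
optimized_query = query + " LIMIT -1 OFFSET " + str(skipped)
for data in conn.fetch_all(optimized_query):
    print(data)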
I have written a small app that uses MySQL to get a list of products that need updating on our Magento website.
Python then applies these updates and marks the product in the db as complete.
My original code (pseudocode to show the overview):
class Mysqltools:
    def get_products(self):
        db = pymysql.connect(host= .... )
        mysqlcursor = db.cursor(pymysql.cursors.DictCursor)
        sql = "select * from x where y = %s"
        mysqlcursor.execute(sql % (z,))
        rows = mysqlcursor.fetchall()
        mysqlcursor.close()
        db.close()
        return rows

    def write_products(self, sku, name, id):
        db = pymysql.connect(host= .... )
        mysqlcursor = db.cursor(pymysql.cursors.DictCursor)
        sql = "update table set sku = %s, name = %s, id = %s ....."
        mysqlcursor.execute(sql % (sku, name, id))
        mysqlcursor.close()
        db.close()
This was working OK, but we were getting a pause on each database connection.
I did a bit of research and did the following:
class Mysqltools:
    def __init__(self):
        self.db = pymysql.connect(host= .... )

    def get_products(self):
        mysqlcursor = self.db.cursor(pymysql.cursors.DictCursor)
        sql = "select * from x where y = %s"
        mysqlcursor.execute(sql % (z,))
        rows = mysqlcursor.fetchall()
        mysqlcursor.close()
        return rows

    def write_products(self, sku, name, id):
        mysqlcursor = self.db.cursor(pymysql.cursors.DictCursor)
        sql = "update table set sku = %s, name = %s, id = %s ....."
        mysqlcursor.execute(sql % (sku, name, id))
        mysqlcursor.close()
        self.db.commit()
This gave a MASSIVE speed improvement. However, get_products only returned results on the first iteration; once it was called a second time, it found 0 products to update, even though running the same SQL against the db directly showed a number of rows returned.
Am I doing something wrong with the connections?
I have also tried moving the db = outside of the class and referencing it, but that still gives the same issue.
UPDATE
Doing some testing: if I remove the DictCursor from the cursor, I get the correct rows returned each time (I've just created a quick loop to keep checking for records).
Is the DictCursor doing something I am unaware of?
UPDATE 2
I've removed the DictCursor, and tried the following.
Create a while True loop which calls my get_products method.
In MySQL, change some rows so that they should be found.
If I go from having 0 possible rows to find, then change some so they should be found, my code just shows 0 found and loops stating this.
If I go from having x possible rows to find, then change it to 0 in MySQL, my code continues to loop showing the x possible rows.
Ok, the answer to this is as follows:
db = pymysql.connect(host=...., user=...)

class MySqlTools:
    def get_products(self):
        mysqlcursor = db.cursor(pymysql.cursors.DictCursor)
        sql = "select * from x where y = %s"
        mysqlcursor.execute(sql % (z,))
        rows = mysqlcursor.fetchall()
        mysqlcursor.close()
        db.commit()
        return rows
This then allows you to re-use the db connection and remove the overhead of creating and closing a connection each and every time. The db.commit() after each read is the important part: with InnoDB's default REPEATABLE READ isolation, a connection keeps seeing the snapshot taken by its first read until its transaction ends, which is why the re-used connection kept returning stale results. Committing closes that transaction so the next query sees fresh data.
In testing, downloading 500 orders from our website and writing them to a db went from 16 minutes to <3 minutes.
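An alternative sketch worth knowing about: pymysql.connect() accepts an autocommit flag, which ends the implicit transaction after every statement, so each query sees the latest committed rows with no need to remember db.commit() after every read (connection details below are placeholders):

import pymysql

# autocommit=True: every statement runs in its own transaction,
# so repeated SELECTs never get stuck on a stale snapshot
db = pymysql.connect(host="localhost", user="user", password="secret",
                     database="shop", autocommit=True)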
I have 2 tables; one is users and the other records user actions. I want to count the number of actions per user and record this in the users table. There are ~100k users and the following code takes 6 hours! There must be a better way!
def calculate_invites():
    sql_db.execute("SELECT id, uid FROM users")
    for row in sql_db:
        id = row['id']
        uid = row['uid']
        sql1 = "SELECT COUNT(1) FROM actions WHERE uid = %s"
        sql_db.execute(sql1, uid)
        count_actions = sql_db.fetchone()["COUNT(1)"]
        sql = "UPDATE users SET count_actions=%s WHERE uid=%s"
        sql_db.execute(sql, (count_actions, uid))
You can do this all as one statement:
update users
set count_actions = (select count(*) from actions a where a.uid = users.uid)
No for loop. No multiple queries. Do in SQL what you can do in SQL. Generally, row-by-row work is something you want to push into the database rather than loop over in the application.
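From Python, that is a single execute call on the question's cursor (sql_db as in the original code; commit afterwards if autocommit is off):

sql_db.execute(
    "UPDATE users "
    "SET count_actions = (SELECT COUNT(*) FROM actions a WHERE a.uid = users.uid)"
)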
Offered only as an alternative, since Gordon's answer is probably faster (MySQL join-update syntax):
update users
join (
    select uid, count(*) as num_actions
    from actions
    group by uid
) x on users.uid = x.uid
set count_actions = x.num_actions

One behavioral difference worth noting: the correlated-subquery version sets count_actions to 0 for users with no actions at all, while this join version leaves those rows untouched.