PostgreSQL update functionality takes too long - Python

I have a table called products which holds the columns id, data.
data is a JSONB column and id is a unique ID.
Bulk adding 10k products took me nearly 4 minutes. With a small number of products the update works just fine, but for a huge number of products it takes a lot of time. How can I optimize this?
I am trying to bulk update 200k+ products, and right now it's taking me more than 5 minutes.
import json

# model and session come from the application's SQLAlchemy setup
updated_product_ids = []
for product in products:
    new_product = model.Product(data=product['data'])
    new_product.data = 'updated data'
    new_product.id = product.get('id')
    updated_product_ids.append(new_product)

def bulk_update(product_ids_arr):
    def update_query(count):
        return f"""
            UPDATE pricing.products
            SET data = :{count}
            WHERE id = :{count + 1}
        """

    queries = []
    params = {}
    count = 1
    for sku in product_ids_arr:
        queries.append(update_query(count))
        params[str(count)] = json.dumps(sku.data)
        params[str(count + 1)] = sku.id
        count += 2

    session.execute(';'.join(queries), params)  # This is what takes so long..

bulk_update(updated_product_ids)
I thought executing raw SQL would be faster, but it is still taking a lot of time. I am trying to update only about 8k products and it takes nearly 3 minutes or more.
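Not an answer from the thread, but a minimal sketch of one common speed-up, assuming the same session, table and model attributes (data, id) as above: run a single parameterized UPDATE with a list of parameter dictionaries, so the driver executes it as an executemany instead of parsing one giant concatenated statement. On psycopg2, batching options such as execute_values or SQLAlchemy's executemany_mode can cut round trips further.

import json
from sqlalchemy import text

def bulk_update(products):
    # one statement, many parameter sets -> DBAPI executemany
    params = [{"data": json.dumps(p.data), "id": p.id} for p in products]
    session.execute(
        text("UPDATE pricing.products SET data = :data WHERE id = :id"),
        params,
    )
    session.commit()

bulk_update(updated_product_ids)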

Related

Importing database takes a lot of time

I am trying to import a table that contains 81462 rows into a dataframe using the following code:
sql_conn = pyodbc.connect('DRIVER={SQL Server}; SERVER=server.database.windows.net; DATABASE=server_dev; uid=user; pwd=pw')
query = "select * from product inner join brand on Product.BrandId = Brand.BrandId"
df = pd.read_sql(query, sql_conn)
And the whole process takes a very long time. I think I am already 30 minutes in and it's still processing. I assume this is not normal - so how else should I import it so the processing time is quicker?
Thanks to @RomanPerekhrest. FETCH NEXT imported everything within 1-2 minutes.
SELECT product.Name, brand.Name as BrandName, description, size FROM Product inner join brand on product.brandid=brand.brandid ORDER BY Name OFFSET 1 ROWS FETCH NEXT 80000 ROWS ONLY
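Not from the original answer, but a hedged sketch of another lever on the pandas side: select only the columns you need and stream the result with read_sql's chunksize parameter, which returns an iterator of DataFrames instead of building one huge frame in a single call. The connection string and column list are the ones from the question and answer above.

import pandas as pd
import pyodbc

sql_conn = pyodbc.connect('DRIVER={SQL Server}; SERVER=server.database.windows.net; '
                          'DATABASE=server_dev; uid=user; pwd=pw')
query = ("SELECT product.Name, brand.Name AS BrandName, description, size "
         "FROM Product INNER JOIN brand ON product.brandid = brand.brandid")

# chunksize=10000 streams the result in batches of 10k rows
chunks = pd.read_sql(query, sql_conn, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)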

SQLAlchemy MySQL optimize query

General:
I need to create a statistics tool from a given DB with many hundreds of thousands of entries. I never need to write to the DB, only read data.
Problem:
I have a user table; in my case I select 20k users (between two dates). Now I need to select only those users (out of these 20k) who spent money at least once.
To do so I have 3 different tables where the information about whether a user spent money is saved. (So we are working with 4 tables in total):
User, Transaction_1, Transaction_2, Transaction_3
What I did so far:
In the model of the User class I have created a property which checks whether the user appears once in one of the Transaction tables:
@property
def spent_money_once(self):
    spent_money_atleast_once = False

    in_transactions = Transaction_1.query.filter(Transaction_1.user_id == self.id).first()
    if in_transactions:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    in_transactionsVK = Transaction_2.query.filter(Transaction_2.user_id == self.id).first()
    if in_transactionsVK:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    in_transactionsStripe = Transaction_3.query.filter(Transaction_3.user_id == self.id).first()
    if in_transactionsStripe:
        spent_money_atleast_once = True
        return spent_money_atleast_once

    return spent_money_atleast_once
Then I created two counters for male and female users, so I can count how many of these 20k users spent money at least once:
males_payed_atleast_once = 0
females_payed_atleast_once = 0

for male_user in male_users.all():
    if male_user.spent_money_once is True:
        males_payed_atleast_once += 1

for female_user in female_users.all():
    if female_user.spent_money_once is True:
        females_payed_atleast_once += 1
But this takes really long to calculate, around 40-60 minutes. I have never worked with such large amounts of data; maybe this is normal?
Additional info:
In case you wonder what male_users and female_users look like:
# Note: is this even efficient? If .all() completes the query, then I need to store the .all() results in variables, otherwise every time I call .all() it takes time
global all_users
global male_users
global female_users
all_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date)
male_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "1")
female_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "2")
I am trying to save certain queries in global variables to improve performance.
I am using Python 3 | Flask | Sqlalchemy for this task. The DB is MySQL.
I have now tried a completely different approach using join, and it is way faster: it completes in about 10 seconds what previously took around 60 minutes:
# males
paying_males_1 = male_users.join(Transaction_1, Transaction_1.user_id == Users.id).all()
paying_males_2 = male_users.join(Transaction_2, Transaction_2.user_id == Users.id).all()
paying_males_3 = male_users.join(Transaction_3, Transaction_3.user_id == Users.id).all()
males_payed_all = paying_males_1 + paying_males_2 + paying_males_3
males_payed_atleast_once = len(set(males_payed_all))
I simply join each table and call .all(); the results are plain lists. After that I merge all the lists and cast them to a set, so I am left with only the unique users. The last step is to count them by calling len() on the set.
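Not part of the original post, but a hedged variation on the same idea: instead of materializing every joined row with .all() and deduplicating in Python, the database can count the distinct paying users directly with EXISTS subqueries. Model and column names follow the question's code and may need adjusting.

from sqlalchemy import exists, or_

# True for a user who appears in at least one of the three transaction tables
spent_money = or_(
    exists().where(Transaction_1.user_id == Users.id),
    exists().where(Transaction_2.user_id == Users.id),
    exists().where(Transaction_3.user_id == Users.id),
)

males_payed_atleast_once = (
    Users.query
    .filter(Users.date_added >= start_date,
            Users.date_added <= end_date,
            Users.gender == "1",
            spent_money)
    .count()
)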
Assuming you need to aggregate the info of the 3 tables together before counting, this will be a bit faster:
SELECT userid, SUM(ct) AS total
FROM (
    ( SELECT userid, COUNT(*) AS ct FROM trans1 GROUP BY userid )
    UNION ALL
    ( SELECT userid, COUNT(*) AS ct FROM trans2 GROUP BY userid )
    UNION ALL
    ( SELECT userid, COUNT(*) AS ct FROM trans3 GROUP BY userid )
) AS per_table
GROUP BY userid
HAVING total >= 2
I recommend you test this in the mysql command-line tool, then figure out how to convert it to Python 3 | Flask | SQLAlchemy.
Funny thing about packages that "hide the database" -- you still need to understand how the database works if you are going to do anything non-trivial.
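A hedged sketch of what running that SQL from Flask-SQLAlchemy could look like, assuming a db.session and the placeholder table/column names from the SQL above (trans1..trans3, userid), which would need to be swapped for the real ones; the HAVING threshold here is 1 for "spent money at least once", where the example above used 2.

from sqlalchemy import text

PAYING_USERS_SQL = text("""
    SELECT userid, SUM(ct) AS total
    FROM (
        SELECT userid, COUNT(*) AS ct FROM trans1 GROUP BY userid
        UNION ALL
        SELECT userid, COUNT(*) AS ct FROM trans2 GROUP BY userid
        UNION ALL
        SELECT userid, COUNT(*) AS ct FROM trans3 GROUP BY userid
    ) AS per_table  -- MySQL requires an alias on the derived table
    GROUP BY userid
    HAVING total >= 1
""")

rows = db.session.execute(PAYING_USERS_SQL).fetchall()
paying_user_ids = {row[0] for row in rows}
print(len(paying_user_ids), "users spent money at least once")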

Fastest way to retrieve data from SQLite database

I have the following code:
import sqlite3
import numpy as np

conn = sqlite3.connect('test.db')
c = conn.cursor()

c.execute("SELECT timestamp FROM stockData")
times = np.array(c.fetchall())
times = times.reshape(len(times))

buys = []
sells = []

def price(time):
    c.execute("SELECT aapl FROM stockData WHERE timestamp = :t", {'t': time})
    return c.fetchone()[0]

def trader(time):
    p = float(price(time))
    if p < 186:
        buys.append(p)
    if p > 186.5:
        sells.append(p)

vfunc = np.vectorize(trader)
vfunc(times[:1000])
This takes about a minute to execute. The reason, as far as I can see, is that I am inefficiently querying individual data points from my SQLite DB. I realize I could get around this by fetching all relevant data points at once with something like the following:
SELECT aapl FROM stockData WHERE aapl < 186;
But I am dead set on having my code loop through all timestamps one by one, as I want to maintain proper chronology.
My database is about 5 million rows long, so the current method is not feasible. How can I loop through and retrieve this data most efficiently?
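Not from the thread, but a minimal sketch of one way to keep the strict chronological loop while avoiding one query per timestamp: select timestamp and price together, ordered by timestamp, and iterate over the cursor. Table and column names follow the question.

import sqlite3

conn = sqlite3.connect('test.db')
c = conn.cursor()

buys = []
sells = []

# one query, rows streamed back in chronological order
c.execute("SELECT timestamp, aapl FROM stockData ORDER BY timestamp")
for timestamp, aapl in c:
    p = float(aapl)
    if p < 186:
        buys.append(p)
    elif p > 186.5:
        sells.append(p)

conn.close()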

Listing database objects efficiently

I'm working on a page that lists companies and their employees. Employees have sales. These are saved in a database. Now I need to list all of them. My problem is that the current solution is not fast. One page load takes over 15 seconds.
Currently I have done the following:
companies = {}
employees = {}

for company in Company.objects.all():
    sales_count = 0
    sales_sum = 0
    companies[company.id] = {}
    companies[company.id]["name"] = company.name

    for employee in company.employees.all():
        employee_sales_count = 0
        employee_sales_sum = 0
        employees[employee.id] = {}
        employees[employee.id]["name"] = employee.first_name + " " + employee.last_name

        for sale in employee.sales.all():
            employee_sales_count += 1
            employee_sales_sum += sale.total

        employees[employee.id]["sales_count"] = employee_sales_count
        employees[employee.id]["sales_sum"] = employee_sales_sum
        sales_count += employee_sales_count
        sales_sum += employee_sales_sum

    companies[company.id]["sales_count"] = sales_count
    companies[company.id]["sales_sum"] = sales_sum
I'm new to Python, not sure if this is a "pythonic" way to do things.
This makes 1500 queries to the database with 100 companies and some employees and sales for each. How should I improve my program to make it efficient?
Avoid nesting database queries in loops - it's a sure way to performance hell! :-)
Since you're counting all sales for all employees, I suggest building your employee and sales dicts on their own. Don't forget to import defaultdict, and you may want to look up how group by and summing/counting work in Django :-)
Let's see... this should give you an indication of where to go from here:
from collections import defaultdict

# build employee dict
employee_qset = Employee.objects.all()
employees = defaultdict(dict)
for emp in employee_qset.iterator():
    employees[emp.company_id][emp.id] = emp

# build sales dict
sales_qset = Sales.objects.all()
sales = defaultdict(dict)
for sale in sales_qset.iterator():
    # you could do some calculations here, like sums, or better yet do sums
    # via annotate and group_by in the database
    sales[sale.employee_id][sale.id] = sale

# get companies
companies_qset = Companies.objects.all()
companies = {company.id: company for company in companies_qset.iterator()}
for company in companies.values():
    # assign employees, assign sales, etc.
    pass
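A hedged sketch of the annotate/aggregate route mentioned in the comment above, assuming the related names from the question's code (company.employees, employee.sales, sale.total) and a company foreign key on Employee; the per-employee counts and sums are computed by the database in one query.

from django.db.models import Count, Sum

employees_qs = (
    Employee.objects
    .select_related('company')
    .annotate(sales_count=Count('sales'),
              sales_sum=Sum('sales__total'))
)

companies = {}
employees = {}
for emp in employees_qs:
    employees[emp.id] = {
        "name": emp.first_name + " " + emp.last_name,
        "sales_count": emp.sales_count,
        "sales_sum": emp.sales_sum or 0,  # Sum() is None when there are no sales
    }
    comp = companies.setdefault(emp.company_id,
                                {"name": emp.company.name,
                                 "sales_count": 0, "sales_sum": 0})
    comp["sales_count"] += emp.sales_count
    comp["sales_sum"] += emp.sales_sum or 0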

How to make an efficient query for extracting entries of all days in a database in sets?

I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is that I make 440 queries and do the averaging afterward. But this is very time-consuming, since for every query the whole database is searched for related entries. I imagine there must be a more efficient way of doing this.
I am doing this in Python, and I am just learning SQL. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table \
                 WHERE EXTRACT(year from datetime_)=%s \
                 AND EXTRACT(month from datetime_)=%s \
                 AND EXTRACT(day from datetime_)=%s",
                (time_cur.year, time_cur.month, time_cur.day))
    Y = numpy.array([0] * (NoFields - 1))
    n = 0.0
    while True:
        row = cur.fetchone()
        if row is None:
            break
        n = n + 1
        Y = Y + numpy.array(row[1:])
    Data[nN][:] = Y / n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the numpy functions, so I don't understand what you are averaging. If you show your table and the logic used to get the average...
But this is how to get a daily average for a single column:
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
    ''',
    (time_cur, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print(day[0], day[1], day[2])
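Not part of the original answer, but as a hedged follow-up: if the goal is still to fill the X list and the numpy Data array from the question's loop, the grouped result can be dropped straight into them. Column names follow the question's table; extend the avg() list to all fourteen columns as needed.

import numpy

cursor.execute('''
    select
        datetime_::date as day,
        avg(c1), avg(c2), avg(c3)  -- extend with avg(c4) ... avg(c14)
    from os_table
    where datetime_ >= %s and datetime_ < %s
    group by 1
    order by 1
    ''',
    (date_begin, date_end)
)
rows = cursor.fetchall()

X = [row[0] for row in rows]                                # one date per day
Data = numpy.array([row[1:] for row in rows], dtype=float)  # daily averages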
This answer uses SQL Server syntax - I am not sure how different PostgreSQL is, but it should be fairly similar. You may find things like the DATEADD, DATEDIFF and CONVERT statements are different (actually, almost certainly the CONVERT statement - just convert the date to a varchar instead; I am only using it as a report name, so it is not vital). You should be able to follow the theory of this, even if the code doesn't run in PostgreSQL without tweaking.
First, create a Reports table (you will use this to link to the actual table you want to report on):
CREATE TABLE Report_Periods (
report_name VARCHAR(30) NOT NULL PRIMARY KEY,
report_start_date DATETIME NOT NULL,
report_end_date DATETIME NOT NULL,
CONSTRAINT date_ordering
CHECK (report_start_date <= report_end_date)
)
Next, populate the report table with the dates you need to report on. There are many ways to do this - the method I've chosen here only uses the days you need, but you could populate it with all the dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
[DatePartOnly] AS StartDate,
DATEADD(ms, -3, DATEADD(dd,1,[DatePartOnly])) AS EndDate
FROM ( SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
FROM os_table ) AS M
Note that in SQL Server the smallest time increment allowed is 3 milliseconds - so the above statement adds 1 day, then subtracts 3 milliseconds to create a start and end datetime for each day. Again, PostgreSQL may have different values.
This means you can join the reports table back to your os_table to get averages, counts etc. very simply:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
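Not part of the original answer, but a hedged sketch of what the same "report periods" idea could look like on the PostgreSQL side of the question, using psycopg2 and the os_table/datetime_ names from above. date_trunc replaces the DATEADD/DATEDIFF arithmetic, and a half-open range ([start, start + 1 day)) avoids the "end of day minus 3 milliseconds" workaround entirely.

import psycopg2

conn = psycopg2.connect('dbname=cpn')  # connection details are placeholders
cur = conn.cursor()

cur.execute("""
    WITH report_periods AS (
        SELECT DISTINCT
               date_trunc('day', datetime_)                     AS report_start,
               date_trunc('day', datetime_) + interval '1 day'  AS report_end
        FROM os_table
    )
    SELECT r.report_start::date AS report_name,
           AVG(t.c1)            AS avg_c1,
           COUNT(t.c1)          AS num_values
    FROM os_table t
    JOIN report_periods r
      ON t.datetime_ >= r.report_start
     AND t.datetime_ <  r.report_end
    GROUP BY r.report_start
    ORDER BY r.report_start
""")
rows = cur.fetchall()
conn.close()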
