SQLAlchemy MySQL optimize query - Python

General:
I need to create a statistics tool from a given DB with many hundreds of thousands of entries. I never need to write to the DB, only read data.
Problem:
I have a user table, and in my case I select 20k users (between two dates). Now I need to select only those users (out of the 20k) who spent money at least once.
Whether a user spent money is recorded in 3 different tables, so we are working with 4 tables in total:
User, Transaction_1, Transaction_2, Transaction_3
What I did so far:
In the model of the User class I created a property which checks whether the user appears at least once in one of the Transaction tables:
@property
def spent_money_once(self):
    spent_money_atleast_once = False
    in_transactions = Transaction_1.query.filter(Transaction_1.user_id == self.id).first()
    if in_transactions:
        spent_money_atleast_once = True
        return spent_money_atleast_once
    in_transactionsVK = Transaction_2.query.filter(Transaction_2.user_id == self.id).first()
    if in_transactionsVK:
        spent_money_atleast_once = True
        return spent_money_atleast_once
    in_transactionsStripe = Transaction_3.query.filter(Transaction_3.user_id == self.id).first()
    if in_transactionsStripe:
        spent_money_atleast_once = True
        return spent_money_atleast_once
    return spent_money_atleast_once
Then I created two counters for male and female users, so I can count how many of these 20k users spent money at least once:
males_payed_atleast_once = 0
females_payed_atleast_once = 0

for male_user in male_users.all():
    if male_user.spent_money_once is True:
        males_payed_atleast_once += 1

for female_user in female_users.all():
    if female_user.spent_money_once is True:
        females_payed_atleast_once += 1
But this takes really long to calculate, around 40-60 minutes. I have never worked with such huge amounts of data; maybe this is normal?
Additional info:
In case you are wondering what male_users and female_users look like:
# Note: is this even efficient? If all() completes the query, then I need to store the .all() results into variables, otherwise every time I call .all() it takes time
global all_users
global male_users
global female_users
all_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date)
male_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "1")
female_users = Users.query.filter(Users.date_added >= start_date, Users.date_added <= end_date, Users.gender == "2")
I am trying to save certain queries in global variables to improve performance.
I am using Python 3 | Flask | Sqlalchemy for this task. The DB is MySQL.

I have now tried a completely different approach using joins, and it is much faster: the query completes in ~10 seconds instead of ~60 minutes:
# males
paying_males_1 = male_users.join(Transaction_1, Transaction_1.user_id == Users.id).all()
paying_males_2 = male_users.join(Transaction_2, Transaction_2.user_id == Users.id).all()
paying_males_3 = male_users.join(Transaction_3, Transaction_3.user_id == Users.id).all()
males_payed_all = paying_males_1 + paying_males_2 + paying_males_3
males_payed_atleast_once = len(set(males_payed_all))
I simply join each Transaction table and call .all(); the results are plain lists. After that I merge all the lists and cast them to a set, which leaves only the unique users. The last step is to count them by calling len() on the set.
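The same count can also be expressed as a single query per gender with EXISTS subqueries, which avoids materialising three joined lists in Python. This is only a minimal sketch, assuming the Users and Transaction_* models from above and Flask-SQLAlchemy's Query.exists():

from sqlalchemy import or_

# Count male users who appear in at least one transaction table (one query).
# The EXISTS subqueries correlate on Users.id; female_users works the same way.
paying_males_atleast_once = male_users.filter(
    or_(
        Transaction_1.query.filter(Transaction_1.user_id == Users.id).exists(),
        Transaction_2.query.filter(Transaction_2.user_id == Users.id).exists(),
        Transaction_3.query.filter(Transaction_3.user_id == Users.id).exists(),
    )
).count()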

Assuming you need to aggregate the info of the 3 tables together before counting, this will be a bit faster:
SELECT userid, SUM(ct) AS total
FROM (
    ( SELECT userid, COUNT(*) AS ct FROM trans1 GROUP BY userid )
    UNION ALL
    ( SELECT userid, COUNT(*) AS ct FROM trans2 GROUP BY userid )
    UNION ALL
    ( SELECT userid, COUNT(*) AS ct FROM trans3 GROUP BY userid )
) AS u
GROUP BY userid
HAVING total >= 2
I recommend you test this in the mysql command-line tool, then figure out how to convert it to Python 3 | Flask | SQLAlchemy.
Funny thing about packages that "hide the database": you still need to understand how the database works if you are going to do anything non-trivial.
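If you do want to run the raw aggregation from SQLAlchemy rather than the mysql client, a minimal sketch with text() could look like the following; the transaction table and column names here are assumptions based on the question, so adjust them to your schema:

from sqlalchemy import text

sql = text("""
    SELECT userid, SUM(ct) AS total
    FROM (
        ( SELECT user_id AS userid, COUNT(*) AS ct FROM transaction_1 GROUP BY user_id )
        UNION ALL
        ( SELECT user_id AS userid, COUNT(*) AS ct FROM transaction_2 GROUP BY user_id )
        UNION ALL
        ( SELECT user_id AS userid, COUNT(*) AS ct FROM transaction_3 GROUP BY user_id )
    ) AS per_table
    GROUP BY userid
    HAVING total >= 1  -- >= 1 means "spent at least once"; raise it to require more purchases
""")

# db is the Flask-SQLAlchemy instance; each row is (userid, total)
rows = db.session.execute(sql).fetchall()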

Related

postgresql update functionality takes too long

I have a table called products which holds the columns id and data. data here is a JSONB; id is a unique ID.
Bulk adding 10k products took me nearly 4 minutes. With a smaller number of products the update works just fine, but for a huge number of products it takes a lot of time. How can I optimize this?
I am trying to bulk update 200k+ products, and it's taking me more than 5 minutes right now.
updated_product_ids = []
for product in products:
    new_product = model.Product(id, data=product['data'])
    new_product['data'] = 'updated data'
    new_product['id'] = product.get('id')
    updated_product_ids.append(new_product)

def bulk_update(product_ids_arr):
    def update_query(count):
        return f"""
        UPDATE pricing.products
        SET data = :{count}
        WHERE id = :{count + 1}
        """

    queries = []
    params = {}
    count = 1
    for sku in product_ids_arr:
        queries.append(update_query(count))
        params[str(count)] = json.dumps(sku.data)
        params[str(count + 1)] = sku.id
        count += 2

    session.execute(';'.join(queries), params)  # This is what takes so long..

bulk_update(updated_product_ids)
I thought using raw SQL to execute this would be faster, but it's taking a lot of time. Even updating only about 8k products takes nearly 3 minutes or more.
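For what it's worth, one common alternative (my own sketch, not from the post) is to send a single parameterized UPDATE with a list of parameter dictionaries, which SQLAlchemy dispatches to the driver as executemany, instead of concatenating thousands of statements into one string. It assumes the same session, pricing.products table and product objects as above:

import json
from sqlalchemy import text

# One statement, many parameter sets: SQLAlchemy runs a list of dicts as executemany
update_stmt = text("UPDATE pricing.products SET data = :data WHERE id = :id")
params = [{"data": json.dumps(sku.data), "id": sku.id} for sku in updated_product_ids]

session.execute(update_stmt, params)
session.commit()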

Improve speed or find faster alternative to SQL Update

I have a 68m rows x 77 columns table (general_table) on a MySQL server that contains, amongst other things, user_id, user_name, date and media_channel.
There are rare instances (83k of them) where there is a user_id but no user_name; in those rows the value of user_name is "-". I can get the missing information from the users_table table.
To update the values in general_table I use the following UPDATE statement, but given the size of the table it takes a really long time, so I'm looking for an alternative.
UPDATE
general_table as a,
users_table as b
SET a.user_name = b.user_name
where a.date > '2020-01-01'
and a.user_id = b.user_id
and a.media_channel = b.media_channel
and a.user_name = '-';
Answers using Pandas, PyMySQL or SQLAlchemy are also welcome
For those requesting an EXPLAIN: keep in mind that it only works for SELECT queries, not for updates.
For this query:
UPDATE general_table g
JOIN users_table u ON g.user_id = u.user_id AND g.media_channel = u.media_channel
SET g.user_name = u.user_name
WHERE g.date > '2020-01-01' AND g.user_name = '-'
You want indexes on general_table(user_name, date, user_id, media_channel) and on users_table(user_id, media_channel, user_name).
Note: It will still take some time to update 83k rows, so you might want to do this in batches.
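Since answers using SQLAlchemy were welcomed, here is a rough sketch (mine, not part of the answer above) of running that UPDATE in date-window batches from Python, so each statement touches a smaller slice of the 83k rows; session is assumed to be an already-configured SQLAlchemy session:

from datetime import date, timedelta
from sqlalchemy import text

update_sql = text("""
    UPDATE general_table g
    JOIN users_table u ON g.user_id = u.user_id AND g.media_channel = u.media_channel
    SET g.user_name = u.user_name
    WHERE g.date >= :batch_start AND g.date < :batch_end AND g.user_name = '-'
""")

batch_start = date(2020, 1, 1)
while batch_start < date.today():
    batch_end = batch_start + timedelta(days=30)
    session.execute(update_sql, {"batch_start": batch_start, "batch_end": batch_end})
    session.commit()  # committing per batch keeps each transaction small
    batch_start = batch_end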

If a condition is not fulfilled return data for that user, else another query

I have a table with these data:
ID, Name, LastName, Date, Type
I have to query the table for the user with ID 1.
Get the row; if the Type of that user is not 2, return that user, otherwise return all users that have the same LastName and Date.
What would be the most efficient way to do this ?
What I had done is :
query1 = SELECT * FROM clients WHERE ID = 1
query2 = SELECT * FROM clients WHERE LastName = %s AND Date = %s
And I execute the first query
cursor.execute(sql)
rows = cursor.fetchall()
for row in rows:
    if row['Type'] == 2:
        cursor.execute(sql2(row['LastName'], row['Date']))
        # save the results
    else:
        results = rows  # ?
Is there a more efficient way of doing this using joins? For example, if I only have a LEFT JOIN, how would I also check whether the Type of the user is 2? And if there are multiple rows to be returned, how do I assign them to an array of objects in Python?
Just do two queries to avoid loops here:
q1 = """
SELECT c.* FROM clients c WHERE c.ID = 1
"""
q2 = """
SELECT b.* FROM clients b
JOIN (SELECT c.*
      FROM clients c
      WHERE c.ID = 1
        AND c.Type = 2) a
  ON a.LastName = b.LastName
 AND a.Date = b.Date
"""
Then you can execute both queries and have all the results you want without loops: your loop would execute n queries, where n is the number of matching rows, as opposed to grabbing it all in one join in a single pass. Without more specifics about the desired data structure of the final results (it seems you only care about saving them), this should give you what you want.
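A rough sketch of wiring that up in Python without a row-by-row loop, assuming a DB-API cursor (e.g. PyMySQL with a DictCursor) like the one in the question:

# Fetch the user with ID 1 first
cursor.execute("SELECT * FROM clients WHERE ID = %s", (1,))
rows = cursor.fetchall()

if rows and rows[0]['Type'] == 2:
    # Type is 2: return everyone sharing that user's LastName and Date
    cursor.execute(
        "SELECT * FROM clients WHERE LastName = %s AND Date = %s",
        (rows[0]['LastName'], rows[0]['Date']),
    )
    results = cursor.fetchall()
else:
    # Type is not 2: return the user itself
    results = rows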

Coalesce results in a QuerySet

I have the following models:
class Property(models.Model):
    name = models.CharField(max_length=100)

    def is_available(self, avail_date_from, avail_date_to):
        # Check against the owner's specified availability
        available_periods = self.propertyavailability_set \
            .filter(date_from__lte=avail_date_from,
                    date_to__gte=avail_date_to) \
            .count()
        if available_periods == 0:
            return False
        return True


class PropertyAvailability(models.Model):
    de_property = models.ForeignKey(Property, verbose_name='Property')
    date_from = models.DateField(verbose_name='From')
    date_to = models.DateField(verbose_name='To')
    rate_sun_to_thurs = models.IntegerField(verbose_name='Nightly rate: Sun to Thurs')
    rate_fri_to_sat = models.IntegerField(verbose_name='Nightly rate: Fri to Sat')
    rate_7_night_stay = models.IntegerField(blank=True, null=True, verbose_name='Weekly rate')
    minimum_stay_length = models.IntegerField(default=1, verbose_name='Min. length of stay')

    class Meta:
        unique_together = ('date_from', 'date_to')
Essentially, each Property has its availability specified with instances of PropertyAvailability. From this, the Property.is_available() method checks to see if the Property is available during a given period by querying against PropertyAvailability.
This code works fine except for the following scenario:
Example data
Using the current Property.is_available() method, if I were to search for availability between the 2nd of Jan, 2017 and the 5th of Jan, 2017 it'd work because it matches #1.
But if I were to search between the 4th of Jan, 2017 and the 8th of Jan, 2017, it wouldn't return anything, because the date range overlaps multiple records - it matches neither #1 nor #2.
I read this earlier (which introduced a similar problem and solution through coalescing results) but had trouble writing that using Django's ORM or getting it to work with raw SQL.
So, how can I write a query (preferably using the ORM) that will do this? Or perhaps there's a better solution that I'm unaware of?
Other notes
Both avail_date_from and avail_date_to must match up with PropertyAvailability's date_from and date_to fields:
avail_date_from must be >= PropertyAvailability.date_from
avail_date_to must be <= PropertyAvailability.date_to
This is because I need to query that a Property is available within a given period.
Software specs
Django 1.11
PostgreSQL 9.3.16
My solution would be to check whether the date_from or the date_to fields of PropertyAvailability are contained in the period we're interested in. I do this using Q objects. As mentioned in the comments above, we also need to include the PropertyAvailability objects that encompass the entire period we're interested in. If we find more than one instance, we must check if the availability objects are continuous.
from datetime import timedelta
from django.db.models import Q


class Property(models.Model):
    name = models.CharField(max_length=100)

    def is_available(self, avail_date_from, avail_date_to):
        date_range = (avail_date_from, avail_date_to)
        # Check against the owner's specified availability
        query_filter = (
            # One of the records' date fields falls within date_range
            Q(date_from__range=date_range) |
            Q(date_to__range=date_range) |
            # OR date_range falls between one record's date_from and date_to
            Q(date_from__lte=avail_date_from, date_to__gte=avail_date_to)
        )
        available_periods = self.propertyavailability_set \
            .filter(query_filter) \
            .order_by('date_from')

        # BEWARE! This might suck up a lot of memory if the number of returned rows is large!
        # I do this because negative indexing of a `QuerySet` is not supported.
        available_periods = list(available_periods)

        if len(available_periods) == 1:
            # must check if availability matches the range
            return (
                available_periods[0].date_from <= avail_date_from and
                available_periods[0].date_to >= avail_date_to
            )
        elif len(available_periods) > 1:
            # must check if the periods are continuous and match the range
            if (
                available_periods[0].date_from > avail_date_from or
                available_periods[-1].date_to < avail_date_to
            ):
                return False

            period_end = available_periods[0].date_to
            for available_period in available_periods[1:]:
                if available_period.date_from - period_end > timedelta(days=1):
                    return False
                else:
                    period_end = available_period.date_to
            return True
        else:
            return False
I feel the need to mention though, that the database model does not guarantee that there are no overlapping PropertyAvailability objects in your database. In addition, the unique constraint should most likely contain the de_property field.
What you should be able to do is aggregate the data you wish to query against, and combine any overlapping (or adjacent) ranges.
Postgres doesn't have any way of doing this: it has operators for union and combining adjacent ranges, but nothing that will aggregate collections of overlapping/adjacent ranges.
However, you can write a query that will combine them, although how to do it with the ORM is not obvious (yet).
Here is one solution (left as a comment on http://schinckel.net/2014/11/18/aggregating-ranges-in-postgres/#comment-2834554302, and tweaked to combine adjacent ranges, which appears to be what you want):
SELECT int4range(MIN(LOWER(value)), MAX(UPPER(value))) AS value
FROM (SELECT value,
             MAX(new_start) OVER (ORDER BY value) AS left_edge
      FROM (SELECT value,
                   CASE WHEN LOWER(value) <= MAX(le) OVER (ORDER BY value)
                        THEN NULL
                        ELSE LOWER(value) END AS new_start
            FROM (SELECT value,
                         lag(UPPER(value)) OVER (ORDER BY value) AS le
                  FROM range_test
                 ) s1
           ) s2
     ) s3
GROUP BY left_edge;
One way to make this queryable from within the ORM is to put it in a Postgres VIEW, and have a model that references this.
However, it is worth noting that this queries the whole source table, so you may want to have filtering applied; probably by de_property.
Something like:
CREATE OR REPLACE VIEW property_aggregatedavailability AS (
    SELECT de_property,
           MIN(date_from) AS date_from,
           MAX(date_to) AS date_to
    FROM (SELECT de_property,
                 date_from,
                 date_to,
                 MAX(new_from) OVER (PARTITION BY de_property
                                     ORDER BY date_from) AS left_edge
          FROM (SELECT de_property,
                       date_from,
                       date_to,
                       CASE WHEN date_from <= MAX(le) OVER (PARTITION BY de_property
                                                            ORDER BY date_from)
                            THEN NULL
                            ELSE date_from
                       END AS new_from
                FROM (SELECT de_property,
                             date_from,
                             date_to,
                             LAG(date_to) OVER (PARTITION BY de_property
                                                ORDER BY date_from) AS le
                      FROM property_propertyavailability
                     ) s1
               ) s2
         ) s3
    GROUP BY de_property, left_edge
)
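To query that view through the ORM as suggested, a minimal sketch of an unmanaged model might look like the following. Field and column names mirror the view above; since the view has no real primary key, marking de_property as the pk is only a pragmatic workaround and an assumption on my part:

from django.db import models

class AggregatedAvailability(models.Model):
    # The view has no genuine pk; Django requires one, so de_property stands in for it
    de_property = models.ForeignKey('Property', models.DO_NOTHING,
                                    db_column='de_property', primary_key=True)
    date_from = models.DateField()
    date_to = models.DateField()

    class Meta:
        managed = False  # Django never creates or migrates the view
        db_table = 'property_aggregatedavailability'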
As an aside, you might want to consider using Postgres's date range objects, because then you can prevent start > finish (automatically), but also prevent overlapping periods for a given property, using exclusion constraints.
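A sketch of what that could look like as a raw-SQL Django migration; the app label, migration name, and the de_property_id column name are assumptions, and PostgreSQL needs the btree_gist extension for the equality part of the constraint:

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [
        ('properties', '0001_initial'),  # hypothetical app/migration names
    ]

    operations = [
        migrations.RunSQL(
            sql="""
                CREATE EXTENSION IF NOT EXISTS btree_gist;
                ALTER TABLE property_propertyavailability
                ADD CONSTRAINT no_overlapping_availability
                EXCLUDE USING gist (
                    de_property_id WITH =,
                    daterange(date_from, date_to, '[]') WITH &&
                );
            """,
            reverse_sql="""
                ALTER TABLE property_propertyavailability
                DROP CONSTRAINT no_overlapping_availability;
            """,
        ),
    ]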
Finally, an alternative solution might be to have a derived table that stores unavailability, based on taking the available periods and reversing them. This makes writing the query simpler, as you can write a direct overlap and negate it (i.e., a property is available for a given period iff there are no overlapping unavailable periods). I do that in a production system for staff availability/unavailability, where many checks need to be made. Note that this is a denormalised solution, and it relies on trigger functions (or other updates) to keep it in sync.

How to make an efficient query for extracting entries of all days in a database in sets?

I have a database that includes 440 days of several series with a sampling time of 5 seconds. There is also missing data.
I want to calculate the daily average. The way I am doing it now is that I make 440 queries and do the averaging afterward. But this is very time consuming, since for every query the whole database is searched for related entries. I imagine there must be a more efficient way of doing this.
I am doing this in Python, and I am just learning SQL. Here's the query section of my code:
time_cur = date_begin
Data = numpy.zeros(shape=(N, NoFields - 1))
X = []
nN = 0
while time_cur < date_end:
    X.append(time_cur)
    cur = con.cursor()
    cur.execute("SELECT * FROM os_table \
                 WHERE EXTRACT(year from datetime_)=%s \
                 AND EXTRACT(month from datetime_)=%s \
                 AND EXTRACT(day from datetime_)=%s",
                (time_cur.year, time_cur.month, time_cur.day))
    Y = numpy.array([0] * (NoFields - 1))
    n = 0.0
    while True:
        row = cur.fetchone()
        if row is None:
            break
        n = n + 1  # count the row only after confirming it exists
        Y = Y + numpy.array(row[1:])
    Data[nN][:] = Y / n
    nN = nN + 1
    time_cur = time_cur + datetime.timedelta(days=1)
And, my data looks like this:
datetime_,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14
2012-11-13-00:07:53,42,30,0,0,1,9594,30,218,1,4556,42,1482,42
2012-11-13-00:07:58,70,55,0,0,2,23252,55,414,2,2358,70,3074,70
2012-11-13-00:08:03,44,32,0,0,0,11038,32,0,0,5307,44,1896,44
2012-11-13-00:08:08,36,26,0,0,0,26678,26,0,0,12842,36,1141,36
2012-11-13-00:08:13,33,26,0,0,0,6590,26,0,0,3521,33,851,33
I appreciate your suggestions.
Thanks
Iman
I don't know the np function, so I don't understand what you are averaging. If you show your table and the logic to get the average...
But this is how to get a daily average per column:
import psycopg2

conn = psycopg2.connect('host=localhost4 port=5432 dbname=cpn')
cursor = conn.cursor()
cursor.execute('''
    select
        datetime_::date as day,
        avg(c1) as c1_average,
        avg(c2) as c2_average
    from os_table
    where datetime_ between %s and %s
    group by 1
    order by 1
    ''',
    (time_cur, date_end)
)
rs = cursor.fetchall()
conn.close()

for day in rs:
    print(day[0], day[1], day[2])
This answer uses SQL Server syntax - I am not sure how different PostgreSQL is, but it should be fairly similar. You may find things like the DATEADD, DATEDIFF and CONVERT statements are different (actually, almost certainly the CONVERT statement - just convert the date to a varchar instead; I am only using it as a report name, so it is not vital). You should be able to follow the theory of this, even if the code doesn't run in PostgreSQL without tweaking.
First, create a reports table (you will use this to link to the actual table you want to report on):
CREATE TABLE Report_Periods (
    report_name       VARCHAR(30) NOT NULL PRIMARY KEY,
    report_start_date DATETIME NOT NULL,
    report_end_date   DATETIME NOT NULL,
    CONSTRAINT date_ordering
        CHECK (report_start_date <= report_end_date)
)
Next, populate the report table with the dates you need to report on. There are many ways to do this - the method I've chosen here will only use the days you need, but you could create this with all dates you are ever likely to use, so you only have to do it once.
INSERT INTO Report_Periods (report_name, report_start_date, report_end_date)
SELECT CONVERT(VARCHAR, [DatePartOnly], 0) AS DateName,
       [DatePartOnly] AS StartDate,
       DATEADD(ms, -3, DATEADD(dd, 1, [DatePartOnly])) AS EndDate
FROM ( SELECT DISTINCT DATEADD(DD, DATEDIFF(DD, 0, datetime_), 0) AS [DatePartOnly]
       FROM os_table ) AS M
Note that in SQL Server the smallest time increment allowed in a DATETIME is 3 milliseconds, so the statement above adds 1 day and then subtracts 3 milliseconds to create a start and end datetime for each day. Again, PostgreSQL may use different values.
This means you can simply link the reports table back to your os_table to get averages, counts, etc.:
SELECT AVG(value) AS AvgValue, COUNT(value) AS NumValues, R.report_name
FROM os_table AS T
JOIN Report_Periods AS R ON T.datetime_>= R.report_start_date AND T.datetime_<= R.report_end_date
GROUP BY R.report_name
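To pull those per-period aggregates back into Python, here is a minimal sketch using the same DB-API connection (con) as in the question, once the SQL has been adapted to your database's dialect; c1 is one of the columns from the sample data above:

cur = con.cursor()
cur.execute("""
    SELECT AVG(c1) AS avg_c1, COUNT(c1) AS num_values, R.report_name
    FROM os_table AS T
    JOIN Report_Periods AS R
      ON T.datetime_ >= R.report_start_date AND T.datetime_ <= R.report_end_date
    GROUP BY R.report_name
""")
for avg_c1, num_values, report_name in cur.fetchall():
    print(report_name, avg_c1, num_values)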
