Django MySQL Join to increase performance - python

I am searching for medical items.
I have 10 areas. For each area there is a table of medical dispensaries and another table of the medical items available at those dispensaries, so 10 dispensary tables and 10 item tables in total.
Now I am searching for a medical item among all of these dispensaries.
For each area, I get the dispensary objects:
a_list = Washington_dispensaries.objects.all()
Then I check whether this particular item is available in each dispensary:
if category:
    a_lists = []
    for dispensary in a_list:
        items = dispensary.washington_dispensaries_item_set.filter(item__product_type__name=category)
        if items:
            result_cat_items.append(items)
            a_lists.append(dispensary)
    a_list = a_lists
It takes about 15 seconds to complete this search across all 10 regions.
If I were writing this in PHP, I would use a MySQL join to make the query faster. How do I make the query faster here?

You're already doing two joins, to item and product_type. So just do one more: instead of getting the dispensaries separately, do it as part of the filter. It's hard to give exact syntax without seeing your models, but something like:
DispensaryItem.objects.filter(dispensary__location='Washington', item__product_type__name=category)
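For illustration only, here is a minimal sketch of what that could look like. The model and field names (Dispensary, Item, ProductType, DispensaryItem, location) are assumptions, not taken from your code; the point is that one table with a location/area column lets the database do all the joining in a single query:

# Hypothetical models -- substitute your real model and field names.
from django.db import models

class ProductType(models.Model):
    name = models.CharField(max_length=100)

class Item(models.Model):
    name = models.CharField(max_length=100)
    product_type = models.ForeignKey(ProductType, on_delete=models.CASCADE)

class Dispensary(models.Model):
    name = models.CharField(max_length=100)
    location = models.CharField(max_length=100)  # e.g. 'Washington'

class DispensaryItem(models.Model):
    dispensary = models.ForeignKey(Dispensary, on_delete=models.CASCADE)
    item = models.ForeignKey(Item, on_delete=models.CASCADE)

# One query: MySQL joins dispensary, item and product_type itself.
items = (DispensaryItem.objects
         .filter(dispensary__location='Washington',
                 item__product_type__name=category)
         .select_related('dispensary', 'item'))

With this shape there is no Python loop over dispensaries at all, and searching all 10 areas is just a matter of dropping the location filter or using location__in=[...].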

Related

Speed up python w/ sqlalchemy function

I have a function that populates a database table using Python and SQLAlchemy. The function is running fairly slowly right now, taking around 17 minutes. I think the main problem is that I am looping through two large sets of data to build the new table. I have included the record counts in the code below.
How can I speed this up? Should I try to convert the nested for loop into one big SQLAlchemy query? I profiled this function with PyCharm but am not sure I fully understand the results.
def populate(self):
    """Core function to populate positions."""
    # get raw annotations with tag Org
    # returns 11,659 records
    organizations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag == 'Org') \
        .filter(model.Annotation.organization_id.isnot(None)).all()
    # get raw annotations with tags Support or Oppose
    # returns 2,947 records
    annotations = model.session.query(model.Annotation) \
        .filter((model.Annotation.tag == 'Support') | (model.Annotation.tag == 'Oppose')).all()
    for org in organizations:
        for anno in annotations:
            # Org overlaps with Support or Oppose tag
            # start and end columns are integers
            if org.start >= anno.start and org.end <= anno.end:
                position = model.Position()
                # set to de-duplicated organization
                position.organization_id = org.organization_id
                position.disposition = anno.tag
                # look up bill_id from document_bill table
                document = model.session.query(model.document_bill) \
                    .filter_by(document_id=anno.document_id).first()
                position.bill_id = document.bill_id
                position.document_id = anno.document_id
                model.session.add(position)
                logging.info('org: {}, disposition: {}, bill: {}'.format(
                    position.organization_id, position.disposition, position.bill_id)
                )
                continue
    logging.info('committing to database')
    model.session.commit()
My bets, in order of descending probability:
- Autocommit is ON, so you are waiting for disk on every insert.
- The query inside the loop (document = model.session.query(model.document_bill)...) is slow; check it with EXPLAIN ANALYZE. (See the sketch below for one way to avoid running it per row.)
- Most of the time is actually spent printing logs to the terminal in the inner loop (you should profile to confirm).
- model.session.add(position) is slow (I have no idea what that does internally).
- (And this one should really be first.) Could a single SQL statement like INSERT INTO ... SELECT do this in a couple of tens of milliseconds? If so, why run a loop in the application at all?...
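As one possible direction, and not a drop-in replacement since I am guessing at your schema, here is a sketch that keeps the overlap logic in Python but removes the per-row document_bill query (one dict lookup instead) and commits once at the end:

import logging
import model  # assumed to be the same module used in the question

def populate(self):
    """Sketch: same overlap logic, one lookup dict, a single commit."""
    organizations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag == 'Org') \
        .filter(model.Annotation.organization_id.isnot(None)).all()
    annotations = model.session.query(model.Annotation) \
        .filter(model.Annotation.tag.in_(['Support', 'Oppose'])).all()

    # One query up front: map document_id -> bill_id instead of querying per row.
    bill_by_document = {
        row.document_id: row.bill_id
        for row in model.session.query(model.document_bill).all()
    }

    positions = []
    for org in organizations:
        for anno in annotations:
            if org.start >= anno.start and org.end <= anno.end:
                position = model.Position()
                position.organization_id = org.organization_id
                position.disposition = anno.tag
                position.bill_id = bill_by_document.get(anno.document_id)
                position.document_id = anno.document_id
                positions.append(position)

    model.session.add_all(positions)  # stage everything, then flush and commit once
    logging.info('committing %d positions to database', len(positions))
    model.session.commit()

A real INSERT INTO ... SELECT would push even the overlap join into the database and would likely be faster still, but the exact statement depends on your tables.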

Django ignores select_related, makes more requests to fetch related objects

I'm building a Django API for an internship application and have run into an optimization problem; my previous problem was almost fixed, but now I've run into a related one. The rest of my code and the initial problem are here:
I'm using select_related, as can be seen in my view:
@api_view(["GET"])
def bunnyList(request, vegetableType):
    """ Displays heap-sorted list of bunnies, in decreasing order.
    Takes word after list ("/list/xxx") as argument to determine
    which vegetable list to display"""
    if vegetableType in vegetablesChoices:
        vegetables = Vegetable.objects.filter(vegetableType=vegetableType).select_related('bunny')
        vegetables = list(vegetables)
        if len(vegetables) == 0:
            return Response({"No bunnies": "there is 0 bunnies with this vegetable"},
                            status=status.HTTP_204_NO_CONTENT)
        heapsort(vegetables)
        bunnies = [vegetable.bunny for vegetable in vegetables]
        serialized = BunnySerializerPartial(bunnies, many=True)
        return Response(serialized.data, status=status.HTTP_200_OK)
    else:
        raise serializers.ValidationError("No such vegetable. Available are: " + ", ".join(vegetablesChoices))
This should perform only one query, but as I can see in the Django Debug Toolbar it is making one query plus 200 others (one for each vegetable object) in the list comprehension, as if it were completely ignoring the join in the select_related query.
The performed queries are:
SELECT ••• FROM "zajaczkowskiBoardApi_vegetable" INNER JOIN "zajaczkowskiBoardApi_bunny" ON ("zajaczkowskiBoardApi_vegetable"."bunny_id" = "zajaczkowskiBoardApi_bunny"."id") WHERE "zajaczkowskiBoardApi_vegetable"."vegetableType" = '''carrots'''
And this one, slightly modified, for all the other objects:
SELECT ••• FROM "zajaczkowskiBoardApi_vegetable" WHERE "zajaczkowskiBoardApi_vegetable"."bunny_id" = '153'
Thank you for any help with resolving this!
According to your code, BunnySerializerPartial needs to fetch all related vegetables for each bunny.
So select_related alone is not enough; you should write:
vegetables = Vegetable.objects \
    .filter(vegetableType=vegetableType) \
    .select_related('bunny') \
    .prefetch_related('bunny__vegetables')
This way, a second query will be executed to fetch all vegetables related to all selected bunnies.
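For context, those extra queries typically come from the serializer touching each bunny's reverse relation. A hypothetical sketch of what BunnySerializerPartial might look like (the field and related names here are assumptions, not taken from your code):

from rest_framework import serializers

class BunnySerializerPartial(serializers.ModelSerializer):
    # Accessing each bunny's vegetables here triggers one query per bunny
    # unless those rows were already prefetched on the queryset.
    vegetables = serializers.StringRelatedField(many=True)

    class Meta:
        model = Bunny
        fields = ['id', 'name', 'vegetables']

With prefetch_related('bunny__vegetables'), Django fetches all of those related rows in one extra query and caches them on the bunny instances, so the serializer no longer hits the database per object.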

Best way to deal with giant number of combinations python

I have a bunch of Twitter data (300 million messages from 450k users) and am trying to unravel a social network through #mentions. My end goal is to have a bunch of pairs where the first item is a pair of #mentions and the second item is the number of users who mention both people. For example: [(#sam, #kim), 25]. The order of the #mentions doesn't matter, so (#sam,#kim)=(#kim,#sam).
First I create a dictionary where the key is the user ID and the value is the set of #mentions for that user:
for row in data:
    user_id = int(row[1])
    msg = str(unicode(row[0], errors='ignore'))
    if user_id not in userData:
        userData[user_id] = set([tag.lower() for tag in msg.split() if tag.startswith("#")])
    else:
        userData[user_id] |= set([tag.lower() for tag in msg.split() if tag.startswith("#")])
I then loop through the users and create a dictionary where the key is a tuple of #mentions and the value is the number of users who mention both:
for user in userData.keys():
    if len(userData[user]) < MENTION_THRESHOLD:
        continue
    for ht in itertools.combinations(userData[user], 2):
        if ht in hashtag_set:
            hashtag_set[ht] += 1
        else:
            hashtag_set[ht] = 1
This second part is taking FOREVER to run. Is there a better way to run this and/or a better way to store this data?
Instead of trying to do all this stuff in memory as you are now, I would suggest using generators to pipeline your data. Check out this slide deck from PyCon 2008 by David Beazley: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf
In particular, Part 2 has a number of examples of parsing big data that directly apply to what you want to do. By using generators, you can avoid most of the memory consumption you have now, and I would expect you to see significant performance improvements as a result.
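Separately from that answer, here is a small sketch of just the pair-counting step, reusing the userData and MENTION_THRESHOLD names from the question. collections.Counter does the dict bookkeeping, and sorting each user's tags first gives every pair a canonical order, so (#sam, #kim) and (#kim, #sam) land on the same key (with a plain set, combinations can emit the same pair in different orders for different users):

import itertools
from collections import Counter

hashtag_counts = Counter()
for tags in userData.values():
    if len(tags) < MENTION_THRESHOLD:
        continue
    # sorted() makes the pair order canonical, so (a, b) and (b, a)
    # are counted under one key instead of two
    hashtag_counts.update(itertools.combinations(sorted(tags), 2))

This does not change the fundamental cost, which is the number of combinations itself, but it removes the per-pair dict branching and fixes the ordering issue.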

Ordering objects by rating, accounting for the number of ratings

I'm trying to do something similar to the first response in this SO question: SQL ordering by rating/votes, where resources may be rated (one rating per user per resource), but when ordering the resources based on their ratings, any resources with fewer than X separate ratings will appear below those with X or more.
I'm implementing this in Django, and I'd very much prefer to avoid raw queries and keep within the Django model and query framework.
So far, this is what I have:
data = []
data_top = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'), rate_count=Count('resourcerating')).exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_top:
    data.append(d)
data_bottom = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'), rate_count=Count('resourcerating')).exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_bottom:
    data.append(d)
This all works and returns the ordering by rating that I need; however, it doesn't feel very efficient, what with running two queries and looping over the results of each.
Is there a better way to code this, either in a single query, or at least avoiding looping through each queryset?
Any help much appreciated.
from itertools import chain
main_query = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'),rate_count=Count('resourcerating'))
data_top_query = main_query.exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data_bottom_query = main_query.exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data = list(chain(data_top_query, data_bottom_query))
Using itertools.chain is faster than looping over each list and appending elements one by one.
Also, the querysets will only get evaluated when list is called on them (they don't hit the database until then).
FYI, the above will still hit the database twice when evaluated.
You're currently querying twice and iterating twice, but you can cut it down to one of each easily: just query for the items ordered by rating, then iterate like this:
data_top = []
data_bottom = []
data = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'), rate_count=Count('resourcerating')).order_by(order_by)
for d in data:
    if d.rate_count >= settings.ORB_RESOURCE_MIN_RATINGS:
        data_top.append(d)
    else:
        data_bottom.append(d)
data = data_top + data_bottom
This can also be done with the query only, by creating another aggregate column which contains the value rate_count < settings.ORB_RESOURCE_MIN_RATINGS (return 0 for values above or at the threshold, 1 for below) and sorting on (new_column, rating). Pretty sure this would require some custom SQL, but perhaps someone else knows otherwise.
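In more recent Django versions this does not actually need custom SQL: conditional expressions (Case/When, available since Django 1.8) can build exactly that extra column. A sketch along those lines, reusing the names from the question:

from django.db.models import Avg, Case, Count, IntegerField, Value, When

data = (Resource.objects
        .annotate(rating=Avg('resourcerating__rating'),
                  rate_count=Count('resourcerating'))
        # 0 for resources at or above the threshold, 1 for those below it
        .annotate(below_threshold=Case(
            When(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS, then=Value(1)),
            default=Value(0),
            output_field=IntegerField()))
        .order_by('below_threshold', order_by))

Ascending order on below_threshold puts the group with enough ratings first and the rest after it, so a single query and a single iteration replace the two querysets.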

How can I make my code more efficient?

I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]:
        return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time - tup[1]) <= tdiff:
                return True
    return False

cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1
JOIN event_log AS el2
  ON el2.tool = el1.tool
WHERE el1.message LIKE '%foo%'
  AND other selection criteria
  AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database has something fancier, maybe something fancier, but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
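As a rough sketch of how that could look with pandas (the column layout is assumed from the question's (tool, time, message) tuples, and 'foo' stands in for self.message): load the rows into a DataFrame, pick out the matching messages, then self-merge on tool and filter on the time difference:

import pandas as pd

# cdata rows are (tool, date_time, message) tuples, as in the question
df = pd.DataFrame(cdata, columns=['tool', 'date_time', 'message'])

# rows whose message contains the search string
matches = df[df['message'].str.contains('foo', regex=False)]

# pair every row with every matching row for the same tool
paired = df.merge(matches[['tool', 'date_time']], on='tool', suffixes=('', '_match'))

# keep rows within tdiff of any matching message for that tool
# (works for numeric times, or for datetimes with tdiff as a timedelta)
within = paired[(paired['date_time'] - paired['date_time_match']).abs() <= tdiff]

result = within.drop_duplicates(subset=['tool', 'date_time', 'message'])

The matched messages keep themselves (their time difference is zero), but the self-merge can get large in memory if some tools have many matching rows, so treat this as a direction rather than a guaranteed win.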
I fixed this by changing my code as follows:
First, I made messageTimes a dict of lists keyed by the tool:
messageTimes = defaultdict(list)  # a dict with sorted lists
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])
Then, in the determine function, I used bisect:
def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return (le and tup[1] - times[le-1] <= tdiff) or (ge != len(times) and times[ge] - tup[1] <= tdiff)
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)
    found[tup[0]] = le  # remember where we left off for this tool
    return (le and tup[1] - times[le-1] <= tdiff) or (le != len(times) and times[le] - tup[1] <= tdiff)
