Django query / iteration issue - Python

I have a fairly noob question regarding iteration that I can't seem to get right.
I have a table that houses a record for every monthly test a user completes; if they miss a month, there is no record in the table.
I want to pull the user's history from the table, then for each of the 12 months set a Y or N for their completion status.
Here is my code:
def getSafetyHistory(self, id):
    results = []
    safety_courses = UserMonthlySafetyCurriculums.objects.filter(users_id=id).order_by('month_assigned')
    for i in range(1, 13):
        for s in safety_courses:
            if s.month_assigned == i:
                results.append('Y')
            else:
                results.append('N')
    return results
So my ideal result would be a list with 12 entries, either Y or N
i.e. results = ['N','N','Y','N','N','Y','Y','Y','N','N','N','Y']
The query above returns 2 records for the user, which is correct, but in my iteration I keep getting 24 entries, obviously due to the outer and inner loops. I am not sure of the "pythonic" way I should be doing this without a ton of nested loops.

There are probably lots of ways to do this. Here is one idea.
It looks like you are only going to get records for courses that have been completed. So you could pre-build a list of 12 results, all set to no. Then after you query the database, you flip the ones to yes that correspond to the results you got.
results = ['N'] * 12  # prebuild results to all no
safety_courses = UserMonthlySafetyCurriculums.objects.filter(
    users_id=id).order_by('month_assigned')
for course in safety_courses:
    results[course.month_assigned - 1] = 'Y'
This assumes month_assigned is an integer between 1 and 12, as your code hints at.
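Putting it together, the whole method might look like this (a sketch, assuming month_assigned is always an integer from 1 to 12; the order_by is dropped because each record is written straight into its month's slot):
def getSafetyHistory(self, id):
    results = ['N'] * 12  # one entry per month, default to not completed
    safety_courses = UserMonthlySafetyCurriculums.objects.filter(users_id=id)
    for course in safety_courses:
        results[course.month_assigned - 1] = 'Y'  # month 1 -> index 0
    return results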

Related

How to use Python to append dictionary based on csv dataset

I am trying to write a program based on a csv dataset of the number of passengers arriving at stations around multiple cities. My code needs to output the city with the most passengers arriving, by finding the sum of arrivals across all stations in that city, and output that number.
Currently my code outputs nothing for the city with most arrivals and -1 for the number of arrivals in that city.
I'm not sure what my error is. Please help!
This is my code:
cities = {}
is_first_line = True
for row in open("Passengers_Analysis.csv"):
    if is_first_line:
        is_first_line = False
    else:
        values = row.split(",")
        city = values[3]
        if city not in cities:
            cities[city] = []
        cities[city].append(city)
passengers = {}
for key in passengers:
    passengers += int(values[6])
max_city = ""
max_passengers = -1
for key in passengers:
    if passengers[key] > max_passengers:
        max_passengers = passengers[key]
        max_city = key
print("The most popular city:", max_city)
print("The number of passengers in the scheduled period:", max_passengers)
Just from looking at your code, this is exactly the behaviour the code should produce.
In line 12 you declare an empty dictionary called "passengers".
In lines 13 and 17 you create loops which cycle over all elements in this dictionary. As the dictionary is empty, neither loop gets executed, and your max_city and max_passengers remain at their initial values.
At least the first loop is probably meant to run over your input data.
I recommend clarifying the flow of your program first and then trying to fix the loops.
And anyway, please provide a minimal, reproducible example.
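To illustrate one possible fix: a minimal sketch that accumulates the totals while reading the file (the filename and the column positions, city at index 3 and passengers at index 6, are taken straight from the question and remain assumptions about the dataset):
passengers = {}  # city -> total arrivals
is_first_line = True
for row in open("Passengers_Analysis.csv"):
    if is_first_line:
        is_first_line = False  # skip the header row
        continue
    values = row.split(",")
    city = values[3]
    # add this station's arrivals to the running total for its city
    passengers[city] = passengers.get(city, 0) + int(values[6])

max_city = max(passengers, key=passengers.get)  # city with the largest total
print("The most popular city:", max_city)
print("The number of passengers in the scheduled period:", passengers[max_city])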

Best way to deal with giant number of combinations python

I have a bunch of Twitter data (300 million messages from 450k users) and am trying to unravel a social network through #mentions. My end goal is to have a bunch of pairs where the first item is a pair of #mentions and the second item is the number of users who mention both people. For example: [(#sam, #kim), 25]. The order of the #mentions doesn't matter, so (#sam,#kim)=(#kim,#sam).
First I am creating a dictionary where the key is the user id and the value is a set of #mentions
for row in data:
    user_id = int(row[1])
    msg = str(unicode(row[0], errors='ignore'))
    if user_id not in userData:
        userData[user_id] = set([tag.lower() for tag in msg.split() if tag.startswith("#")])
    else:
        userData[user_id] |= set([tag.lower() for tag in msg.split() if tag.startswith("#")])
I then loop through the users and create a dictionary where the key is a tuple of #mentions and the value is the number of users who mention both:
for user in userData.keys():
    if len(userData[user]) < MENTION_THRESHOLD:
        continue
    for ht in itertools.combinations(userData[user], 2):
        if ht in hashtag_set:
            hashtag_set[ht] += 1
        else:
            hashtag_set[ht] = 1
This second part is taking FOREVER to run. Is there a better way to run this and/or a better way to store this data?
Instead of trying to do all this stuff in-memory as you are now, I would suggest using generators to pipeline your data. Check out this slide deck from PyCon 2008 by David Beazley: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf
In particular, Part 2 has a number of examples of parsing big data that directly apply to what you want to do. By using generators, you can avoid most of the memory consumption you have now, and I would expect you to see significant performance improvements as a result.
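To make the pipeline idea concrete, here is a hedged sketch (pair_stream is a made-up helper name; userData and MENTION_THRESHOLD are the names from the question) that yields pairs one user at a time and lets collections.Counter do the tallying:
from collections import Counter
from itertools import combinations

def pair_stream(user_data, threshold):
    # Yield every unordered tag pair, one user at a time, so no
    # intermediate list of combinations is ever materialised.
    for tags in user_data.values():
        if len(tags) < threshold:
            continue
        # sorting normalises the pair so (#kim, #sam) == (#sam, #kim)
        for pair in combinations(sorted(tags), 2):
            yield pair

hashtag_set = Counter(pair_stream(userData, MENTION_THRESHOLD))
This doesn't reduce the combinatorial work, which is inherent to the problem, but it keeps memory flat and normalises pair order, which the set-based original didn't guarantee.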

Ordering objects by rating, accounting for the number of ratings

I'm trying to do something similar to the first response in this SO question: SQL ordering by rating/votes, where resources may be rated (one rating per user per resource), but when ordering the resources based on their ratings, any resources with fewer than X separate ratings will appear below those with X or more.
I'm implementing this in Django and I'd very much prefer to avoid the use of raw query and keep within the Django model and query framework.
So far, this is what I have:
data = []
data_top = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_top:
    data.append(d)
data_bottom = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_bottom:
    data.append(d)
This all functions and returns the ordering by rating as I need; however, it doesn't feel very efficient, what with running 2 queries and looping over the results of each.
Is there a better way I can code this, either in a single query, or at least avoiding looping though each query set?
Any help much appreciated.
from itertools import chain
from django.db.models import Avg, Count

main_query = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
)
data_top_query = main_query.exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data_bottom_query = main_query.exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data = list(chain(data_top_query, data_bottom_query))
Using itertools.chain is faster than looping over each queryset and appending elements one by one.
Also, the querysets will only get evaluated when list is called on them (as they don't hit the database till then).
FYI, the above will hit the db twice when evaluated.
You're currently querying twice and iterating twice, but you can cut it down to one and one easily: just query for the items ordered by rating, then iterate like this:
data_top = []
data_bottom = []
data = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).order_by(order_by)
for d in data:
    if d.rate_count >= settings.ORB_RESOURCE_MIN_RATINGS:
        data_top.append(d)
    else:
        data_bottom.append(d)
data = data_top + data_bottom
This can also be done with the query only, by creating another aggregate column which contains the value rate_count < settings.ORB_RESOURCE_MIN_RATINGS (return 0 for values above or at the threshold, 1 for below) and sorting on (new_column, rating). Pretty sure this would require some custom SQL, but perhaps someone else knows otherwise.
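For reference, newer Django versions (1.8 and up) can build that extra column without raw SQL, using conditional expressions; a sketch under that assumption:
from django.db.models import Avg, Case, Count, IntegerField, Value, When

data = (Resource.objects
    .annotate(rating=Avg('resourcerating__rating'),
              rate_count=Count('resourcerating'))
    # 1 for resources below the ratings threshold, 0 otherwise
    .annotate(below_threshold=Case(
        When(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS, then=Value(1)),
        default=Value(0),
        output_field=IntegerField()))
    # under-threshold resources sort after the rest, in a single query
    .order_by('below_threshold', order_by))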

Separating lists in a list through iteration

First off, this is a homework assignment, but I've been working on it for a week now and haven't made much headway. My goal for this function is to take a list of lists (each list contains data about a football player) and separate the lists based off of the teams which the players belong to. I also want to add up each player's data so that I wind up with one list for each team with all the player's stats combined.
Here's the code I have so far. The problem I'm currently running into is that some teams are printed multiple times with different data each time. Otherwise it appears to be working correctly. Also, we have the limitation imposed on us that we are not allowed to use classes.
def TopRushingTeam2010(team_info_2010):
    # running into trouble calculating the rusher rating for each team; it also
    # prints out the same team multiple times but with different stats, and just
    # not getting the right numbers and order.
    total_yards = 0
    total_TD = 0
    total_rush = 0
    total_fum = 0
    # works mostly, but is returning some teams twice, with different stats each
    # time, which should not be happening. so... yeah maybe fix that?
    for item in team_info_2010:
        team = item[0]
        total_yards = item[2]
        total_TD = item[3]
        total_rush = item[1]
        total_fum = item[4]
        new_team_info_2010.append([team, total_yards, total_TD, total_rush, total_fum])
        for other_item in team_info_2010:
            if other_item[0] == team:
                new_team_info_2010.remove([team, total_yards, total_TD, total_rush, total_fum])
                total_yards = total_yards + other_item[2]
                total_TD = total_TD + other_item[3]
                total_rush = total_rush + other_item[1]
                total_fum = total_fum + other_item[4]
                new_team_info_2010.append([team, total_yards, total_TD, total_rush, total_fum])
Any help or tips as to which direction I should head, or whether I'm even headed in the right direction, would be appreciated.
One possible problem is that you are removing from team_info_2010 while you are iterating through the list. Try deleting that line of code. I don't see a clear reason why you would want to delete from team_info_2010 and behavior is often undefined when you modify an object while iterating through it. More specifically, try deleting the following line of code:
team_info_2010.remove(item)
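In that direction, here is a hedged sketch without classes: accumulate each team's totals in a dict keyed by team name, then flatten back to a list of lists (the column order, team/rushes/yards/TDs/fumbles, is inferred from the question's indexing and is an assumption):
def TopRushingTeam2010(team_info_2010):
    totals = {}  # team name -> [yards, TDs, rushes, fumbles]
    for item in team_info_2010:
        team = item[0]
        if team not in totals:
            totals[team] = [0, 0, 0, 0]
        totals[team][0] += item[2]  # yards
        totals[team][1] += item[3]  # touchdowns
        totals[team][2] += item[1]  # rushes
        totals[team][3] += item[4]  # fumbles
    # one combined entry per team, in the same shape as the original lists
    return [[team] + stats for team, stats in totals.items()]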

How can I make my code more efficient?

I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]:
        return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time - tup[1]) <= tdiff:
                return True
    return False

cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1 JOIN event_log AS el2
WHERE el1.message LIKE '%foo%'
AND other selection criteria
AND el2.tool = el1.tool
AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database has something fancier, maybe something fancier, but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
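As an illustration of that last step, a hedged sketch of running EXPLAIN from Python (MySQLdb and the connection details are assumptions; any DB-API driver works the same way, and the JOIN condition is written as an ON clause here):
import MySQLdb  # assumed driver; substitute your own

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret", db="logs")
cur = conn.cursor()
cur.execute("""
    EXPLAIN
    SELECT el2.date_time, el2.message
    FROM event_log AS el1 JOIN event_log AS el2
      ON el2.tool = el1.tool
    WHERE el1.message LIKE %s
      AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= %s
""", ("%foo%", 60))
for plan_row in cur.fetchall():
    # 'ALL' in the type column means a full table scan; 'ref' or 'range'
    # mean the engine is using an index
    print(plan_row)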
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
I fixed this by changing my code as follows:
First I made messageTimes a dict of lists keyed by the tool:
from collections import defaultdict

messageTimes = defaultdict(list)  # a dict with sorted lists
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])
Then in the determine function I used bisect:
import bisect

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return ((le and tup[1] - times[le - 1] <= tdiff) or
            (ge != len(times) and times[ge] - tup[1] <= tdiff))
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)
    found[tup[0]] = le  # cache the index so the next search for this tool starts here
    return ((le and tup[1] - times[le - 1] <= tdiff) or
            (le != len(times) and times[le] - tup[1] <= tdiff))
