Best way to deal with a giant number of combinations in Python

I have a bunch of Twitter data (300 million messages from 450k users) and am trying to unravel a social network through #mentions. My end goal is to have a bunch of pairs where the first item is a pair of #mentions and the second item is the number of users who mention both people. For example: [(#sam, #kim), 25]. The order of the #mentions doesn't matter, so (#sam,#kim)=(#kim,#sam).
First I am creating a dictionary where the key is the user id and the value is a set of #mentions
for row in data:
    user_id = int(row[1])
    msg = str(unicode(row[0], errors='ignore'))
    if user_id not in userData:
        userData[user_id] = set([tag.lower() for tag in msg.split() if tag.startswith("#")])
    else:
        userData[user_id] |= set([tag.lower() for tag in msg.split() if tag.startswith("#")])
I then loop through the users and create a dictionary where the key is a tuple of #mentions and the value is the number of users who mention both:
for user in userData.keys():
    if len(userData[user]) < MENTION_THRESHOLD:
        continue
    for ht in itertools.combinations(userData[user], 2):
        if ht in hashtag_set:
            hashtag_set[ht] += 1
        else:
            hashtag_set[ht] = 1
This second part is taking FOREVER to run. Is there a better way to run this and/or a better way to store this data?

Instead of trying to do all this stuff in memory as you are now, I would suggest using generators to pipeline your data. Check out this slide deck from PyCon 2008 by David Beazley: http://www.dabeaz.com/generators-uk/GeneratorsUK.pdf
In particular, Part 2 has a number of examples of parsing big data that directly apply to what you want to do. By using generators, you can avoid most of the memory consumption you have now, and I would expect you to see significant performance improvements as a result.
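Untested against your data, but here is a rough sketch of how that pipeline might look, with the counting stage switched to collections.Counter and each pair sorted so that (#sam, #kim) and (#kim, #sam) land on the same key (this assumes Python 2, since your code calls unicode):
import itertools
from collections import Counter

def tag_sets(rows):
    # build the per-user sets exactly as you do now, then hand them out one at a time
    userData = {}
    for row in rows:
        user_id = int(row[1])
        msg = str(unicode(row[0], errors='ignore'))
        userData.setdefault(user_id, set()).update(
            tag.lower() for tag in msg.split() if tag.startswith("#"))
    for tags in userData.itervalues():  # .values() on Python 3
        if len(tags) >= MENTION_THRESHOLD:
            yield tags

def pair_counts(per_user_tags):
    # Counter handles the "create or increment" branch for you
    counts = Counter()
    for tags in per_user_tags:
        counts.update(tuple(sorted(pair))
                      for pair in itertools.combinations(tags, 2))
    return counts

hashtag_set = pair_counts(tag_sets(data))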

Related

Fixing a meeting room function schedule with double and triple bookings to determine space usage

I need to calculate the total amount of time each group uses a meeting space. But the data set has double and triple booking, so I think I need to fix the data first. Disclosure: My coding experience consists solely of working through a few Dataquest courses, and this is my first stackoverflow posting, so I apologize for errors and transgressions.
Each line of the data set contains the group ID and a start and end time. It also includes the booking type, i.e. reserved, meeting, etc. Generally, the staff reserve a space for the entire period, which creates a single line, and then add multiple lines for each individual function when the details are known. They should segment the original reserved line so it only holds space in between functions, but instead they double-book the space, so I need to add multiple lines for these interim RES holds, based on the actual holds.
Here's what the data basically looks like:
Existing data:
functions = [['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
             [1, 1, 'RES', '2019/10/04 07:00', '2019/10/06 17:00'],
             [2, 1, 'MTG', '2019/10/05 09:00', '2019/10/05 12:00'],
             [3, 1, 'LUN', '2019/10/05 12:30', '2019/10/05 13:30'],
             [4, 1, 'MTG', '2019/10/05 14:00', '2019/10/05 17:00'],
             [5, 1, 'MTG', '2019/10/06 09:00', '2019/10/06 12:00']]
I've tried to iterate using a for loop:
for index, row in enumerate(functions):
    last_row_index = len(functions) - 1
    if index == last_row_index:
        pass
    else:
        current_index = index
        next_index = index + 1
        if row[3] <= functions[next_index][2]:
            next
        elif row[4] == 'RES' or row[6] < functions[next_index][6]:
            copied_current_row = row.copy()
            row[3] = functions[next_index][2]
            copied_current_row[2] = functions[next_index][3]
            functions.append(copied_current_row)
There seems to be a logical problem in here, because that last append line seems to put the program into some kind of loop and I have to manually interrupt it. So I'm sure it's obvious to someone experienced, but I'm pretty new.
The reason I've done the comparison to see if a function is RES is that reserved should be subordinate to actual functions. But sometimes there are overlaps between actual functions, so I'll need to create another comparison to decide which one takes precedence, but this is where I'm starting.
How I (think) I want it to end up:
[['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
 [1, 1, 'RES', '2019/10/04 07:00', '2019/10/05 09:00'],
 [2, 1, 'MTG', '2019/10/05 09:00', '2019/10/05 12:00'],
 [1, 1, 'RES', '2019/10/05 12:00', '2019/10/05 12:30'],
 [3, 1, 'LUN', '2019/10/05 12:30', '2019/10/05 13:30'],
 [1, 1, 'RES', '2019/10/05 13:30', '2019/10/05 14:00'],
 [4, 1, 'MTG', '2019/10/05 14:00', '2019/10/05 17:00'],
 [1, 1, 'RES', '2019/10/05 17:00', '2019/10/06 09:00'],
 [5, 1, 'MTG', '2019/10/06 09:00', '2019/10/06 12:00'],
 [1, 1, 'RES', '2019/10/06 12:00', '2019/10/06 17:00']]
This way, I could do a simple calculation of elapsed time for each function line and add it up to see how much time they had the space booked for.
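Something like this (untested) is what I have in mind for that last step, assuming the rows keep the layout shown above and the timestamps stay as 'YYYY/MM/DD HH:MM' strings:
from collections import defaultdict
from datetime import datetime, timedelta

FMT = '%Y/%m/%d %H:%M'  # matches the 2019/10/04 07:00 style above

def total_usage(rows):
    # sum the elapsed time of every function line, grouped by group ID
    totals = defaultdict(timedelta)
    for func_id, group, func_type, start, end in rows[1:]:  # skip the header row
        totals[group] += datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return dict(totals)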
What I'm looking for here is just some direction I should pursue, and I'm definitely not expecting anyone to do the work for me. For example, am I on the right path here, or would it be better to use pandas and vectorized functions? If I can get the basic direction right, I think I can muddle through the specifics.
Thank-you very much,
AF

Compiling a dictionary by pulling data from other dictionaries

I am doing a project in which I extract data from three different data sets and combine it to look at campaign contributions. To do this I turned the relevant data from two of the sets into dictionaries (canDict and otherDict) with ID numbers as keys and the information I need (party affiliation) as values. Then I wrote a program to pull party information based on the key (my third set included these ID numbers as well) and match them with the employer of the donating party, and the amount donated. That was a long-winded explanation, but I thought it would help with understanding this chunk of code.
My problem is that, for some reason, my third dictionary (employerDict) won't compile. By the end of this step I should have a dictionary containing employers as keys, and a list of tuples as values, but after running it, the dictionary remains blank. I've been over this line by line a dozen times and I'm pulling my hair out - I can't for the life of me think why it won't work, which is making it hard to search for answers. I've commented almost every line to try to make it easier to understand out of context. Can anyone spot my mistake?
Update: I added a counter, n, to the outermost for loop to see if the program was iterating at all.
Update 2: I added another if statement in the creation of the variable party, in case the ID at data[0] did not exist in canDict or in otherDict. I also added some already suggested fixes from the comments.
n = 0
with open(path3) as f:  # path3 is a txt file
    for line in f:
        n += 1
        if n % 10000 == 0:
            print(n)
        data = line.split("|")  # Splitting each line into its entries (delimited by the symbol |)
        party = canDict.get(data[0])  # data[0] is an ID number. canDict and otherDict contain these IDs as keys with party affiliations as values
        if party is None:
            party = otherDict[data[0]]  # If there is no matching ID number in canDict, search otherDict
            if party is None:
                party = 'Other'
            else:
                print('ERROR: party is None')
        x = (party, int(data[14]))  # Creating a tuple of the party (found through the loop) and an integer amount from the file path3
        employer = data[11]  # Index 11 in path3 is the employer of the person
        if employer != '':
            value = employerDict.get(employer)  # If the employer field is not blank, see if this employer is already a key in employerDict
            if value is None:
                employerDict[employer] = [x]  # If the key does not exist, create it and add a list including the tuple x as its value
            else:
                employerDict[employer].append(x)  # If it does exist, add the tuple x to the existing value
        else:
            print("ERROR: employer == ''")
Thanks for all the input, everyone; however, it looks like it's a problem with my data file, not a problem with the program. Dangit.
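As an aside, the employer-accumulation step itself can be written a little more simply with collections.defaultdict, which removes the get/None branching; a rough sketch along the lines of the loop above (same field indices, party lookup collapsed for illustration):
from collections import defaultdict

employerDict = defaultdict(list)  # missing keys start as an empty list automatically
with open(path3) as f:
    for line in f:
        data = line.split("|")
        party = canDict.get(data[0])
        if party is None:
            party = otherDict.get(data[0], 'Other')
        employer = data[11]
        if employer:
            employerDict[employer].append((party, int(data[14])))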

Ordering objects by rating, accounting for the number of ratings

I'm trying to do something similar to the first response in this SO question: SQL ordering by rating/votes, where resources may be rated (one rating per user per resource), but when ordering the resources based on their ratings, any resources with fewer than X separate ratings will appear below those with X or more.
I'm implementing this in Django and I'd very much prefer to avoid the use of raw query and keep within the Django model and query framework.
So far, this is what I have:
data = []
data_top = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'),rate_count=Count('resourcerating')).exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_top:
    data.append(d)
data_bottom = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'),rate_count=Count('resourcerating')).exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_bottom:
    data.append(d)
This all works and returns the ordering by rating that I need; however, it doesn't feel very efficient, what with running two queries and looping over the results of each.
Is there a better way I can code this, either in a single query, or at least avoiding looping though each query set?
Any help much appreciated.
from itertools import chain
main_query = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'),rate_count=Count('resourcerating'))
data_top_query = main_query.exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data_bottom_query = main_query.exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data = list(chain(data_top_query, data_bottom_query))
Using itertools.chain is faster than looping each list and appending elements one by one
Also, the querysets will get evaluated when list is called on them (as they don't hit the database till then)
FYI, the above will hit the db twice when evaluated.
You're currently querying twice and iterating twice, but you can cut it down to one query and one pass easily: just query for the items ordered by rating, then iterate like this:
data_top = []
data_bottom = []
data = Resource.objects.all().annotate(rating=Avg('resourcerating__rating'),rate_count=Count('resourcerating')).order_by(order_by)
for d in data:
    if d.rate_count >= settings.ORB_RESOURCE_MIN_RATINGS:
        data_top.append(d)
    else:
        data_bottom.append(d)
data = data_top + data_bottom
This can also be done with the query only, by creating another aggregate column which contains the value rate_count < settings.ORB_RESOURCE_MIN_RATINGS (return 0 for values above or at the threshold, 1 for below) and sorting on (new_column, rating). Pretty sure this would require some custom SQL, but perhaps someone else knows otherwise.
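If you're on Django 1.8 or newer, conditional expressions (Case/When) can build that flag column without dropping to raw SQL; a rough sketch, assuming the same annotations, settings, and order_by as above:
from django.db.models import Avg, Case, Count, IntegerField, Value, When

data = (Resource.objects
        .annotate(rating=Avg('resourcerating__rating'),
                  rate_count=Count('resourcerating'))
        .annotate(needs_more_ratings=Case(
            When(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS, then=Value(1)),
            default=Value(0),
            output_field=IntegerField()))
        # below-threshold resources (flag = 1) sort after the rest, then by rating
        .order_by('needs_more_ratings', order_by))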

Optimizing Checking for Words in a Dict and List

I have the following code
for key, value in jobs.items():
    job = key
    jobVector[key] = []
    for x in range(0, len(listOfWords)):
        if listOfWords[x] in jobs[job]:
            jobVector[key].append(1)
        else:
            jobVector[key].append(0)
I have a dict, JOBS, which has various words stored and a count for each. The count is irrelevant in this case, but let's say jobs looks like this for one of the keys:
jobs[1] = account, addit, allow, ascertain, associ, avail, career, cellular, chang, coasttocoast, commiss, compani, competit, comput, countri, coupl, credit, custom, demand, develop, driven, dynam, employ, enjoi, ethic, exist, expand, experienc, fastest, flexibl, greet, growth, highperform, independ, individu, internet, knowledg, maintain, market, monitor, opportun, order, outstand, payment, person, phone, place, price, privatelyown, process, product, profession, provid, purchas, pursu, receiv, recommend, repres, resolv, respons, retail, right, selfmotiv, specif, store, support, technolog, territori, thatll, throughout, total, train, uniqu, unpreced, wireless, account, addit, aptitud, avail, bartend, benefit, bestbui, bilingu, cellular, colleg, commiss, commun, comput, consult, cross, custom, dedic, deduct, dental, direct, disabl, discount, effect, enterpris, entir, entrepreneuri, excel, execut, extend, famili, fleet, flexibl, goalori, health, impress, individu, insid, insur, integr, interperson, keyword, liter, longterm, medic, member, negoti, offer, outsid, packag, period, person, pleas, possess, possibl, pound, prefer, prescript, proud, provid, recogn, rentacar, repres, respons, retail, retir, salesman, salesperson, saleswoman, satisfi, shield, shortterm, spanish, spend, spirit, sprint, stand, technic, therefor, tmobil, vehicl, verbal, visit, websit, wireless, wwwjoincellularsalescom
Let's say listOfWords is like this:
listOfWords = associ, avail, career, cellular, chang, coasttocoast, commiss, compani, competit, comput, countri, coupl, credit, custom, demand, develop, driven, dynam, employ, enjoi, ethic
I pretty much want to go through each word in the listOfWords and see if it exists in the individual job for each job in the JOBS dict. If it exists, store a 1, else store a 0 into another dictionary.
Is there any way to speed this up? It currently works, but takes about 3 minutes on a dataset of 15000 jobs.
First, you can speed things up a bit by replacing all those lists of jobs with sets of jobs. The code you've shown us then won't have to change at all, it'll just magically get faster, because an in test for a set is nearly instant, while an in test for a list has to check every value in the list.
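For example, assuming each value in jobs is currently a list of words, a one-time conversion like this is all it takes (provided you don't need the original lists afterward):
# convert once up front; every later `in` test becomes a hash lookup instead of a scan
jobs = {key: set(value) for key, value in jobs.items()}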
You can also get some small speedups—and a big readability gain—by replacing the range loop with a direct loop, using value instead of re-looking it up, and turning the whole loop into a comprehension:
for key, value in jobs.items():
    jobVector[key] = [1 if word in value else 0 for word in listOfWords]
Or even:
jobVector = {
    key: [1 if word in value else 0 for word in listOfWords]
    for key, value in jobs.items()
}
Also, if this is for Python 2.x, use viewitems (if you don't need 2.6 or earlier) or iteritems (if you do) instead of items.
But really, beyond using a set in place of a list, I suspect there's a bigger problem with your data structures. Without knowing what you're trying to use these things for, it's hard to be sure, but I suspect you could make things both clearer and faster by using another dictionary, keyed off the individual jobs, so you can look them up instantly instead of exhaustively searching.
If each individual job can belong to only one job (your terminology here is really confusing, by the way…), this is just a dict mapping each individual job to its parent:
d = {ijob: job for job, ijobs in jobs.items() for ijob in ijobs}
If each individual job can belong to multiple jobs, you need to map each to the set of jobs it belongs to:
import collections

d = collections.defaultdict(set)
for job, ijobs in jobs.items():
    for ijob in ijobs:
        d[ijob].add(job)
Then it seems like you don't really even need jobVector for anything, because it'll be as fast to look up its elements on the fly as to use the values you're precomputing.
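For instance, with the inverted mapping d above (the multi-job defaultdict version), a membership test for any word is just a set lookup, so a vector can be produced on demand; a rough sketch:
def vector_for(job):
    # 1 if this job contains the word, else 0, looked up via the inverted index
    return [1 if job in d.get(word, ()) else 0 for word in listOfWords]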

How can I make my code more efficient?

I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]:
        return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time - tup[1]) <= tdiff:
                return True
    return False

cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run; for example, when cdata has 600,000 elements (which is typical for my app), it takes 2 hours to run.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1 JOIN event_log AS el2
WHERE el1.message LIKE '%foo%'
AND other selection criteria
AND el2.tool = el1.tool
AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database offers one, something fancier like a full-text index, but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
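A rough sketch of that idea, assuming cdata has the four columns (tool, time, module, message) used later in this thread, 'foo' stands in for self.message, and tdiff is a number of seconds:
import pandas as pd

df = pd.DataFrame(cdata, columns=['tool', 'date_time', 'module', 'message'])
df['date_time'] = pd.to_datetime(df['date_time'])

# rows whose message matches
is_match = df['message'].str.contains('foo', regex=False)
keep = is_match.copy()

# also keep any row within tdiff seconds of some matching row for the same tool
for tool, times in df[is_match].groupby('tool')['date_time']:
    in_tool = df['tool'] == tool
    nearest = df.loc[in_tool, 'date_time'].apply(lambda t: (times - t).abs().min())
    keep.loc[in_tool] = keep.loc[in_tool] | (nearest <= pd.Timedelta(seconds=tdiff))

result = df[keep]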
I fixed this by changing my code as follows:
First, I made messageTimes a dict of lists keyed by the tool:
from collections import defaultdict

messageTimes = defaultdict(list)  # a dict with sorted lists
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])
Then, in the determine function, I used bisect:
import bisect

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return ((le and tup[1] - times[le-1] <= tdiff) or
            (ge != len(times) and times[ge] - tup[1] <= tdiff))
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)
    idx = le
    found[tup[0]] = idx  # persist the position so the next lookup for this tool starts here (assumes cdata is in time order per tool)
    return ((le and tup[1] - times[le-1] <= tdiff) or
            (le != len(times) and times[le] - tup[1] <= tdiff))
