I am working on a project that needs to decide which ID is the most likely.
Let me explain using an example.
I have 3 dictionaries that contain IDs and their scores.
Ex: d1 = {74701: 3, 90883: 2}
I assign percentage scores like this:
d1_p = {74701: 60.0, 90883: 40.0}, where each percentage is (value of the key in d1) / (total sum of values) * 100.
Similarly, I have 2 other dictionaries:
d2 = {90883: 2, 74701: 2} , d2_p = {90883.0: 50.0, 74701.0: 50.0}
d3 = {75853: 2}, d3_p = {75853: 100.0}
My task is to compute a composite score for each ID from the above 3 dictionaries and decide a winner by taking the highest score. How would I mathematically assign a composite score between 0 and 100 for each of these IDs?
Ex: in the above case, 74701 needs to be the clear winner.
I tried averaging, but that fails, because I need to give more preference to IDs that occur in multiple dictionaries.
Ex: let's say 74701 had the majority in d1 and d2 with percentages 30 and 40. Its average would be (30+40+0)/3 = 23.33, while 75853, which occurs only once with 100%, would get (100+0+0)/3 = 33.33 and be declared the winner, which is wrong.
Hence, can someone suggest a good mathematical way (in Python, maybe with code) to compute such a score and decide the majority?
Instead of trying to create a global score from different dictionaries, since your main goal is to analyze frequency, I would suggest summarizing all the data into a single dictionary, which is less error prone in general. Say I have 3 dictionaries:
a = {1: 2, 2: 3}
b = {2: 4, 3: 5}
c = {3: 4, 4: 9}
You could summarize these three dictionaries into one by summing the values for each key:
result = {1: 2, 2: 7, 3: 9, 4: 9}
That could be easily achieved by using Counter:
from collections import Counter
result = Counter(a)
result.update(Counter(b))
result.update(Counter(c))
result = dict(result)
Which would yield the desired summary. If you want different weights for each dictionary that could also be done in a similar fashion, the takeaway is that you should not be trying to obtain information from the dictionaries as separate entities, but instead merge them together into one statistic.
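For instance, a weighted merge could look like this; the per-dictionary weights below are made-up values purely for illustration:

```python
from collections import Counter

a = {1: 2, 2: 3}
b = {2: 4, 3: 5}
c = {3: 4, 4: 9}

# Hypothetical weights -- pick whatever reflects how much
# you trust each source dictionary.
weighted = Counter()
for d, w in [(a, 1.0), (b, 0.5), (c, 2.0)]:
    for key, value in d.items():
        weighted[key] += w * value

print(dict(weighted))  # {1: 2.0, 2: 5.0, 3: 10.5, 4: 18.0}
```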
Think of the data in a tabular way: for each game/match/whatever, each ID gets
a certain number of points. If you care the most about overall point total for
the entire sequences of games (the entire "season", so to speak), then add up
the points to determine a winner (and then scale everything down/up to 0 to
100).
        74701   90883   75853
      -----------------------
1         3       2       0
2         2       2       0
3         0       0       2
Total     5       4       2
Alternatively, we can express those same scores in percentage terms per game.
Again, every ID must be given a value. In this case, we need to average the
percentages -- all of them, including the zeros:
        74701   90883   75853
      -----------------------
1        .60     .40     .00
2        .50     .50     .00
3        .00     .00    1.00
Avg      .37     .30     .33
Both approaches could make sense, depending on the context. And both also
declare 74701 to be the winner, as desired. But notice that they give different
results for 2nd and 3rd place. Such differences occur because the two systems
prioritize different things. You need to decide which approach you prefer.
Either way, the first step is to organize the data better. It seems more
convenient to have all scores or percentages for each ID, so you can do the
needed math with them: that sounds like a dict mapping IDs to lists of scores
or percentages.
# Put the data into one collection.
d1 = {74701: 3, 90883: 2}
d2 = {90883: 2, 74701: 2}
d3 = {75853: 2}
raw_scores = [d1, d2, d3]
# Find all IDs.
ids = tuple(set(i for d in raw_scores for i in d))
# Total points/scores for each ID.
points = {
    i: [d.get(i, 0) for d in raw_scores]
    for i in ids
}
# If needed, use that dict to create a similar dict for percentages. Or you
# could create a dict with the same structure holding *both* point totals and
# percentages. Just depends on the approach you pick.
pcts = {}
game_totals = [sum(d.values()) for d in raw_scores]
for i, scores in points.items():
    pcts[i] = [sc / gt for sc, gt in zip(scores, game_totals)]
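With those dicts in place, either scoring system reduces to a single max(). A sketch tying the pieces together (the percentages divide each score by that game's total, matching the percentage table above):

```python
d1 = {74701: 3, 90883: 2}
d2 = {90883: 2, 74701: 2}
d3 = {75853: 2}
raw_scores = [d1, d2, d3]
ids = set(i for d in raw_scores for i in d)
points = {i: [d.get(i, 0) for d in raw_scores] for i in ids}

# System 1: total points across all games.
totals = {i: sum(scores) for i, scores in points.items()}

# System 2: average of per-game percentages, zeros included.
game_totals = [sum(d.values()) for d in raw_scores]
pcts = {
    i: sum(sc / gt for sc, gt in zip(scores, game_totals)) / len(raw_scores)
    for i, scores in points.items()
}

print(max(totals, key=totals.get))  # 74701 wins on total points
print(max(pcts, key=pcts.get))      # 74701 wins on average percentage too
```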
I'm trying to replace duplicates in my data, and I'm looking for an efficient way to do that.
I have a df with 2 columns, idA and idB, like this:
idA idB
22 5
22 590
5 6000
This is a df with similarities.
I want to create a dictionary in which the key is the id, and the value is a list with all the devices linked to the key. Example:
d[5] = [22, 6000]
d[22] = [5, 590]
What I'm doing is the following:
ids = set(gigi_confirmed['idA'].unique()).union(set(gigi_confirmed['idB'].unique()))
dup_list = list(zip(A_confirmed, B_confirmed))
dict_dup = dict()
for j in ids:
    l1 = []
    for i in range(0, len(dup_list)):
        if j in dup_list[i]:
            l2 = list(dup_list[i])
            l2.remove(j)
            l1.append(l2[0])
    dict_dup[j] = l1
Is it possible to make it more efficiently?
I have to do some guessing here, because your question is not super clear, but the way I understand it, you want a dictionary that maps each id in idA or idB to the list of ids found on the other side from that id.
If I understood your problem correctly, I would solve it by directly constructing a dictionary mapping ids to sets of ids.
idA = [22, 22, 5]
idB = [5, 590, 6000]
dict_dup = dict()
for a, b in zip(idA, idB):
    if a not in dict_dup:
        dict_dup[a] = set()
    dict_dup[a].add(b)
    if b not in dict_dup:
        dict_dup[b] = set()
    dict_dup[b].add(a)
After this runs, print(dict_dup) outputs
{22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}}
which I think is the data structure you're looking for.
By using dicts and sets, this code is very efficient: it runs in time linear in the number of id pairs.
Shorter code with defaultdict
You can also make this code a lot shorter by using a defaultdict instead of a regular dict, which will automatically create those empty sets when needed:
from collections import defaultdict
idA = [22, 22, 5]
idB = [5, 590, 6000]
dict_dup = defaultdict(set)
for a, b in zip(idA, idB):
    dict_dup[a].add(b)
    dict_dup[b].add(a)
The print statement produces slightly different output, but it's equivalent:
defaultdict(<class 'set'>, {22: {5, 590}, 5: {6000, 22}, 590: {22}, 6000: {5}})
This still contains the info you want, and is just as efficient as the first solution.
Putting it back in your data frame
Now, if you need to put this information back in your dataframe, you can use dict_dup to efficiently retrieve what you're looking for for each row.
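For instance, a sketch of attaching the linked ids back onto each row (the column names idA/idB come from the question; the new column name linked_to_idA and the sorted() call are my own additions, the latter just to give a stable, printable order):

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({"idA": [22, 22, 5], "idB": [5, 590, 6000]})

# Build the mapping exactly as above.
dict_dup = defaultdict(set)
for a, b in zip(df["idA"], df["idB"]):
    dict_dup[a].add(b)
    dict_dup[b].add(a)

# Attach the linked ids for each row's idA as a new column.
df["linked_to_idA"] = df["idA"].map(lambda i: sorted(dict_dup[i]))
print(df)
```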
In my program there are multiple quizzes. A user takes a quiz, then the title of the quiz and the score are saved to a database. For ease with the example, I'll represent them using Python lists:
[['quizTitle1', score], ['quizTitle2', score], ['quizTitle1', score], ['quizTitle3', score]]
I’m trying to print out the title of the quiz that a user is weakest on.
So, using the Python list example, you see that the user has taken quiz 1 two times. On their second go they may have got a better score than the first. So I need to get the highest score the user has achieved on each quiz (their best score), then find which quiz has the lowest best score.
My current plan is like this (pseudo code)
while found = false
    1st = the first score selected that we are comparing with each other score
    2nd = the score we are comparing to the first
    for loop that repeats in the range of the number of lists
        if (2nd < 1st) or (2nd has the same title and greater mark than 1st):
            2nd becomes 1st
            loop repeats
        else:
            new 2nd is the next list
            found = true
But what is the best way to do this?
You could use a dictionary to store a value for each quiz, updating it with the maximum seen so far in your list, then take the minimum of all values in the dictionary.
scores = [['q1', 20],['q2',30],['q1',40],['q2',10],['q2',45],['q1',10]]
d = {}
for s in scores:
    d[s[0]] = s[1] if s[0] not in d else max(d[s[0]], s[1])
print(d)
print("Lowest best : ", min(d.values()))
This prints:
{'q1': 40, 'q2': 45}
Lowest best : 40
Well, if you are open to pandas, then:
import pandas as pd
l = [["quizTitle1", 15],
     ["quizTitle2", 25],
     ["quizTitle1", 11],
     ["quizTitle3", 84],
     ["quizTitle2", 24]]
df = pd.DataFrame(l, columns=["quiz", "score"])
print(df)
# quiz score
# 0 quizTitle1 15
# 1 quizTitle2 25
# 2 quizTitle1 11
# 3 quizTitle3 84
# 4 quizTitle2 24
lowest_score = df.iloc[df.groupby(['quiz']).max().reset_index()["score"].idxmin()]
print(lowest_score)
# quiz quizTitle1
# score 15
# Name: 0, dtype: object
A simple one:
scores = [['q1', 20],['q2',30],['q1',40],['q2',10],['q2',45],['q1',10]]
d = dict(sorted(scores))
print(min(d, key=d.get)) # prints q1
The dict function takes key/value pairs; we just need to sort them first so that each key's last value is its largest (because the last one is what ends up in the dict). After that, the desired result is simply the key with the smallest value.
A map-reduce approach:
from itertools import groupby
from operator import itemgetter
scores = [['q1', 20],['q2',30],['q1',40],['q2',10],['q2',45],['q1',10]]
name, score = itemgetter(0), itemgetter(1)
grouped_scores = groupby(sorted(scores), key=name) # group by key
highest_scores = (max(g, key=score) for _,g in grouped_scores) # reduce by key
lowest_highest = min(highest_scores, key=score) # reduce
print(lowest_highest)
Output:
['q1', 40]
Explanation
The functions used are:
sorted, to sort results by quiz name;
itertools.groupby, which groups results by quiz, assuming they are already sorted by quiz;
a generator expression, to apply a function to every element of a sequence: here we apply max to every group;
max and min, my two "reduce" functions.
The return values of groupby and the generator expression are not lists and if you try to print them directly, you'll see a bunch of unhelpful <itertools._grouper object at 0x7ff18bbbb850>. But converting every non-printable object to a list using list(), the intermediate values computed are these:
scores = [['q1', 20],['q2',30],['q1',40],['q2',10],['q2',45],['q1',10]]
grouped_scores = [
['q1', [['q1', 10], ['q1', 20], ['q1', 40]]],
['q2', [['q2', 10], ['q2', 30], ['q2', 45]]]
]
highest_scores = [['q1', 40], ['q2', 45]]
lowest_highest = ['q1', 40]
Python's map and reduce
Two functions which can often be useful in a map-reduce algorithm:
map, instead of the generator expression, to apply a function to every element of a list;
functools.reduce, to repeatedly apply a binary function to the elements of a list, two by two, replacing each pair with its result, until only one element is left.
In this case, we are looking for the lowest of the highest scores, so when comparing two elements we would like to keep the min of the two. But instead of applying the min() function repeatedly with reduce, in python we can call min() directly on the whole sequence.
Just for reference, here is what the code would look like if we had used reduce:
from itertools import groupby
from functools import reduce
from operator import itemgetter
scores = [['q1', 20],['q2',30],['q1',40],['q2',10],['q2',45],['q1',10]]
name, score = itemgetter(0), itemgetter(1)
grouped_scores = groupby(sorted(scores), key=name) # group by key
highest_scores = map(lambda x: max(x[1], key=score), grouped_scores) # reduce by key
lowest_highest = reduce(lambda x,y: min(x,y, key=score), highest_scores) # reduce
print(lowest_highest)
Output:
['q1', 40]
Using module more_itertools
Module more_itertools has a function called map_reduce which groups by key, then reduces by key. This takes care of our groupby and max steps; we only need to reduce with min and we have our result.
from more_itertools import map_reduce
from operator import itemgetter
scores = [['q1', 20],['q2',30],['q1',40],['q2',10],['q2',45],['q1',10]]
name, score = itemgetter(0), itemgetter(1)
highest_scores = map_reduce(scores, keyfunc=name, valuefunc=score, reducefunc=max)
lowest_highest = min(highest_scores.items(), key=score)
print(lowest_highest)
# ('q1', 40)
Here is a version using defaultdict, from the built-in collections module. In this case, the value of a key we haven't seen before is an empty list; we don't need to check first, we just append.
from collections import defaultdict
quizzes = defaultdict(list)
scores = [['q1', 20],['q2',30],['q1',40],['q2',10],['q2',45],['q1',10]]
# populate the dictionary of results
for score in scores:
    quiznum = score[0]
    result = score[1]
    quizzes[quiznum].append(result)  # new key? we append to an empty list
# find the best (highest) score for each quiz
best = {quiznum: max(scores)
        for quiznum, scores in quizzes.items()}
# {'q1': 40, 'q2': 45}
min(best, key=best.get)
# 'q1'
The defaultdict keeps all of the scores, which is not necessary for the posted question. But it will let you determine number of attempts, high score, etc.
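For example, once the full score lists are retained, attempts, best scores and the weakest quiz all fall out in a line each:

```python
from collections import defaultdict

quizzes = defaultdict(list)
scores = [['q1', 20], ['q2', 30], ['q1', 40], ['q2', 10], ['q2', 45], ['q1', 10]]
for quiznum, result in scores:
    quizzes[quiznum].append(result)

attempts = {q: len(s) for q, s in quizzes.items()}  # {'q1': 3, 'q2': 3}
best = {q: max(s) for q, s in quizzes.items()}      # {'q1': 40, 'q2': 45}
weakest = min(best, key=best.get)                   # 'q1'
print(attempts, best, weakest)
```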
A simple approach using built-in functions:
lst = [['quizTitle1', 6], ['quizTitle2', 5], ['quizTitle1', 9], ['quizTitle3', 7]]
sorted_list = sorted(lst, key=lambda x: x[1])
print(f'1st quiz: {sorted_list[-1][0]} | score: {sorted_list[-1][1]}')
print(f'last on quiz: {sorted_list[0][0]} | score: {sorted_list[0][1]}')
Basically you sort the list and then take the last entry, which has the highest score, and the first entry, which has the lowest. Note, however, that this compares individual attempts rather than each quiz's best score, so it does not fully solve the posted problem.
I have a dictionary (studentPerf) which has all of the students in a school, with tuples as keys. I want to count the number of male students and the number of female students in the school, and use this to update the values in a second dictionary. The second dictionary (dictDemGender) has 2 keys, male and female, and 0s as the values. How can I change the 0s in dictDemGender to reflect the number of males and females in the school?
Could I do this with dictionary comprehension?
I've given the first few entries to studentPerf:
studentPerf = {('Jeffery','male','junior'): [0.81,0.75,0.74,0.8],
               ('Able','male','senior'): [0.87,0.79,0.81,0.81],
               ('Don','male','junior'): [0.82,0.77,0.8,0.8],
               ('Will','male','senior'): [0.86,0.78,0.77,0.78],
               ('John','male','junior'): [0.74,0.81,0.87,0.73]}
#Creates a dictionary with genders as keys and 0s as the values to fill later
dictDemGender = {k:0 for k in genders}
dictDemGender = ?
I did ask a similar question but had diagnosed the problem incorrectly. I previously asked for help with finding an average score. What I actually need is a count of the different key possibilities. I need to be able to do so without any outside packages unfortunately.
Use collections.Counter:
from collections import Counter
studentPerf = {('Jeffery','male','junior'): [0.81,0.75,0.74,0.8],
               ('Able','male','senior'): [0.87,0.79,0.81,0.81],
               ('Don','male','junior'): [0.82,0.77,0.8,0.8],
               ('Will','male','senior'): [0.86,0.78,0.77,0.78],
               ('John','male','junior'): [0.74,0.81,0.87,0.73]}
print(Counter(x[1] for x in studentPerf))
# Counter({'male': 5})
Or, if you need empty counts also:
gender = {'male': 0, 'female': 0}
gender.update(Counter(x[1] for x in studentPerf))
# {'male': 5, 'female': 0}
Or, using dict.fromkeys() with Counter:
d = {'male', 'female'}
gender = dict.fromkeys(d, 0)
gender.update(Counter(x[1] for x in studentPerf))
# {'female': 0, 'male': 5}
Assuming the expected output is {'male':5, 'female':0}, consider using a Counter.
>>> from collections import Counter
>>> c = Counter(male=0, female=0)
>>> c.update(gen for _, gen, _ in studentPerf)
>>> c
Counter({'male': 5, 'female': 0})
Initializing the two keys with zeros is not really necessary, you could also write
>>> c = Counter(gen for _, gen, _ in studentPerf)
>>> c
Counter({'male': 5})
because Counter lookup defaults to zero for missing keys:
>>> c['female']
0
As I said, I was looking for a solution that did not require outside packages. I know the way I've gone about this is rather cumbersome but this was for a class and the exercise had these requirements. I found a way to count all of the males and females and input those values into the dictDemGender dictionary.
genCounts = ([x[1] for x in list(studentPerf.keys())].count('female'), [x[1] for x in list(studentPerf.keys())].count('male'))
dictDemGender = dict(zip(dictDemGender.keys(), genCounts))
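For what it's worth, collections is part of the standard library, but the same counts can also be produced with a plain loop and no imports at all:

```python
studentPerf = {('Jeffery','male','junior'): [0.81,0.75,0.74,0.8],
               ('Able','male','senior'): [0.87,0.79,0.81,0.81],
               ('Don','male','junior'): [0.82,0.77,0.8,0.8],
               ('Will','male','senior'): [0.86,0.78,0.77,0.78],
               ('John','male','junior'): [0.74,0.81,0.87,0.73]}

# Iterating over the dict yields its tuple keys; the gender
# is the second element of each key.
dictDemGender = {'male': 0, 'female': 0}
for name, gender, year in studentPerf:
    dictDemGender[gender] += 1
print(dictDemGender)  # {'male': 5, 'female': 0}
```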
I need to create a program that has a class that creates a "Food" object, and a list called "fridge" that holds the objects created by class "Food".
class Food:
    def __init__(self, name, expiration):
        self.name = name
        self.expiration = expiration
fridge = [Food("beer",4), Food("steak",1), Food("hamburger",1), Food("donut",3),]
This was not hard. Then I created a function that gives you the highest expiration number among the foods.
def exp(fridge):
    expList = []
    xen = 0
    for i in range(0, len(fridge)):
        expList.append(fridge[xen].expiration)
        xen += 1
    print(expList)
    sortedList = sorted(expList)
    return sortedList.pop()
exp(fridge)
This one works too. Now I have to create a function that returns a list where the index is the expiration date and the value at that index is the number of foods with that expiration date.
The output should look like [0, 2, 0, 1, 1]: the 0 at index 0 means that there is no food with expiration date 0, the 2 at index 1 means there are 2 pieces of food with 1 expiration day left, and so on. I got stuck with too many if lines and I can't get this one to work at all. How should I approach this? Thanks for the help.
In order to return it as a list, you will first need to figure out the maximum expiration date in the fridge.
# +1 is needed since 0 is also a possible expiration
max_expiration = max(food.expiration for food in fridge) + 1
exp_list = [0] * max_expiration
for food in fridge:
    exp_list[food.expiration] += 1
print(exp_list)
returns [0, 2, 0, 1, 1]
You can iterate on the list of Food objects and update a dictionary keyed on expiration, with the values as number of items having that expiration. Avoid redundancy such as keeping zero counts in a list by using a collections.Counter object (a subclass of dict):
from collections import Counter
d = Counter(food.expiration for food in fridge)
# fetch number of food with expiration 0
print(d[0]) # -> 0
# fetch number of food with expiration 1
print(d[1]) # -> 2
You can use itertools.groupby to create a dict where key will be the food expiration date and value will be the number of times it occurs in the list
>>> from itertools import groupby
>>> fridge = [Food("beer",4), Food("steak",1), Food("hamburger",1), Food("donut",3),]
>>> d = dict((k, len(list(v))) for k, v in groupby(sorted(fridge, key=lambda x: x.expiration), key=lambda x: x.expiration))
Here we tell groupby to group all elements of the list that have the same expiration (note the key argument to groupby). The output of the groupby operation is roughly equivalent to (k, [v]), where k is the group key and [v] is the list of values belonging to that particular group.
This will produce output like this:
>>> d
>>> {1: 2, 3: 1, 4: 1}
At this point we have expiration and number of times a particular expiration occurs in a list, stored in a dict d.
Next we need to create a list such that if an expiration is present in dict d we output its count, else 0, iterating from 0 up to the maximum key in d. To do this we can do:
>>> [0 if not d.get(x) else d.get(x) for x in range(0, max(d.keys())+1)]
This will yield your required output
>>> [0,2,0,1,1]
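As an aside, dict.get's default argument lets you write that last step a bit more directly:

```python
d = {1: 2, 3: 1, 4: 1}  # expiration -> count, as built above
result = [d.get(x, 0) for x in range(max(d) + 1)]
print(result)  # [0, 2, 0, 1, 1]
```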
Here is a flexible method using collections.defaultdict:
from collections import defaultdict
def ReverseDictionary(input_dict):
    reversed_dict = defaultdict(set)
    for k, v in input_dict.items():
        reversed_dict[v].add(k)
    return reversed_dict
fridge_dict = {f.name: f.expiration for f in fridge}
exp_food = ReverseDictionary(fridge_dict)
# defaultdict(set, {1: {'hamburger', 'steak'}, 3: {'donut'}, 4: {'beer'}})
exp_count = {k: len(exp_food.get(k, set())) for k in range(max(exp_food)+1)}
# {0: 0, 1: 2, 2: 0, 3: 1, 4: 1}
Modify yours with count().
def exp(fridge):
    output = []
    exp_list = [i.expiration for i in fridge]
    for i in range(0, max(exp_list) + 1):
        output.append(exp_list.count(i))
    return output
Let us say we have a few clusters represented as dictionaries below:
cluster1 = {'Disks' : [0,1,2,3,12] , 'left': True , 'right': False}
cluster2 = {'Disks' : [3,4,5,2] , 'left':True ,'right': False }
cluster3 = {'Disks' : [6,7,8,2] , 'left':False ,'right': False }
cluster4 = {'Disks' : [10,11,12] , 'left':True, 'right':True }
Listofclusters = [cluster1,cluster2,cluster3,cluster4]
Then I make a list of the clusters, as above, to store them.
If I want to search the list for a particular disk, and would like it to tell me which clusters within the list have that disk, how would I do that?
Based on the data you have, this should do it:
def findClusters(n, clusters):
    answer = []
    for cluster in clusters:
        if n in cluster['Disks']:
            answer.append(cluster)
    return answer
Now, that's a linear algorithm. With a little preprocessing, you should be able to improve the runtime substantially:
def preprocess(clusters):
    """
    Given the list of clusters, return a dictionary that maps
    disk numbers to a list of clusters that have that disk.
    """
    answer = {}
    for i, cluster in enumerate(clusters):
        for disk in cluster['Disks']:
            if disk not in answer:
                answer[disk] = []
            answer[disk].append(i)
    return answer

def findClusters(preprocessedData, clusters, diskNum):
    answer = []
    for clusterid in preprocessedData[diskNum]:
        answer.append(clusters[clusterid])
    return answer
The preprocessing takes linear time, while the actual search takes constant time to find the relevant cluster ids, plus time linear in the number of clusters found to build the result list.
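Putting the two together with the clusters from the question (the functions are repeated here in a slightly condensed form so the snippet runs on its own):

```python
def preprocess(clusters):
    # disk number -> list of cluster indices containing that disk
    answer = {}
    for i, cluster in enumerate(clusters):
        for disk in cluster['Disks']:
            answer.setdefault(disk, []).append(i)
    return answer

def findClusters(preprocessedData, clusters, diskNum):
    return [clusters[cid] for cid in preprocessedData.get(diskNum, [])]

cluster1 = {'Disks': [0, 1, 2, 3, 12], 'left': True, 'right': False}
cluster2 = {'Disks': [3, 4, 5, 2], 'left': True, 'right': False}
cluster3 = {'Disks': [6, 7, 8, 2], 'left': False, 'right': False}
cluster4 = {'Disks': [10, 11, 12], 'left': True, 'right': True}
Listofclusters = [cluster1, cluster2, cluster3, cluster4]

index = preprocess(Listofclusters)       # build once, O(total disks)
hits = findClusters(index, Listofclusters, 2)
print(len(hits))  # 3 -- clusters 1, 2 and 3 all contain disk 2
```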
According to your requirement:
to search the list for a particular disk and would like it to tell me
which clusters within the list have those disks
Use the following approach, which builds a dict cluster_numbers whose keys are disk numbers and whose values are lists of cluster names (cluster order numbers).
Let's find all cluster names(numbers) which have one or more disk numbers from the following list [2, 10, 7]
search_disks = [2, 10, 7]
cluster_numbers = {d:[] for d in search_disks}
for d in cluster_numbers.keys():
for k, c in enumerate(Listofclusters):
if d in c['Disks']: cluster_numbers[d].extend(['cluster' + str(k+1)])
print(cluster_numbers)
The output:
{2: ['cluster1', 'cluster2', 'cluster3'], 10: ['cluster4'], 7: ['cluster3']}
If, for example, you want to search for disk 10, you can do the following:
>>> [cluster for cluster in Listofclusters if 10 in cluster['Disks']]
... [{'Disks': [10, 11, 12], 'right': True, 'left': True}]