How to programmatically group lap times into teams to minimize difference? - python

Given the following (arbitrary) lap times:
John: 47.20
Mark: 51.14
Shellie: 49.95
Scott: 48.80
Jack: 46.60
Cheryl: 52.70
Martin: 57.65
Karl: 55.45
Yong: 52.30
Lynetta: 59.90
Sueann: 49.24
Tempie: 47.88
Mack: 51.11
Kecia: 53.20
Jayson: 48.90
Sanjuanita: 45.90
Rosita: 54.43
Lyndia: 52.38
Deloris: 49.90
Sophie: 44.31
Fleta: 58.12
Tai: 61.23
Cassaundra: 49.38 
Oren: 48.39
We're doing a go-kart endurance race, and the idea, rather than allowing team picking, is to write a tool to process the initial qualifying times and then spit out the closest-matched groupings.
My initial investigation makes me feel like this is a clique graphing type situation, but having never played with graphing algorithms I feel rather out of my depth.
What would be the fastest/simplest method of generating groups of 3 people with the closest overall average lap time, so as to remove overall advantage/difference between them?
Is this something I can use networkx to achieve, and if so, how would I best define the graph given the dataset above?

When you're faced with a problem like this, one approach is always to leverage randomness.
While other folks say they think X or Y should work, I know my algorithm will converge to at least a local optimum. If you can show that any state can be reached from any other via pairwise swapping (a property that is true for, say, the Travelling Salesperson Problem), then the algorithm will find the global optimum (given enough time).
Further, the algorithm attempts to minimize the standard deviation of the average times across the groups, so it provides a natural metric of how good an answer you're getting: even if the result is non-exact, a standard deviation of 0.058 is probably more than close enough for your purposes.
Put another way: there may be an exact solution, but a randomized solution is usually easy to imagine, doesn't take long to code, can converge nicely, and is able to produce acceptable answers.
#!/usr/bin/env python3
import numpy as np
import copy
import random

data = [
    (47.20, "John"),
    (51.14, "Mark"),
    (49.95, "Shellie"),
    (48.80, "Scott"),
    (46.60, "Jack"),
    (52.70, "Cheryl"),
    (57.65, "Martin"),
    (55.45, "Karl"),
    (52.30, "Yong"),
    (59.90, "Lynetta"),
    (49.24, "Sueann"),
    (47.88, "Tempie"),
    (51.11, "Mack"),
    (53.20, "Kecia"),
    (48.90, "Jayson"),
    (45.90, "Sanjuanita"),
    (54.43, "Rosita"),
    (52.38, "Lyndia"),
    (49.90, "Deloris"),
    (44.31, "Sophie"),
    (58.12, "Fleta"),
    (61.23, "Tai"),
    (49.38, "Cassaundra"),
    (48.39, "Oren")
]

# Divide into initial groupings
NUM_GROUPS = 8
groups = []
for x in range(NUM_GROUPS):  # Number of groups desired
    groups.append(data[x*len(data)//NUM_GROUPS:(x+1)*len(data)//NUM_GROUPS])

# Ensure all groups have the same number of members
assert all(len(groups[0]) == len(x) for x in groups)

# Get average time of a single group
def FitnessGroup(group):
    return np.average([x[0] for x in group])

# Get standard deviation of all groups' average times
def Fitness(groups):
    avgtimes = [FitnessGroup(x) for x in groups]  # Get all average times
    return np.std(avgtimes)                       # Return standard deviation of average times

# Initially, the best grouping is just the data
bestgroups = copy.deepcopy(groups)
bestfitness = Fitness(groups)

# Generate mutations of the best grouping by swapping two randomly chosen members
# between their groups
for x in range(10000):                         # Run a large number of times
    groups = copy.deepcopy(bestgroups)         # Always start from the best grouping
    g1 = random.randint(0, len(groups)-1)      # Choose a random group A
    g2 = random.randint(0, len(groups)-1)      # Choose a random group B
    m1 = random.randint(0, len(groups[g1])-1)  # Choose a random member from group A
    m2 = random.randint(0, len(groups[g2])-1)  # Choose a random member from group B
    groups[g1][m1], groups[g2][m2] = groups[g2][m2], groups[g1][m1]  # Swap 'em
    fitness = Fitness(groups)                  # Calculate fitness of the new grouping
    if fitness < bestfitness:                  # Is it a better fitness?
        bestfitness = fitness                  # Save fitness
        bestgroups = copy.deepcopy(groups)     # Save grouping

# Print the results
for g in bestgroups:
    for m in g:
        print("{0:15}".format(m[1]), end='')
    print("{0:15.3f}".format(FitnessGroup(g)), end='')
    print("")
print("Standard deviation of teams: {0:.3f}".format(bestfitness))
Running this a couple of times gives a standard deviation of 0.058:
Cheryl Kecia Oren 51.430
Tempie Mark Karl 51.490
Fleta Deloris Jack 51.540
Lynetta Scott Sanjuanita 51.533
Mack Rosita Sueann 51.593
Shellie Lyndia Yong 51.543
Jayson Sophie Tai 51.480
Martin Cassaundra John 51.410
Standard deviation of teams: 0.058

If I understand correctly, just sort the list of times and group the first three, next three, up through the top three.
EDIT: I didn't understand correctly
So, the idea is to take the N people and group them into N/3 teams, making the average times of the N/3 teams [rather than of the 3 people within each team, as I mistakenly interpreted] as close as possible. In this case, I think you could still start by sorting the N drivers in decreasing order of times. Then, initialize an empty list of N/3 teams. Then, for each driver in decreasing order of lap time, assign them to the team with the smallest current total lap time (or one of those teams, in case of ties). This is a variant of a simple bin packing algorithm.
Here is a simple Python implementation:
times = [47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39]
Nteams = len(times) // 3
team_times = [0] * Nteams
team_members = [[] for _ in range(Nteams)]
times = sorted(times, reverse=True)
for m in range(len(times)):
    i = team_times.index(min(team_times))
    team_times[i] += times[m]
    team_members[i] = team_members[i] + [m]
for i in range(len(team_times)):
    print(str(team_members[i]) + ": avg time " + str(round(team_times[i]/3, 3)))
whose output is
[0, 15, 23]: avg time 51.593
[1, 14, 22]: avg time 51.727
[2, 13, 21]: avg time 51.54
[3, 12, 20]: avg time 51.6
[4, 11, 19]: avg time 51.48
[5, 10, 18]: avg time 51.32
[6, 9, 17]: avg time 51.433
[7, 8, 16]: avg time 51.327
(Note that the team member numbers refer to the drivers in descending order of lap time, starting from 0, rather than to their original ordering.)
One issue with this is that if the times varied dramatically, there is no hard restriction keeping the number of players on each team at exactly 3. However, for your purposes that may be OK if it makes the relay close, and it's probably a rare occurrence when the spread in times is much less than the average time.
EDIT
If you do just want 3 players on each team in all cases, then the code can be trivially modified so that, at each step, it finds the team with the least total lap time that doesn't already have three assigned players. This requires a small modification in the main code block:
times = sorted(times, reverse=True)
for m in range(len(times)):
    idx = -1
    for i in range(Nteams):
        if len(team_members[i]) < 3:
            if (idx == -1) or (team_times[i] < team_times[idx]):
                idx = i
    team_times[idx] += times[m]
    team_members[idx] = team_members[idx] + [m]
For the example problem in the question, the above solution is of course identical, because it did not try to fit more or less than 3 players per team.

The following algorithm appears to work pretty well. It takes the fastest and slowest people remaining and then finds the person in the middle so that the group average is closest to the global average. Since the extreme values are being used up first, the averages at the end shouldn't be that far off despite the limited selection pool.
from bisect import bisect

times = sorted([47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39])
average = lambda c: sum(c)/len(c)

groups = []
average_time = average(times)
while times:
    group = [times.pop(0), times.pop()]
    # target value for the third person for best average
    target = average_time * 3 - sum(group)
    index = min(bisect(times, target), len(times) - 1)
    # adjust if the left value is better than the right
    if index and abs(target - times[index-1]) < abs(target - times[index]):
        index -= 1
    group.append(times.pop(index))
    groups.append(group)
# [44.31, 61.23, 48.9]
# [45.9, 59.9, 48.8]
# [46.6, 58.12, 49.9]
# [47.2, 57.65, 49.38]
# [47.88, 55.45, 51.14]
# [48.39, 54.43, 51.11]
# [49.24, 53.2, 52.3]
# [49.95, 52.7, 52.38]
The sort and the repeated binary searches are both O(n log n); the list pops add some linear-time shifting, but that is negligible for a field this small. Unfortunately, expanding this to larger groups might be tough.

The simplest would probably be to just create 3 buckets--a fast bucket, a medium bucket, and a slow bucket--and assign entries to the buckets by their qualifying times.
Then team together the slowest of the slow, the fastest of the fast, and the median or mean of the mediums. (Not sure whether median or mean is the best choice off the top of my head.) Repeat until you're out of entries.
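For illustration, here is a minimal sketch of that bucket idea in Python; it is an assumed reading of this answer (split the sorted times into equal thirds, then pair an extreme from each end with a middle value), not code from the answer itself:
times = sorted([47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88,
                51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39])
n = len(times) // 3
fast, medium, slow = times[:n], times[n:2*n], times[2*n:]  # the three buckets

teams = []
while fast:
    f = fast.pop(0)                   # fastest of the fast
    s = slow.pop()                    # slowest of the slow
    m = medium.pop(len(medium) // 2)  # (roughly) the median of the mediums
    teams.append([f, m, s])

for team in teams:
    print(team, round(sum(team) / 3, 3))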

Related

Get an evenly distributed subset of combinations without repetition

I'm trying to get a subset of combinations such that every option is used the same amount of times, or close to it, from the total set of combinations without repetition. For example, I have 8 options (let's say A-H) and I need combinations of 4 letters where order doesn't matter. That would give me 70 possible combinations. I would like to take a subset of those combinations such that A appears as much as each other letter does, and A appears with B as much as C appears with D, etc. I know there are subsets where it is impossible to have each letter appear the same amount of times and appear with another letter the same amount of times so when I say "same amount of times" in this post, I mean the same amount or close to it.
If the options are written out in an organized list as is shown below, I couldn't just select the first N options because that would give A far more use than it would H. Also, A and B would appear together more than C and D. The main idea is to get as evenly distributed use of each letter combination as possible.
ABCD ABCE ABCF ABCG ABCH ABDE ABDF ABDG ABDH ABEF ABEG ABEH ABFG ABFH ABGH ACDE ACDF ACDG ACDH ACEF ACEG ACEH ACFG ACFH ACGH ADEF ADEG ADEH ADFG ADFH ADGH AEFG AEFH AEGH AFGH BCDE BCDF BCDG BCDH BCEF BCEG BCEH BCFG BCFH BCGH BDEF BDEG BDEH BDFG BDFH BDGH BEFG BEFH BEGH BFGH CDEF CDEG CDEH CDFG CDFH CDGH CEFG CEFH CEGH CFGH DEFG DEFH DEGH DFGH EFGH
I could take a random sample but being random, it doesn't exactly meet my requirements of taking a subset intentionally to get an even distribution. It could randomly choose a very uneven distribution.
Is there a tool or a mathematical formula to generate a list like I'm asking for? Building one in Python or some other coding language is an option if I had an idea of how to go about it.
You are asking the dealer to shuffle the deck.
The python standard library has a module, named random, containing a shuffle function. Present your eight options, shuffle them, and return the first four or however many you need. It will be random, obeying the distribution that you desire.
EDIT
I'm not sure how I could have expressed "shuffle" more clearly, but I will try, in math, in English, and in code.
Draw a random permutation of 8 distinct elements and select the first 4.
Take a shuffled deck of 8 distinct cards, deal 4 of them, discard the rest.
#! /usr/bin/env python
from pprint import pp
import random

import matplotlib.pyplot as plt
import pandas as pd
import typer


class Options:
    def __init__(self, all_options, k=4):
        self.all_options = all_options
        self.k = k

    def new_deck(self):
        deck = self.all_options.copy()
        random.shuffle(deck)
        return deck

    def choose_options(self):
        return self.new_deck()[: self.k]

    def choose_many_options(self, n):
        for _ in range(n):
            yield "".join(self.choose_options())


def main(n: int = 10_000_000):
    opt = Options(list("ABCDEFGH"))
    demo = list(opt.choose_many_options(3))
    pp(demo, width=22)

    df = pd.DataFrame(opt.choose_many_options(n), columns=["opt"])
    df["cnt"] = 1
    with pd.option_context("display.min_rows", 16):
        print(df.groupby("opt").sum())
    cnts = df.groupby("opt").sum().cnt.tolist()
    plt.plot(range(len(cnts)), cnts)
    plt.gca().set_xlim((0, 1700))
    plt.gca().set_ylim((0, None))
    plt.gca().set_xlabel("combination of options")
    plt.gca().set_ylabel("number of occurrences")
    plt.show()


if __name__ == "__main__":
    typer.run(main)
output:
['FABE',
'GEDC',
'FBAC']
cnt
opt
ABCD 6041
ABCE 5851
ABCF 6111
ABCG 5917
ABCH 6050
ABDC 5885
ABDE 5935
ABDF 5937
... ...
HGEC 5796
HGED 5922
HGEF 5859
HGFA 5936
HGFB 5880
HGFC 5869
HGFD 5942
HGFE 6049
[1680 rows x 1 columns]
P(n, k) = n! / (n - k)!
P(8, 4) = 40,320 / 24 = 1680
All 1680 possible ordered draws of the options have been randomly drawn; the table above shows the number of occurrences of each distinct draw. Note that ~5952 occurrences × 1680 draws gets us to ~10 million. The PRNG arranged matters "such that every option is used the same amount of times, or close to it." Having repeatedly rolled a many-sided die, we see the anticipated mean and standard deviation show up in the experimental results.
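For reference, the permutation count used above can be checked directly in Python (math.perm requires Python 3.8 or later):
import math

print(math.perm(8, 4))  # 1680 distinct ordered draws of 4 options out of 8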

How can I refactor my Python code to decrease the time complexity?

This code takes 9 seconds, which is a very long time; I guess the problem is the two nested loops in my code:
for symptom in symptoms:
    # check if the symptom is mentioned in the user text
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        print(getSimilarity([combin, norm_symptom]))
        if getSimilarity([combin, norm_symptom]) > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
I tried to use zip like this:
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity([combin, norm_symptom]) > 0.25 and symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)
Indeed, your algorithm is slow because of the two nested loops.
It runs in O(N*M) time (see https://www.freecodecamp.org/news/big-o-notation-why-it-matters-and-why-it-doesnt-1674cfa8a23c/), N being the length of symptoms and M being the length of list_of_combinations.
What can also take time is the getSimilarity computation; what does that operation do?
Use a dict to store the result of getSimilarity for each combination and symptom. That way you avoid calling getSimilarity multiple times for the same pair, which makes the code more efficient and therefore faster.
import collections

similarity_results = collections.defaultdict(dict)

for symptom in symptoms:
    norm_symptom = symptom.replace("_", " ")
    for combin in list_of_combinations:
        # Check if the similarity has already been computed
        if combin in similarity_results[symptom]:
            similarity = similarity_results[symptom][combin]
        else:
            similarity = getSimilarity([combin, norm_symptom])
            similarity_results[symptom][combin] = similarity
        if similarity > 0.25:
            if symptom not in extracted_symptoms:
                extracted_symptoms.append(symptom)
Update:
Alternatively, you could use an algorithm based on the Levenshtein distance, which is a measure of the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another. The python-Levenshtein library implements it.
import Levenshtein

def getSimilarity(s1, s2):
    distance = Levenshtein.distance(s1, s2)
    return 1.0 - (distance / max(len(s1), len(s2)))

extracted_symptoms = []
for symptom, combin in zip(symptoms, list_of_combinations):
    norm_symptom = symptom.replace("_", " ")
    if (getSimilarity(combin, norm_symptom) > 0.25) and (symptom not in extracted_symptoms):
        extracted_symptoms.append(symptom)

Given 2D coordinates of persons in a list, how do I find people whose distance from each other is below a threshold?

I have a list of 4-tuples:
name
address
coordinate x
coordinate y
A jerk is a person whose distance to any other person in the list (at a different address) is less than 2.
I think that my code works, but because of the nested loop its complexity is O(N²). Is it possible to make it more efficient?
people = [
    ('Mickey', '1 High Street', 4.6, 3.2),
    ('Donald', '1 High Street', 6.1, 3.2),
    ('Bambi', '2 High Street', 6.2, 5.2),
    ('Goofy', '3 High Street', 6.4, 2.0),
    ('Eeyore', '2 High Street', 7.0, 6.4)]
min_distance_sq = 2.**2
jerks = set()
for p1 in people:
    if p1 in jerks:
        continue
    for p2 in people:
        if p1[1] == p2[1]:
            continue
        distance_sq = (p1[2]-p2[2])**2 + (p1[3]-p2[3])**2
        if distance_sq < min_distance_sq:
            jerks.add(p1)
            jerks.add(p2)
            break
for jerk in jerks:
    print(jerk[0], 'from', jerk[1], 'is jerk')
One problem in your code is that you compute the distance between Bambi and Mickey, for example, even though you already computed the distance between Mickey and Bambi earlier (you loop over all people again for every person).
How to Fix
You have to loop only forward: compare Mickey with all the others, then compare Donald with all of them except Mickey, and so on. So you are comparing:
'Mickey' with ['Donald', 'Bambi', 'Goofy', 'Eeyore']
'Donald' with ['Bambi', 'Goofy', 'Eeyore']
'Bambi' with ['Goofy', 'Eeyore']
'Goofy' with ['Eeyore']
'Eeyore' with []
So this is the code to do this:
people = [
    ('Mickey', '1 High Street', 4.6, 3.2),
    ('Donald', '1 High Street', 6.1, 3.2),
    ('Bambi', '2 High Street', 6.2, 5.2),
    ('Goofy', '3 High Street', 6.4, 2.0),
    ('Eeyore', '2 High Street', 7.0, 6.4)]
min_distance_sq = 2.**2
jerks = set()
for i, p1 in enumerate(people):
    if p1 not in jerks:
        for p2 in people[i+1:]:  # exclude p1 and the people before it
            if p2[1] != p1[1]:
                distance_sq = (p1[2]-p2[2])**2 + (p1[3]-p2[3])**2
                if distance_sq < min_distance_sq:
                    jerks.add(p1)
                    jerks.add(p2)
                    break
for jerk in jerks:
    print(jerk[0], 'from', jerk[1], 'is jerk')
Running the code gives me the output as follows:
Donald from 1 High Street is jerk
Bambi from 2 High Street is jerk
I don't know if that's what you want; if you could post your expected output, it would be clearer.
Update
I updated the code not to compare people from the same High Street. The output is now the same as the one you got from your code. Thanks to @Stef's comment.
You can build some kind of binary space partitioning (BSP) data structure, for example a k-d tree, which can be built in O(n log n) time, and then run a "nearest neighbour" query for every item.
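A minimal sketch of that idea using SciPy's k-d tree (this assumes SciPy is available and is not part of the original answer; query_pairs returns every pair of point indices within the given radius):
from scipy.spatial import cKDTree

people = [
    ('Mickey', '1 High Street', 4.6, 3.2),
    ('Donald', '1 High Street', 6.1, 3.2),
    ('Bambi', '2 High Street', 6.2, 5.2),
    ('Goofy', '3 High Street', 6.4, 2.0),
    ('Eeyore', '2 High Street', 7.0, 6.4)]

tree = cKDTree([(p[2], p[3]) for p in people])
jerks = set()
for i, j in tree.query_pairs(r=2.0):   # all index pairs at distance <= 2
    if people[i][1] != people[j][1]:   # ignore pairs living at the same address
        jerks.add(people[i])
        jerks.add(people[j])

for jerk in jerks:
    print(jerk[0], 'from', jerk[1], 'is jerk')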
Improvement can be achieved by:
Caching the squared differences, so each one is calculated only once.
Arranging the loops so that each pair of people is compared only once: the first loop iterates over all people, and the second loop iterates only over the people that come after the current first-loop person.
people = [
    ('Mickey', '1 High Street', 4.6, 3.2),
    ('Donald', '1 High Street', 6.1, 3.2),
    ('Bambi', '2 High Street', 6.2, 5.2),
    ('Goofy', '3 High Street', 6.4, 2.0),
    ('Eeyore', '2 High Street', 7.0, 6.4)]

squared_distance_dict = dict()

min_distance_radius = 2.
min_distance_sq = min_distance_radius**2
squared_distance_dict[min_distance_radius] = min_distance_sq

jerks = set()
for i, p1 in enumerate(people):
    for p2 in people[i+1:]:
        # Calculate the coordinate differences
        x_diff = p1[2] - p2[2]
        y_diff = p1[3] - p2[3]
        # Get squared results from the cache dictionary, or calculate them if not present
        x_diff_sq = squared_distance_dict.get(x_diff, x_diff ** 2)
        y_diff_sq = squared_distance_dict.get(y_diff, y_diff ** 2)
        distance_sq = x_diff_sq + y_diff_sq
        if distance_sq < min_distance_sq:
            jerks.add(p1)
            jerks.add(p2)
            break
for jerk in jerks:
    print(jerk[0], 'from', jerk[1], 'is jerk')

Randomly split users into two groups in the ratio of 80:20

I have a list of ids called users and want to split them randomly into two groups in the ratio of 80:20.
For example, if I have a list of 100 user ids, I want to randomly put 80 users into group1 and the remaining 20 into group2.
def getLevelForIncrementality(Object[] args) {
    try {
        if (args.length >= 1 && args[0] != "") {
            String seed = args[0] + "Testing";
            int rnd = Math.abs(seed.hashCode() % 100);
            return (rnd >= 80 ? 2 : 1);
        }
    } catch (Exception e) { }
    return 3;
}
I have tried the above Groovy code, which gives me a ratio of 82:18.
Can someone give me some insights, suggestions, or algorithms that can solve the above problem for millions of user ids?
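For reference, here is a rough Python rendering of the same deterministic hash-bucket idea (hashlib.md5 is an assumed stand-in for Groovy's String.hashCode, so it will not reproduce the exact 82:18 skew, but the bucketing logic is the same):
import hashlib

def get_level_for_incrementality(user_id: str) -> int:
    """Deterministically map a user id to group 1 (~80%) or group 2 (~20%)."""
    seed = (user_id + "Testing").encode()
    bucket = int(hashlib.md5(seed).hexdigest(), 16) % 100  # stable bucket in 0..99
    return 2 if bucket >= 80 else 1

# Roughly 80% of ids land in group 1:
print(sum(get_level_for_incrementality(str(i)) == 1 for i in range(100_000)))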
You can use random.sample to randomly extract the needed number of elements:
import random
a = list(range(1000))
b = random.sample(a, int(len(a) * 0.8))
len(b)
800
If you have unique IDs, you can try to convert these lists of IDs to sets and differ them like this:
c = list(set(a) - set(b))
It can also be done using train_test_split from sklearn:
import numpy as np
from sklearn.model_selection import train_test_split
X = list(np.arange(1000))
x_80_percent, x_20_percent = train_test_split(X, test_size =.20, shuffle = True)
In order to distribute data "on the fly" without creating large lists, you can use a small control list that tells you how to split users into the two groups (in chunks of 5).
spread = []
while getNextUser():
    if not spread:
        spread = [1, 1, 1, 1, 0]  # number of 1s and 0s is 4 vs 1 (80%)
        random.shuffle(spread)
    if spread.pop():
        ...  # place on 80% side
    else:
        ...  # place on 20% side
This will ensure a perfect 80:20 split after every fifth user, with a maximum imbalance of 4 in between. As more users are processed, this imbalance becomes less and less significant.
Worst cases:
19.2% instead of 20% after 99 users, corrects to perfect 20% at 100
19.9% after 999 users, corrects to perfect 20% at 1000
19.99% after 9999 users, corrects to perfect 20% at 10000
Note: you can change the number of 1s and 0s in the spread list to get a different proportion. e.g. [1,1,0] will give you 2 vs 1; [1,1,1,0] is 3 vs 1 (75:25); [1]*13+[0]*7 is 13 vs 7 (65:35)
You can generalize this into a generator that will do the proper calculations and initializations for you:
import random
from math import gcd

def spreadRatio(a, b):
    d = gcd(a, b)
    base = [True]*(a//d) + [False]*(b//d)
    spread = []
    while True:
        if not spread:
            spread = base.copy()
            random.shuffle(spread)
        yield spread.pop()

pareto = spreadRatio(80, 20)
while getNextUser():
    if next(pareto):
        ...  # place on 80% side
    else:
        ...  # place on 20% side
This also works for splitting a list:
A = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] ## Sample List
l = (len(A)/10) *8 ## making 80 %
B = A[:int(l)] ## Getting 80% of list
C = A[int(l):] ## Getting remaining list

reduce a list of string values by similarity score

I am facing a machine learning problem; the training data consists of numerics, categoricals, and dates. I started training based only on the numerics and dates (which I converted to numeric using epoch, weekday, hour, and so on). Apart from a poor score, performance is very good (seconds of training on one million entries).
The problem is with the categoricals, many of which have a huge number of values, up to thousands.
The values consist of equipment brands, comments, and the like, and are entered by humans, so I assume there is a lot of resemblance between them. I can sacrifice a bit of real-world representation in the data (hence score) for feasibility (training time).
Programming challenge: I came up with this from this nice performance analysis
import difflib

def gcm1(strings):
    clusters = {}
    co = 0
    for string in (x for x in strings):
        if co % 10000 == 0:
            print(co)
        co = co + 1
        if string in clusters:
            clusters[string].append(string)
        else:
            match = difflib.get_close_matches(string, clusters.keys(), 1, 0.90)
            if match:
                clusters[match[0]].append(string)
            else:
                clusters[string] = [string]
    return clusters

def reduce(lines_):
    clusters = gcm1(lines_)
    clusters = dict((v, k) for k in clusters for v in clusters[k])
    return [clusters.get(item, item) for item in lines_]
An example of this is:
reduce(['XHSG11', 'XHSG8', 'DOIIV', 'D.OIIV ', ...])
=> ['XHSG11', 'XHSG11', 'DOIIV', 'DOIIV ', ...]
I am pretty much bound to Python, so I couldn't get other C-implemented code running.
Obviously, the call to difflib.get_close_matches in each iteration is the most expensive part.
Is there a better alternative, or a better formulation of my algorithm?
As I said, with a million entries over, say, 10 columns, I can't even estimate when the algorithm will finish (more than 3 hours and still running on my 16 GB of RAM and i7 4790K CPU).
Data is like (extract):
Comments: [nan '1er rdv' '16H45-VE' 'VTE 2016 APRES 9H'
'ARM : SERENITE DV. RECUP.CONTRAT. VERIF TYPE APPAREIL. RECTIF TVA SI NECESSAIRE']
422227 different values
MODELE_CODE: ['VIESK02534' 'CMA6781031' 'ELMEGLM23HNATVMC' 'CMACALYDRADELTA2428FF'
'FBEZZCIAO3224SVMC']
10206 values
MARQUE_LIB: ['VIESSMANN' 'CHAFFOTEAUX ET MAURY' 'ELM LEBLANC' 'FR BG' 'CHAPPEE']
167 values
... more columns
