Get an evenly distributed subset of combinations without repetition - python

I'm trying to get a subset of combinations such that every option is used the same amount of times, or close to it, from the total set of combinations without repetition. For example, I have 8 options (let's say A-H) and I need combinations of 4 letters where order doesn't matter. That would give me 70 possible combinations. I would like to take a subset of those combinations such that A appears as much as each other letter does, and A appears with B as much as C appears with D, etc. I know there are subsets where it is impossible to have each letter appear the same amount of times and appear with another letter the same amount of times so when I say "same amount of times" in this post, I mean the same amount or close to it.
If the options are written out in an organized list as is shown below, I couldn't just select the first N options because that would give A far more use than it would H. Also, A and B would appear together more than C and D. The main idea is to get as evenly distributed use of each letter combination as possible.
ABCD ABCE ABCF ABCG ABCH ABDE ABDF ABDG ABDH ABEF ABEG ABEH ABFG ABFH ABGH ACDE ACDF ACDG ACDH ACEF ACEG ACEH ACFG ACFH ACGH ADEF ADEG ADEH ADFG ADFH ADGH AEFG AEFH AEGH AFGH BCDE BCDF BCDG BCDH BCEF BCEG BCEH BCFG BCFH BCGH BDEF BDEG BDEH BDFG BDFH BDGH BEFG BEFH BEGH BFGH CDEF CDEG CDEH CDFG CDFH CDGH CEFG CEFH CEGH CFGH DEFG DEFH DEGH DFGH EFGH
I could take a random sample but being random, it doesn't exactly meet my requirements of taking a subset intentionally to get an even distribution. It could randomly choose a very uneven distribution.
Is there a tool or a mathematical formula to generate a list like I'm asking for? Building one in Python or some other coding language is an option if I had an idea of how to go about it.

You are asking the dealer to shuffle the deck.
The python standard library has a module, named random, containing a shuffle function. Present your eight options, shuffle them, and return the first four or however many you need. It will be random, obeying the distribution that you desire.
EDIT
I'm not sure how I could have expressed "shuffle" more clearly
but I will try, in math, in English and in code.
Draw a random permutation of 8 distinct elements and select the first 4.
Take a shuffled deck of 8 distinct cards, deal 4 of them, discard the rest.
#! /usr/bin/env python
from pprint import pp
import random
import matplotlib.pyplot as plt
import pandas as pd
import typer
class Options:
def __init__(self, all_options, k=4):
self.all_options = all_options
self.k = k
def new_deck(self):
deck = self.all_options.copy()
random.shuffle(deck)
return deck
def choose_options(self):
return self.new_deck()[: self.k]
def choose_many_options(self, n):
for _ in range(n):
yield "".join(self.choose_options())
def main(n: int = 10_000_000):
opt = Options(list("ABCDEFGH"))
demo = list(opt.choose_many_options(3))
pp(demo, width=22)
df = pd.DataFrame(opt.choose_many_options(n), columns=["opt"])
df["cnt"] = 1
with pd.option_context("display.min_rows", 16):
print(df.groupby("opt").sum())
cnts = df.groupby("opt").sum().cnt.tolist()
plt.plot(range(len(cnts)), cnts)
plt.gca().set_xlim((0, 1700))
plt.gca().set_ylim((0, None))
plt.gca().set_xlabel("combination of options")
plt.gca().set_ylabel("number of occurrences")
plt.show()
if __name__ == "__main__":
typer.run(main)
output:
['FABE',
'GEDC',
'FBAC']
cnt
opt
ABCD 6041
ABCE 5851
ABCF 6111
ABCG 5917
ABCH 6050
ABDC 5885
ABDE 5935
ABDF 5937
... ...
HGEC 5796
HGED 5922
HGEF 5859
HGFA 5936
HGFB 5880
HGFC 5869
HGFD 5942
HGFE 6049
[1680 rows x 1 columns]
P(n, k)
= P(8, 4) = n! / (n - k)!
= 40,320 / 24
= 1680
All possible combinations of options have been randomly drawn.
Here is the number of occurrences of each distinct draw.
Note that 5952 occurrences × 1680 gets us to ~ 10 million.
The PRNG arranged matters
"such that every option is used the same amount of times, or close to it."
Having repeatedly rolled a many-sided dice,
we see the anticipated mean and standard deviation
show up in the experimental results.

Related

How to generate all possible combinations from variables with values in Python

Summary:
I have around 50 variables, which all have a value. And I would like to get all possible combinations of variables with a maximum value.
For example: I have variables 'grape: € 0,1', 'apple: € 1', 'banana: € 2,5', 'strawberry: € 4', 'orange: € 5' etc. And I want to get all possible combinations that one can make when having € 5. Also, each variable can be picked once (for example not 5 x apple) and there is a maximum to the number of variables that can be picked.
Above example is a simplification of my problem.
Background:
I have not tried anything yet. I think I have to read in my variables as a dictionary. But for the rest I do not have a clue how to solve this problem in Python.
Code:
Not (yet) available
Expected output:
The output should be all possible combinations of variables that match the condition of containing maximum 'x' variables and represent maximum 'x' value and that each variable is picked not more than once.
Try this one to start with:
import itertools
x=[0.1, 1, 2.5, 4, 5]
N = 5
res = []
for i in range(len(x)):
for el in itertools.combinations(x, i+1):
if(sum(el)<=N and N-sum(el) < min([el_sub for el_sub in x if el_sub not in el] or [N])):
res.append(el)
print(el)
Result you will find in "res". If you want to restrict number of elements - play with the first for loop (and if statement accordingly).
I have also found a technical solution myself. But it takes a lot of processing power, to much I would say. In my case there were 48 variables and I wanted all possible combinations with 20 of them. I believe there are around 9.000.000.000 possible combinations. So Python could not handle this properly. Anyway, my code is below (for 26 fruits and all combinations of 12 of them):
from itertools import combinations
fruits = [
{'A':1.5},
{'B':1.5},
{'C':0.75},
{'D':1.5},
{'E':2.5},
{'F':3.5},
{'G':1},
{'H':0.5},
{'I':2.5},
{'J':1},
{'K':0.75},
{'L':3},
{'M':0.5},
{'N':0.75},
{'O':1},
{'P':1},
{'Q':1.5},
{'R':2},
{'S':3.5},
{'T':3},
{'U':0.75},
{'V':2},
{'W':2},
{'X':1.5},
{'Y':4},
{'Z':1.5}
]
combis = list(combinations(fruits,12))
for combi in combis:
total = 0
for fruit in combi:
fruit = fruit.values()
string = (list(value))
integer = (string[0])
total = total + integer
if total == 23.5:
print(list(combi[0].keys())[0]), print(list(combi[1].keys())[0]), print(list(combi[2].keys())[0]),
print(list(combi[3].keys())[0]), print(list(combi[4].keys())[0]), print(list(combi[5].keys())[0]),
print(list(combi[6].keys())[0]), print(list(combi[7].keys())[0]), print(list(combi[8].keys())[0]),
print(list(combi[9].keys())[0]), print(list(combi[10].keys())[0]), print(list(combi[11].keys())[0]),
print(som), print('_______________')
else:
pass

Randomly split users into two groups in the percentage of 80:20

I have a list of ids called users and want to split them randomly into two groups by the percentage of 80:20.
For example i have a list of 100 users ids and randomly put 80 users into group1 and remaining 20 into group2
def getLevelForIncrementality(Object[] args) {
try {
if (args.length >= 1 && args[0]!="") {
String seed = args[0] + "Testing";
int rnd = Math.abs(seed.hashCode() % 100);
return (rnd >= 80 ? 2 : 1);
}
} catch (Exception e) { }
return 3;
}
I have tried from the above groovy code which gives me in the ratio of 82:18.
Can someone give me some insights or suggestions or alogrithms which can solve the above problem for millions of user ids.
You can use random.sample to randomly extract the needed number of elements:
import random
a = list(range(1000))
b = random.sample(a, int(len(a) * 0.8))
len(b)
800
If you have unique IDs, you can try to convert these lists of IDs to sets and differ them like this:
c = list(set(a) - set(b))
it can be also done using train_test_split of sklearn
import numpy as np
from sklearn.model_selection import train_test_split
X = list(np.arange(1000))
x_80_percent, x_20_percent = train_test_split(X, test_size =.20, shuffle = True)
In order to distribute data "on the fly" without creating large lists, you can use a small control list that will tell you how to part users into the two groups (by chunks of 5).
spread = []
while getNextUser():
if not spread
spread = [1,1,1,1,0] # number of 1s and 0s is 4 vs 1 (80%)
random.shuffle(spread)
if spread.pop():
# place on 80% side
else:
# place on 20% side
This will ensure a perfect 80:20 split every fifth user through with a maximum imbalance of 4. As more users are processed this imbalance will become less and less significant.
Worst cases:
19.2% instead of 20% after 99 users, corrects to perfect 20% at 100
19.9% after 999 users, corrects to perfect 20% at 1000
19.99% after 9999 users, corrects to perfect 20% at 10000
Note: you can change the number of 1s and 0s in the spread list to get a different proportion. e.g. [1,1,0] will give you 2 vs 1; [1,1,1,0] is 3 vs 1 (75:25); [1]*13+[0]*7 is 13 vs 7 (65:35)
You can generalize this into a generator that will do the proper calculations and initializations for you:
import random
from math import gcd
def spreadRatio(a,b):
d = gcd(a,b)
base = [True]*(a//d)+[False]*(b//d)
spread = []
while True:
if not spread:
spread = base.copy()
random.shuffle(spread)
yield spread.pop()
pareto = spreadRatio(80,20)
while getNextUser():
if next(pareto):
# place on 80% side
else:
# place on 20% side
This also works for spliting a list:
A = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16] ## Sample List
l = (len(A)/10) *8 ## making 80 %
B = A[:int(l)] ## Getting 80% of list
C = A[int(l):] ## Getting remaining list

Python guitar fretboard pitch/frequency implementation

I'm working on an application which, to make my life easier, requires a numeric converter for notes into frequencies that do a certain number of notes per second, including chords.
I found this article which highlighted the frequencies of each note, which manually blended (with pyaudio) to make my own rendition of Smoke On The Water using the mapped sequence from the article for each note.
This would work, and I could create chords by creating parallel processes, though I have no way of converting a note or tab number into a specific pitch. Most of my data is in the form of:
0 3 5 0 3 6 5 0 3 5 3 0
Essentially, I require an equation or function which can return the frequency for the input, with 0 being an open E-low-string and each value increase by 1 is one-fret up the fretboard (1 = F).
Isn't there a blatant pattern?
I'd wish so, but I suspect sine waves are the suspect. Taking the difference of E to F is 5.1, and F to F# is 5.2 and finally, F# to G being 5.5.
Thanks for any help, it's greatly appreciated.
Isn't there a blatant pattern?
Yes, for music in general there is. Two adjacent notes are separated by a factor of 2^(1/12). Wikipedia - Twelfth root of two Wikipedia - Semitone. It tried this out on the numbers in your linked article and the pattern fit perfectly to the number of significant digits shown in the article.
EDIT
OP asked for some code. Here's a quick -- but verbosely documented -- shot at that:
# A semitone (half-step) is the twelfth root of two
# https://en.wikipedia.org/wiki/Semitone
# https://en.wikipedia.org/wiki/Twelfth_root_of_two
SEMITONE_STEP = 2 ** (1/12)
# Standard tuning for a guitar - EADGBE
LOW_E_FREQ = 82.4 # Baseline - low 'E' is 82.4Hz
# In standard tuning, we use the fifth fret to tune the next string
# except for the next-to-highest string where we use the fourth fret.
STRING_STEPS = [5, 5, 5, 4, 5]
# Number of frets can vary but we will just presume it's 24 frets
N_FRETS = 24
# This will be a list of the frequencies of all six strings,
# a list of six lists, where each list is that string's frequencies at each fret
fret_freqs = []
# Start with the low string as our reference point
# We just short-hand the math of multipliying by SEMITONE_STEP over and over
fret_freqs.append([LOW_E_FREQ * (SEMITONE_STEP ** n) for n in range(N_FRETS)])
# Now go through the upper strings and base of each lower-string's fret, just like
# when we are tuning a guitar
for tuning_fret in STRING_STEPS:
# Pick off the nth fret of the previous string and use it as our base frequency
base_freq = fret_freqs[-1][tuning_fret]
fret_freqs.append([base_freq * (SEMITONE_STEP ** n) for n in range(N_FRETS)])
for stringFreqs in fret_freqs:
# We don't need 14 decimal places of precision, thank you very much.
print(["{:.1f}".format(f) for f in stringFreqs])
Output of this:
['82.4', '87.3', '92.5', '98.0', '103.8', '110.0', '116.5', '123.5', '130.8', '138.6', '146.8', '155.6', '164.8', '174.6', '185.0', '196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1']
['110.0', '116.5', '123.5', '130.8', '138.6', '146.8', '155.6', '164.8', '174.6', '185.0', '196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3']
['146.8', '155.6', '164.8', '174.6', '185.0', '196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3']
['196.0', '207.6', '220.0', '233.1', '246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3', '587.3', '622.2', '659.2', '698.4', '739.9']
['246.9', '261.6', '277.2', '293.6', '311.1', '329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3', '587.3', '622.2', '659.2', '698.4', '739.9', '783.9', '830.5', '879.9', '932.2']
['329.6', '349.2', '370.0', '392.0', '415.3', '440.0', '466.1', '493.8', '523.2', '554.3', '587.3', '622.2', '659.2', '698.4', '739.9', '783.9', '830.5', '879.9', '932.2', '987.7', '1046.4', '1108.6', '1174.6', '1244.4']

How to programatically group lap times into teams to minimize difference?

Given the following (arbitrary) lap times:
John: 47.20
Mark: 51.14
Shellie: 49.95
Scott: 48.80
Jack: 46.60
Cheryl: 52.70
Martin: 57.65
Karl: 55.45
Yong: 52.30
Lynetta: 59.90
Sueann: 49.24
Tempie: 47.88
Mack: 51.11
Kecia: 53.20
Jayson: 48.90
Sanjuanita: 45.90
Rosita: 54.43
Lyndia: 52.38
Deloris: 49.90
Sophie: 44.31
Fleta: 58.12
Tai: 61.23
Cassaundra: 49.38 
Oren: 48.39
We're doing a go-kart endurance race, and the idea, rather than allowing team picking is to write a tool to process the initial qualifying times and then spit out the closest-matched groupings.
My initial investigation makes me feel like this is a clique graphing type situation, but having never played with graphing algorithms I feel rather out of my depth.
What would be the fastest/simplest method of generating groups of 3 people with the closest overall average lap time, so as to remove overall advantage/difference between them?
Is this something I can use networkx to achieve, and if so, how would I best define the graph given the dataset above?
When you're faced with a problem like this, one approach is always to leverage randomness.
While other folks say they think X or Y should work, I know my algorithm will converge to at least a local maxima. If you can show that any state space can be reached from any other via pairwise swapping (a property that is true for, say, the Travelling Salesperson Problem), then the algorithm will find the global optimum (given time).
Further, the algorithm attempts to minimize the standard deviation of the average times across the groups, so it provides a natural metric of how good an answer you're getting: Even if the result is non-exact, getting a standard deviation of 0.058 is probably more than close enough for your purposes.
Put another way: there may be an exact solution, but a randomized solution is usually easy to imagine, doesn't take long to code, can converge nicely, and is able to produce acceptable answers.
#!/usr/bin/env python3
import numpy as np
import copy
import random
data = [
(47.20,"John"),
(51.14,"Mark"),
(49.95,"Shellie"),
(48.80,"Scott"),
(46.60,"Jack"),
(52.70,"Cheryl"),
(57.65,"Martin"),
(55.45,"Karl"),
(52.30,"Yong"),
(59.90,"Lynetta"),
(49.24,"Sueann"),
(47.88,"Tempie"),
(51.11,"Mack"),
(53.20,"Kecia"),
(48.90,"Jayson"),
(45.90,"Sanjuanita"),
(54.43,"Rosita"),
(52.38,"Lyndia"),
(49.90,"Deloris"),
(44.31,"Sophie"),
(58.12,"Fleta"),
(61.23,"Tai"),
(49.38 ,"Cassaundra"),
(48.39,"Oren")
]
#Divide into initial groupings
NUM_GROUPS = 8
groups = []
for x in range(NUM_GROUPS): #Number of groups desired
groups.append(data[x*len(data)//NUM_GROUPS:(x+1)*len(data)//NUM_GROUPS])
#Ensure all groups have the same number of members
assert all(len(groups[0])==len(x) for x in groups)
#Get average time of a single group
def FitnessGroup(group):
return np.average([x[0] for x in group])
#Get standard deviation of all groups' average times
def Fitness(groups):
avgtimes = [FitnessGroup(x) for x in groups] #Get all average times
return np.std(avgtimes) #Return standard deviation of average times
#Initially, the best grouping is just the data
bestgroups = copy.deepcopy(groups)
bestfitness = Fitness(groups)
#Generate mutations of the best grouping by swapping two randomly chosen members
#between their groups
for x in range(10000): #Run a large number of times
groups = copy.deepcopy(bestgroups) #Always start from the best grouping
g1 = random.randint(0,len(groups)-1) #Choose a random group A
g2 = random.randint(0,len(groups)-1) #Choose a random group B
m1 = random.randint(0,len(groups[g1])-1) #Choose a random member from group A
m2 = random.randint(0,len(groups[g2])-1) #Choose a random member from group B
groups[g1][m1], groups[g2][m2] = groups[g2][m2], groups[g1][m1] #Swap 'em
fitness = Fitness(groups) #Calculate fitness of new grouping
if fitness<bestfitness: #Is it a better fitness?
bestfitness = fitness #Save fitness
bestgroups = copy.deepcopy(groups) #Save grouping
#Print the results
for g in bestgroups:
for m in g:
print("{0:15}".format(m[1]), end='')
print("{0:15.3f}".format(FitnessGroup(g)), end='')
print("")
print("Standard deviation of teams: {0:.3f}".format(bestfitness))
Running this a couple of times gives a standard deviation of 0.058:
Cheryl Kecia Oren 51.430
Tempie Mark Karl 51.490
Fleta Deloris Jack 51.540
Lynetta Scott Sanjuanita 51.533
Mack Rosita Sueann 51.593
Shellie Lyndia Yong 51.543
Jayson Sophie Tai 51.480
Martin Cassaundra John 51.410
Standard deviation of teams: 0.058
If I understand correctly, just sort the list of times and group the first three, next three, up through the top three.
EDIT: I didn't understand correctly
So, the idea is to take the N people and group them into N/3 teams, making the average times N/3 teams [rather than the 3 people within each team as I mistakenly interpreted] as close as possible. In this case, I think you could still start by sorting the N drivers in decreasing order of times. Then, initialize an empty list of N/3 teams. Then for each driver in decreasing order of lap time, assign them to the team with the smallest current total lap time (or one of these teams, in case of ties). This is a variant of a simple bin packing algorithm.
Here is a simple Python implementation:
times = [47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39]
Nteams = len(times)/3
team_times = [0] * Nteams
team_members = [[]] * Nteams
times = sorted(times,reverse=True)
for m in range(len(times)):
i = team_times.index(min(team_times))
team_times[i] += times[m]
team_members[i] = team_members[i] + [m]
for i in range(len(team_times)):
print(str(team_members[i]) + ": avg time " + str(round(team_times[i]/3,3)))
whose output is
[0, 15, 23]: avg time 51.593
[1, 14, 22]: avg time 51.727
[2, 13, 21]: avg time 51.54
[3, 12, 20]: avg time 51.6
[4, 11, 19]: avg time 51.48
[5, 10, 18]: avg time 51.32
[6, 9, 17]: avg time 51.433
[7, 8, 16]: avg time 51.327
(Note that the team members numbers refer to them in descending order of lap time, starting from 0, rather than to their original ordering).
One issue with this is that if the times varied dramatically, there is no hard restriction to make the number of players on each team exactly 3. However, for your purposes, maybe that's OK, if it makes the relay close, and its probably a rare occurrence when the spread in times is much less than the average time.
EDIT
If you do just want 3 players on each team, in all cases, then the code can be trivially modified to at each step find the team with the least total lap time that doesn't already have three assigned players. This requires a small modification in the main code block:
times = sorted(times,reverse=True)
for m in range(len(times)):
idx = -1
for i in range(Nteams):
if len(team_members[i]) < 3:
if (idx == -1) or (team_times[i] < team_times[idx]):
idx = i
team_times[idx] += times[m]
team_members[idx] = team_members[idx] + [m]
For the example problem in the question, the above solution is of course identical, because it did not try to fit more or less than 3 players per team.
The following algorithm appears to work pretty well. It takes the fastest and slowest people remaining and then finds the person in the middle so that the group average is closest to the global average. Since the extreme values are being used up first, the averages at the end shouldn't be that far off despite the limited selection pool.
from bisect import bisect
times = sorted([47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39])
average = lambda c: sum(c)/len(c)
groups = []
average_time = average(times)
while times:
group = [times.pop(0), times.pop()]
# target value for the third person for best average
target = average_time * 3 - sum(group)
index = min(bisect(times, target), len(times) - 1)
# adjust if the left value is better than the right
if index and abs(target - times[index-1]) < abs(target - times[index]):
index -= 1
group.append(times.pop(index))
groups.append(group)
# [44.31, 61.23, 48.9]
# [45.9, 59.9, 48.8]
# [46.6, 58.12, 49.9]
# [47.2, 57.65, 49.38]
# [47.88, 55.45, 51.14]
# [48.39, 54.43, 51.11]
# [49.24, 53.2, 52.3]
# [49.95, 52.7, 52.38]
The sorting and the iterated binary search are both O(n log n), so the total complexity is O(n log n). Unfortunately, expanding this to larger groups might be tough.
The simplest would probably be to just create 3 buckets--a fast bucket, a medium bucket, and a slow bucket--and assign entries to the buckets by their qualifying times.
Then team together the slowest of the slow, the fastest of the fast, and the median or mean of the mediums. (Not sure whether median or mean is the best choice off the top of my head.) Repeat until you're out of entries.

Grouping Similar Strings

I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,
Term Group
NBA Basketball 1
Basketball NBA 1
Basketball 1
Baseball 2
It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought the nltk may have something along those lines, but I'm only barely familiar with it.
Thanks
You'll want to cluster these terms, and for the similarity metric I recommend Dice's Coefficient at the character-gram level. For example, partition the strings into two-letter sequences to compare (term1="NB", "BA", "A ", " B", "Ba"...).
nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.
import sys, operator
def tokenize(s, glen):
g2 = set()
for i in xrange(len(s)-(glen-1)):
g2.add(s[i:i+glen])
return g2
def dice_grams(g1, g2): return (2.0*len(g1 & g2)) / (len(g1)+len(g2))
def dice(n, s1, s2): return dice_grams(tokenize(s1, n), tokenize(s2, n))
def main():
GRAM_LEN = 4
scores = {}
for i in xrange(1,len(sys.argv)):
for j in xrange(i+1, len(sys.argv)):
s1 = sys.argv[i]
s2 = sys.argv[j]
score = dice(GRAM_LEN, s1, s2)
scores[s1+":"+s2] = score
for item in sorted(scores.iteritems(), key=operator.itemgetter(1)):
print item
When this program is run with your strings, the following similarity scores are produced:
./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.63636363636363635)
('NBA Basketball:Basketball', 0.77777777777777779)
('Basketball NBA:Basketball', 0.77777777777777779)
At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively you may be able to use the similarity scores more directly in your code with a threshold.

Categories