How to generate all possible combinations from variables with values in Python - python

Summary:
I have around 50 variables, each of which has a value, and I would like to get all possible combinations of variables up to a maximum total value.
For example, I have the variables 'grape: €0.10', 'apple: €1', 'banana: €2.50', 'strawberry: €4', 'orange: €5', etc., and I want all possible combinations one can make with €5. Each variable can be picked at most once (so, for example, not 5 x apple), and there is a maximum on the number of variables that can be picked.
The above example is a simplification of my problem.
Background:
I have not tried anything yet. I think I have to read my variables in as a dictionary, but beyond that I have no clue how to solve this problem in Python.
Code:
Not (yet) available
Expected output:
The output should be all possible combinations of variables that contain at most 'x' variables, represent at most a value of 'x', and pick each variable no more than once.

Try this one to start with:
import itertools

x = [0.1, 1, 2.5, 4, 5]
N = 5
res = []
for i in range(len(x)):
    for el in itertools.combinations(x, i + 1):
        # keep combinations that fit the budget and cannot be extended by any unused item
        if sum(el) <= N and N - sum(el) < min([el_sub for el_sub in x if el_sub not in el] or [N]):
            res.append(el)
            print(el)
The result ends up in res. If you want to restrict the number of elements, adjust the range in the first for loop (and the if statement accordingly).
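For example, a minimal sketch of the same idea that also caps the number of picked items (dropping the "no further item fits" filter for simplicity; the max_items = 3 cap is an assumed value, not from the question):
import itertools

x = [0.1, 1, 2.5, 4, 5]
N = 5          # budget
max_items = 3  # assumed cap on how many variables may be picked

res = []
for i in range(1, max_items + 1):          # only combinations of 1..max_items elements
    for el in itertools.combinations(x, i):
        if sum(el) <= N:                   # respect the budget
            res.append(el)
print(res)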

I have also found a technical solution myself, but it takes a lot of processing power, too much I would say. In my case there were 48 variables and I wanted all possible combinations of 20 of them; I believe there are around 9,000,000,000 possible combinations, so Python could not handle this properly. Anyway, my code is below (for 26 fruits and all combinations of 12 of them):
from itertools import combinations
fruits = [
{'A':1.5},
{'B':1.5},
{'C':0.75},
{'D':1.5},
{'E':2.5},
{'F':3.5},
{'G':1},
{'H':0.5},
{'I':2.5},
{'J':1},
{'K':0.75},
{'L':3},
{'M':0.5},
{'N':0.75},
{'O':1},
{'P':1},
{'Q':1.5},
{'R':2},
{'S':3.5},
{'T':3},
{'U':0.75},
{'V':2},
{'W':2},
{'X':1.5},
{'Y':4},
{'Z':1.5}
]
combis = list(combinations(fruits, 12))
for combi in combis:
    total = 0
    for fruit in combi:
        value = list(fruit.values())[0]
        total = total + value
    if total == 23.5:
        for item in combi:
            print(list(item.keys())[0])
        print(total)
        print('_______________')
    else:
        pass
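As a side note on the processing cost: the list(combinations(...)) call above materialises every combination in memory before the loop even starts. Here is a sketch of the same brute-force check written as a lazy generator; it inspects just as many combinations but holds only one at a time, and the helper name matching_combis is mine, not from the original code:
from itertools import combinations

def matching_combis(fruits, size, target):
    # yield the names of each `size`-fruit combination whose prices sum to `target`
    for combi in combinations(fruits, size):
        total = sum(list(f.values())[0] for f in combi)
        if total == target:
            yield [list(f.keys())[0] for f in combi]

for names in matching_combis(fruits, 12, 23.5):  # `fruits` as defined above
    print(names)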

Related

Get an evenly distributed subset of combinations without repetition

I'm trying to get a subset of combinations such that every option is used the same amount of times, or close to it, from the total set of combinations without repetition. For example, I have 8 options (let's say A-H) and I need combinations of 4 letters where order doesn't matter. That would give me 70 possible combinations. I would like to take a subset of those combinations such that A appears as much as each other letter does, and A appears with B as much as C appears with D, etc. I know there are subsets where it is impossible to have each letter appear the same amount of times and appear with another letter the same amount of times so when I say "same amount of times" in this post, I mean the same amount or close to it.
If the options are written out in an organized list as is shown below, I couldn't just select the first N options because that would give A far more use than it would H. Also, A and B would appear together more than C and D. The main idea is to get as evenly distributed use of each letter combination as possible.
ABCD ABCE ABCF ABCG ABCH ABDE ABDF ABDG ABDH ABEF ABEG ABEH ABFG ABFH ABGH ACDE ACDF ACDG ACDH ACEF ACEG ACEH ACFG ACFH ACGH ADEF ADEG ADEH ADFG ADFH ADGH AEFG AEFH AEGH AFGH BCDE BCDF BCDG BCDH BCEF BCEG BCEH BCFG BCFH BCGH BDEF BDEG BDEH BDFG BDFH BDGH BEFG BEFH BEGH BFGH CDEF CDEG CDEH CDFG CDFH CDGH CEFG CEFH CEGH CFGH DEFG DEFH DEGH DFGH EFGH
I could take a random sample but being random, it doesn't exactly meet my requirements of taking a subset intentionally to get an even distribution. It could randomly choose a very uneven distribution.
Is there a tool or a mathematical formula to generate a list like I'm asking for? Building one in Python or some other coding language is an option if I had an idea of how to go about it.
You are asking the dealer to shuffle the deck.
The Python standard library has a module named random containing a shuffle function. Present your eight options, shuffle them, and return the first four or however many you need. It will be random, obeying the distribution that you desire.
EDIT
I'm not sure how I could have expressed "shuffle" more clearly, but I will try: in math, in English, and in code.
Draw a random permutation of 8 distinct elements and select the first 4.
Take a shuffled deck of 8 distinct cards, deal 4 of them, discard the rest.
#! /usr/bin/env python
from pprint import pp
import random
import matplotlib.pyplot as plt
import pandas as pd
import typer
class Options:
    def __init__(self, all_options, k=4):
        self.all_options = all_options
        self.k = k

    def new_deck(self):
        deck = self.all_options.copy()
        random.shuffle(deck)
        return deck

    def choose_options(self):
        return self.new_deck()[: self.k]

    def choose_many_options(self, n):
        for _ in range(n):
            yield "".join(self.choose_options())

def main(n: int = 10_000_000):
    opt = Options(list("ABCDEFGH"))
    demo = list(opt.choose_many_options(3))
    pp(demo, width=22)
    df = pd.DataFrame(opt.choose_many_options(n), columns=["opt"])
    df["cnt"] = 1
    with pd.option_context("display.min_rows", 16):
        print(df.groupby("opt").sum())
    cnts = df.groupby("opt").sum().cnt.tolist()
    plt.plot(range(len(cnts)), cnts)
    plt.gca().set_xlim((0, 1700))
    plt.gca().set_ylim((0, None))
    plt.gca().set_xlabel("combination of options")
    plt.gca().set_ylabel("number of occurrences")
    plt.show()

if __name__ == "__main__":
    typer.run(main)
output:
['FABE',
'GEDC',
'FBAC']
cnt
opt
ABCD 6041
ABCE 5851
ABCF 6111
ABCG 5917
ABCH 6050
ABDC 5885
ABDE 5935
ABDF 5937
... ...
HGEC 5796
HGED 5922
HGEF 5859
HGFA 5936
HGFB 5880
HGFC 5869
HGFD 5942
HGFE 6049
[1680 rows x 1 columns]
P(n, k) = n! / (n - k)!
P(8, 4) = 40,320 / 24 = 1,680
All 1,680 possible ordered draws have occurred; here is the number of occurrences of each distinct one. Note that 5,952 occurrences × 1,680 draws gets us to ~10 million. The PRNG arranged matters "such that every option is used the same amount of times, or close to it." Having repeatedly rolled a many-sided die, we see the anticipated mean and standard deviation show up in the experimental results.
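For reference, the standard library can also express "shuffle and deal k" in a single call with random.sample, which draws without replacement; a minimal sketch:
import random

options = list("ABCDEFGH")

# draw 4 distinct options; order is random and nothing repeats
picked = random.sample(options, 4)
print("".join(picked))  # e.g. 'FABE'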

How to filter elements of Cartesian product following specific ordering conditions

I have to generate multiple reactions with different variables. They have 3 elements; let's call them B, S and H, and they all start with B1. S can only be appended if its index does not exceed B's, so B1S1, B2S2, B2S1, etc. are allowed but not B1S2. The same goes for H: B1S1H1, B2S2H1 or B4S1H1, but never B2S2H3. The final variation would be B5S5H5. I tried itertools.product, but I don't know how to get rid of the elements that don't match my condition, or how to add the next element. Here is my code:
import itertools

a = list(itertools.product([1, 2, 3, 4], repeat=4))
#print(a)
met = open('random_dat.dat', 'w')
met.write('Reactions')
met.write('\n')
for i in range(1, 256):
    met.write('\n')
    met.write('%s: B%sS%sH%s -> B%sS%sH%s' % (i, a[i][3], a[i][2], a[i][1], a[i][3], a[i][2], a[i][1]))
    met.write('\n')
met.close()
Simple for loops will do what you want:
bsh = []
for b in range(1, 6):
    for s in range(1, b + 1):
        for h in range(1, b + 1):
            bsh.append(f"B{b}S{s}H{h}")
print(bsh)
Output:
['B1S1H1', 'B2S1H1', 'B2S1H2', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S1H2', 'B3S1H3',
'B3S2H1', 'B3S2H2', 'B3S2H3', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S1H2',
'B4S1H3', 'B4S1H4', 'B4S2H1', 'B4S2H2', 'B4S2H3', 'B4S2H4', 'B4S3H1', 'B4S3H2',
'B4S3H3', 'B4S3H4', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S1H2',
'B5S1H3', 'B5S1H4', 'B5S1H5', 'B5S2H1', 'B5S2H2', 'B5S2H3', 'B5S2H4', 'B5S2H5',
'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S3H4', 'B5S3H5', 'B5S4H1', 'B5S4H2', 'B5S4H3',
'B5S4H4', 'B5S4H5', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
Thanks to #mikuszefski for pointing out improvements.
Patrick's answer in list-comprehension style:
bsh = [f"B{b}S{s}H{h}" for b in range(1,5) for s in range(1,b+1) for h in range(1,b+1)]
Gives
['B1S1H1',
'B2S1H1',
'B2S1H2',
'B2S2H1',
'B2S2H2',
'B3S1H1',
'B3S1H2',
'B3S1H3',
'B3S2H1',
'B3S2H2',
'B3S2H3',
'B3S3H1',
'B3S3H2',
'B3S3H3',
'B4S1H1',
'B4S1H2',
'B4S1H3',
'B4S1H4',
'B4S2H1',
'B4S2H2',
'B4S2H3',
'B4S2H4',
'B4S3H1',
'B4S3H2',
'B4S3H3',
'B4S3H4',
'B4S4H1',
'B4S4H2',
'B4S4H3',
'B4S4H4']
I would implement your "use itertools.product and get rid of the unnecessary elements" approach the following way:
import itertools
a = list(itertools.product([1,2,3,4,5],repeat=3))
a = [i for i in a if (i[1]<=i[0] and i[2]<=i[1] and i[2]<=i[0])]
Note that I assumed the last element needs to be smaller than or equal to each of the others. Also note that a is now a list of 35 tuples, each holding 3 ints, so you need to make strings of them, for example using an f-string:
a = [f"B{i[0]}S{i[1]}H{i[2]}" for i in a]
print(a)
output:
['B1S1H1', 'B2S1H1', 'B2S2H1', 'B2S2H2', 'B3S1H1', 'B3S2H1', 'B3S2H2', 'B3S3H1', 'B3S3H2', 'B3S3H3', 'B4S1H1', 'B4S2H1', 'B4S2H2', 'B4S3H1', 'B4S3H2', 'B4S3H3', 'B4S4H1', 'B4S4H2', 'B4S4H3', 'B4S4H4', 'B5S1H1', 'B5S2H1', 'B5S2H2', 'B5S3H1', 'B5S3H2', 'B5S3H3', 'B5S4H1', 'B5S4H2', 'B5S4H3', 'B5S4H4', 'B5S5H1', 'B5S5H2', 'B5S5H3', 'B5S5H4', 'B5S5H5']
However, you might also use other formatting methods instead of an f-string if you wish.
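For instance, the older str.format method produces the same strings; a small sketch using a few of the tuples from above:
# equivalent formatting with str.format instead of an f-string
sample = [(1, 1, 1), (2, 1, 1), (2, 2, 1)]
labels = ["B{}S{}H{}".format(b, s, h) for b, s, h in sample]
print(labels)  # ['B1S1H1', 'B2S1H1', 'B2S2H1']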

Python: Given a group of strings uniformly bucket them into k buckets so that same strings go to the same bucket

I have a set (of 2000) rows with many elements per row. One of the elements in a row is a string ("name") that is common per group of 5 rows (the total number of unique names is 500).
I want rows with the same "name" to end up in the same bucket. So the function should always return the same value for the given input.
I want to use it for k-fold cross-validation, so I need to create k buckets with the number of elements as uniformly distributed as possible; a few elements more or less is fine, but more than 10% off is not.
For k = 10 I should have 10 buckets with 200 elements each; 190 or 210 is OK, but 250 or 180 is not. I tried this answer, but it did not give me a very uniform result. This may be due to the dataset itself, but it would be great to have a somewhat balanced number of elements per bucket. K is usually either 5 or 10.
An example:
name1, date1_1, location1_1, number1_1
name1, date1_2, location1_2, number1_2
...
name1, date1_5, location1_5, number1_5
name2, date2_1, location2_1, number2_1
...
name2, date2_5, location2_5, number2_5
...
name400, date400_1, location400_1, number400_1
...
name400, date400_5, location400_5, number400_5
Output example:
i,name1, date1_1, location1_1, number1_1
i,name1, date1_2, location1_2, number1_2
...
i,name1, date1_5, location1_5, number1_5
j,name2, date2_1, location2_1, number2_1
...
j,name2, date2_5, location2_5, number2_5
...
k,name400, date400_1, location400_1, number400_1
...
k,name400, date400_5, location400_5, number400_5
where 1 < i, j, k < K (K = 5 or K = 10)
What you want is a hash-table, yes? In that case just create a dictionary of size K, and devise a hash-function that takes your string as input and comes back with the index. In the example you provided, an appropriate one might be:
h = int(name.split(',')[0].strip("name")) % K
To be fair, this is pretty naive and doesn't take into account the distribution of your names (you could have many with name1 and very few with name400 for example) but if they are more-or-less the same then that method should work reasonably well.
If your names aren't as convenient as that, you could create a secondary table that simply takes in your name and spits out a number. For instance, suppose you had the names: "Bob", "Sally", "Larry", ...
nameIndexMappings = {"Bob" : 0, "Sally" : 1, "Larry" : 2}
h = nameIndexMappings[name.split(',')[0]] % K
Then you can setup another dictionary like this:
rowMapping = dict()
index = 0
for i in range(0, K):
    rowMapping[i] = list()
for row in rows:
    name = row.split(',')[0]
    if name not in nameIndexMappings:
        nameIndexMappings[name] = index  # map the new name to the next free index
        index += 1
    h = nameIndexMappings[name] % K
    rowMapping[h].append(row)
After doing this, rowMapping should contain K lists each with about the same number of elements in them (assuming, of course, that all your names are more-or-less equally distributed).
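If you would rather not maintain a lookup table at all, one alternative (my own sketch, not part of the answer above) is to hash the name itself with hashlib; unlike Python's built-in hash(), its output is stable across interpreter runs, so the same name always lands in the same bucket:
import hashlib

K = 10  # number of buckets / folds

def bucket_for(name, k=K):
    # deterministically map a name to a bucket index in [0, k)
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % k

# rows sharing a name always get the same bucket, in every run
print(bucket_for("name1"), bucket_for("name1"), bucket_for("name2"))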
What you are asking for isn't feasible without more constraints.
Imagine your input consisted of the string "A" N times, with N arbitrarily large, and the string "B" only once. What would you like the output to be?
In any case, what you want to solve is a bin-packing optimization problem.
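To make the bin-packing framing concrete, here is a rough sketch of the usual greedy heuristic for this situation (the toy data and K = 5 are illustrative): count the rows per name, then repeatedly drop the largest remaining name group into the currently smallest bucket.
from collections import Counter

K = 5
names = ["name1"] * 5 + ["name2"] * 5 + ["name3"] * 5 + ["name4"] * 5  # toy data

group_sizes = Counter(names)
buckets = [[] for _ in range(K)]
bucket_sizes = [0] * K

# greedy balancing: largest groups first, each into the smallest bucket so far
for name, size in group_sizes.most_common():
    i = bucket_sizes.index(min(bucket_sizes))
    buckets[i].append(name)
    bucket_sizes[i] += size

print(buckets, bucket_sizes)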

How to programatically group lap times into teams to minimize difference?

Given the following (arbitrary) lap times:
John: 47.20
Mark: 51.14
Shellie: 49.95
Scott: 48.80
Jack: 46.60
Cheryl: 52.70
Martin: 57.65
Karl: 55.45
Yong: 52.30
Lynetta: 59.90
Sueann: 49.24
Tempie: 47.88
Mack: 51.11
Kecia: 53.20
Jayson: 48.90
Sanjuanita: 45.90
Rosita: 54.43
Lyndia: 52.38
Deloris: 49.90
Sophie: 44.31
Fleta: 58.12
Tai: 61.23
Cassaundra: 49.38 
Oren: 48.39
We're doing a go-kart endurance race, and the idea, rather than allowing team picking, is to write a tool to process the initial qualifying times and then spit out the closest-matched groupings.
My initial investigation makes me feel like this is a clique graphing type situation, but having never played with graphing algorithms I feel rather out of my depth.
What would be the fastest/simplest method of generating groups of 3 people with the closest overall average lap time, so as to remove overall advantage/difference between them?
Is this something I can use networkx to achieve, and if so, how would I best define the graph given the dataset above?
When you're faced with a problem like this, one approach is always to leverage randomness.
While other folks say they think X or Y should work, I know my algorithm will converge to at least a local optimum. If you can show that any state can be reached from any other via pairwise swapping (a property that is true for, say, the Travelling Salesperson Problem), then the algorithm will find the global optimum (given time).
Further, the algorithm attempts to minimize the standard deviation of the average times across the groups, so it provides a natural metric of how good an answer you're getting: Even if the result is non-exact, getting a standard deviation of 0.058 is probably more than close enough for your purposes.
Put another way: there may be an exact solution, but a randomized solution is usually easy to imagine, doesn't take long to code, can converge nicely, and is able to produce acceptable answers.
#!/usr/bin/env python3
import numpy as np
import copy
import random
data = [
(47.20,"John"),
(51.14,"Mark"),
(49.95,"Shellie"),
(48.80,"Scott"),
(46.60,"Jack"),
(52.70,"Cheryl"),
(57.65,"Martin"),
(55.45,"Karl"),
(52.30,"Yong"),
(59.90,"Lynetta"),
(49.24,"Sueann"),
(47.88,"Tempie"),
(51.11,"Mack"),
(53.20,"Kecia"),
(48.90,"Jayson"),
(45.90,"Sanjuanita"),
(54.43,"Rosita"),
(52.38,"Lyndia"),
(49.90,"Deloris"),
(44.31,"Sophie"),
(58.12,"Fleta"),
(61.23,"Tai"),
(49.38 ,"Cassaundra"),
(48.39,"Oren")
]
# Divide into initial groupings
NUM_GROUPS = 8
groups = []
for x in range(NUM_GROUPS):  # Number of groups desired
    groups.append(data[x * len(data) // NUM_GROUPS:(x + 1) * len(data) // NUM_GROUPS])

# Ensure all groups have the same number of members
assert all(len(groups[0]) == len(x) for x in groups)

# Get average time of a single group
def FitnessGroup(group):
    return np.average([x[0] for x in group])

# Get standard deviation of all groups' average times
def Fitness(groups):
    avgtimes = [FitnessGroup(x) for x in groups]  # Get all average times
    return np.std(avgtimes)                       # Return standard deviation of average times

# Initially, the best grouping is just the data
bestgroups = copy.deepcopy(groups)
bestfitness = Fitness(groups)

# Generate mutations of the best grouping by swapping two randomly chosen members
# between their groups
for x in range(10000):                           # Run a large number of times
    groups = copy.deepcopy(bestgroups)           # Always start from the best grouping
    g1 = random.randint(0, len(groups) - 1)      # Choose a random group A
    g2 = random.randint(0, len(groups) - 1)      # Choose a random group B
    m1 = random.randint(0, len(groups[g1]) - 1)  # Choose a random member from group A
    m2 = random.randint(0, len(groups[g2]) - 1)  # Choose a random member from group B
    groups[g1][m1], groups[g2][m2] = groups[g2][m2], groups[g1][m1]  # Swap 'em
    fitness = Fitness(groups)                    # Calculate fitness of new grouping
    if fitness < bestfitness:                    # Is it a better fitness?
        bestfitness = fitness                    # Save fitness
        bestgroups = copy.deepcopy(groups)       # Save grouping

# Print the results
for g in bestgroups:
    for m in g:
        print("{0:15}".format(m[1]), end='')
    print("{0:15.3f}".format(FitnessGroup(g)), end='')
    print("")
print("Standard deviation of teams: {0:.3f}".format(bestfitness))
Running this a couple of times gives a standard deviation of 0.058:
Cheryl Kecia Oren 51.430
Tempie Mark Karl 51.490
Fleta Deloris Jack 51.540
Lynetta Scott Sanjuanita 51.533
Mack Rosita Sueann 51.593
Shellie Lyndia Yong 51.543
Jayson Sophie Tai 51.480
Martin Cassaundra John 51.410
Standard deviation of teams: 0.058
If I understand correctly, just sort the list of times and group the first three, the next three, and so on up through the last three.
EDIT: I didn't understand correctly
So the idea is to take the N people and group them into N/3 teams, making the average times of the N/3 teams [rather than of the 3 people within each team, as I mistakenly interpreted] as close as possible. In this case, I think you could still start by sorting the N drivers in decreasing order of lap time. Then initialize an empty list of N/3 teams. Then, for each driver in decreasing order of lap time, assign them to the team with the smallest current total lap time (or one of those teams, in case of ties). This is a variant of a simple bin-packing algorithm.
Here is a simple Python implementation:
times = [47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39]
Nteams = len(times) // 3  # integer division so it can be used as a count
team_times = [0] * Nteams
team_members = [[] for _ in range(Nteams)]
times = sorted(times, reverse=True)
for m in range(len(times)):
    i = team_times.index(min(team_times))  # team with the smallest total so far
    team_times[i] += times[m]
    team_members[i] = team_members[i] + [m]
for i in range(len(team_times)):
    print(str(team_members[i]) + ": avg time " + str(round(team_times[i] / 3, 3)))
whose output is
[0, 15, 23]: avg time 51.593
[1, 14, 22]: avg time 51.727
[2, 13, 21]: avg time 51.54
[3, 12, 20]: avg time 51.6
[4, 11, 19]: avg time 51.48
[5, 10, 18]: avg time 51.32
[6, 9, 17]: avg time 51.433
[7, 8, 16]: avg time 51.327
(Note that the team members numbers refer to them in descending order of lap time, starting from 0, rather than to their original ordering).
One issue with this is that if the times varied dramatically, there is no hard restriction keeping the number of players on each team at exactly 3. However, for your purposes maybe that's OK if it keeps the relay close, and it's probably a rare occurrence when the spread in times is much smaller than the average time.
EDIT
If you do just want 3 players on each team in all cases, then the code can be trivially modified so that, at each step, it finds the team with the least total lap time that doesn't already have three assigned players. This requires a small modification to the main code block:
times = sorted(times, reverse=True)
for m in range(len(times)):
    idx = -1
    for i in range(Nteams):
        if len(team_members[i]) < 3:
            if (idx == -1) or (team_times[i] < team_times[idx]):
                idx = i
    team_times[idx] += times[m]
    team_members[idx] = team_members[idx] + [m]
For the example problem in the question, the above solution is of course identical, because it did not try to fit more or less than 3 players per team.
The following algorithm appears to work pretty well. It takes the fastest and slowest people remaining and then finds the person in the middle so that the group average is closest to the global average. Since the extreme values are being used up first, the averages at the end shouldn't be that far off despite the limited selection pool.
from bisect import bisect

times = sorted([47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39])
average = lambda c: sum(c) / len(c)

groups = []
average_time = average(times)
while times:
    group = [times.pop(0), times.pop()]
    # target value for the third person for best average
    target = average_time * 3 - sum(group)
    index = min(bisect(times, target), len(times) - 1)
    # adjust if the left value is better than the right
    if index and abs(target - times[index - 1]) < abs(target - times[index]):
        index -= 1
    group.append(times.pop(index))
    groups.append(group)
# [44.31, 61.23, 48.9]
# [45.9, 59.9, 48.8]
# [46.6, 58.12, 49.9]
# [47.2, 57.65, 49.38]
# [47.88, 55.45, 51.14]
# [48.39, 54.43, 51.11]
# [49.24, 53.2, 52.3]
# [49.95, 52.7, 52.38]
The sorting and the iterated binary search are both O(n log n), so the total complexity is O(n log n). Unfortunately, expanding this to larger groups might be tough.
The simplest would probably be to just create 3 buckets--a fast bucket, a medium bucket, and a slow bucket--and assign entries to the buckets by their qualifying times.
Then team together the slowest of the slow, the fastest of the fast, and the median or mean of the mediums. (Not sure whether median or mean is the best choice off the top of my head.) Repeat until you're out of entries.
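A rough sketch of that three-bucket pairing, assuming the field splits evenly into thirds (an illustration only, not code from the answer):
times = sorted([47.20, 51.14, 49.95, 48.80, 46.60, 52.70, 57.65, 55.45, 52.30, 59.90, 49.24, 47.88, 51.11, 53.20, 48.90, 45.90, 54.43, 52.38, 49.90, 44.31, 58.12, 61.23, 49.38, 48.39])

# split the sorted field into fast / medium / slow thirds
third = len(times) // 3
fast, medium, slow = times[:third], times[third:2 * third], times[2 * third:]

# team the fastest of the fast with the slowest of the slow and a middle medium
teams = []
while fast and medium and slow:
    team = [fast.pop(0), medium.pop(len(medium) // 2), slow.pop()]
    teams.append(team)
    print(team, "avg", round(sum(team) / 3, 3))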

Discover different lines across similar files

I have a text file with many tens of thousands short sentences like this:
go to venice
come back from grece
new york here i come
from belgium to russia and back to spain
I run a tagging algorithm which produces a tagged output of this sentence file:
go to <place>venice</place>
come back from <place>grece</place>
<place>new york</place> here i come
from <place>belgium</place> to <place>russia</place> and back to <place>spain</place>
The algorithm runs over the input multiple times and produces slightly different tagging each time. My goal is to identify the lines where those differences occur. In other words, print all utterances for which the tagging differs across the N result files.
For example, with N=10 I get 10 tagged files. Suppose line 1 is tagged the same way in all 10 tagged files: do not print it. Suppose line 2 is tagged one way once and another way 9 times: print it. And so on.
For N=2 it is easy: I just run diff. But what do I do when I have N=10 results?
If you have the tagged files - just create a counter for each line of how many times you've seen it:
# use defaultdict for convenience
from collections import defaultdict

# start counting at 0
counter_dict = defaultdict(lambda: 0)

tagged_file_names = ['tagged1.txt', 'tagged2.txt', ...]

# add all lines of each file to the dict
for file_name in tagged_file_names:
    with open(file_name) as f:
        # use enumerate to maintain order
        # produces (LINE_NUMBER, LINE_CONTENT) tuples (hashable)
        for line_with_number in enumerate(f.readlines()):
            counter_dict[line_with_number] += 1

# print all lines that do not repeat in all files (in the same location)
for key, value in counter_dict.items():
    if value < len(tagged_file_names):
        print("line number %d: [%s] only repeated %d times" % (
            key[0], key[1].strip(), value
        ))
Walkthrough:
First of all, we create a data structure that lets us count our entries, which are numbered lines. This data structure is a collections.defaultdict with a default value of 0, which is the count for newly added lines (increased by 1 with each add).
Then, we create the actual entry using a tuple, which is hashable, so it can be used as a dictionary key, and is by default deeply comparable to other tuples. This means (1, "lolz") is equal to (1, "lolz") but different from (1, "not lolz") or (2, "lolz"), so it fits our use of deep-comparing lines to account for content as well as position.
Now all that's left to do is add all entries using a straightforward for loop and then print the keys (numbered lines) whose count is less than the number of tagged files provided, i.e. the lines that are not tagged identically in all files.
Example:
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged1.txt
123
abc
def
reut@tHP-EliteBook-8470p:~/python/counter$ cat tagged2.txt
123
def
def
reut@tHP-EliteBook-8470p:~/python/counter$ ./difference_counter.py
line number 1: [abc] only repeated 1 times
line number 1: [def] only repeated 1 times
If you compare all of them to the first text, then you can get a list of all texts that are different. This might not be the quickest way, but it would work.
import difflib

n1 = '1 2 3 4 5 6'
n2 = '1 2 3 4 5 6'
n3 = '1 2 4 5 6 7'
l = [n1, n2, n3]

# collect every text that differs from the first one
m = [x for x in l if x != l[0]]
# diff each differing text against the first, word by word
for other in m:
    diff = difflib.unified_diff(l[0].split(), other.split())
    print('\n'.join(diff))
