Extending grouping code to handle more general inputs - python

I need to group either a list of floats, or a list of (named)tuples of varying length, into groups based on whether the key is bigger or smaller than a given value.
For example given a list of powers of 2 less than 1, and a list of cutoffs:
twos = [2**(-(i+1)) for i in range(0,10)]
cutoffs = [0.5, 0.125, 0.03125]
Then function
split_into_groups(twos, cutoffs)
should return
[[0.5], [0.25, 0.125], [0.0625, 0.03125], [0.015625, 0.0078125, 0.00390625, 0.001953125, 0.0009765625]]
I've implemented the function like this:
from itertools import groupby
from operator import itemgetter

def split_by_prob(items, cutoff, groups, key=None):
    # note: tuple-unpacking lambdas like this are Python 2 only
    for k, g in groupby(enumerate(items), lambda (j, x): x < cutoff):
        groups.append(map(itemgetter(1), g))
    return groups
def split_into_groups(items, cutoffs, key=None):
    groups = items
    final = []
    for i in cutoffs:
        groups = split_by_prob(groups, i, [], key)
        if len(groups) > 1:
            final.append(groups[0])
            groups = groups.pop()
        else:
            final.append(groups[0])
            return final
    final.append(groups)
    return final
The tests that these currently pass are:
>>> split_by_prob(twos, 0.5, [])
[[0.5], [0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625, 0.001953125, 0.0009765625]]
>>> split_into_groups(twos, cutoffs)
[[0.5], [0.25, 0.125], [0.0625, 0.03125], [0.015625, 0.0078125, 0.00390625, 0.001953125, 0.0009765625]]
>>> split_into_groups(twos, cutoffs_p10)
[[0.5, 0.25, 0.125], [0.0625, 0.03125, 0.015625], [0.0078125, 0.00390625, 0.001953125], [0.0009765625]]
Where cutoffs_p10 = [10**(-(i+1)) for i in range(0,5)]
I can straightforwardly extend this to a list of tuples of the form
items = zip(range(0,10), twos)
by changing
def split_by_prob(items, cutoff, groups, key=None):
    for k, g in groupby(enumerate(items), lambda (j, x): x < cutoff):
        groups.append(map(itemgetter(1), g))
    return groups
to
def split_by_prob(items, cutoff, groups, key=None):
    for k, g in groupby(enumerate(items), lambda (j, x): x[1] < cutoff):
        groups.append(map(itemgetter(1), g))
    return groups
How do I go about extending the original method by adding a key that defaults to handling a list of floats (or ints, etc.) but could also handle tuples and namedtuples?
For example something like:
split_into_groups(items, cutoffs, key=items[0])
would return
[[(0,0.5)], [(1,0.25), (2,0.125)], [(3,0.0625), (4,0.03125)], [(5,0.015625), (6,0.0078125), (7,0.00390625), (8,0.001953125), (9,0.0009765625)]]

In my answer I assume the cutoffs are in increasing order - just to simplify the situation.
Discriminator detecting a slot
class Discriminator(object):
    def __init__(self, cutoffs):
        self.cutoffs = sorted(cutoffs)
        self.maxslot = len(cutoffs)

    def findslot(self, num):
        # return the index of the first cutoff that num falls below
        for slot, edge in enumerate(self.cutoffs):
            if num < edge:
                return slot
        return self.maxslot
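A quick check of the slot logic, using the cutoffs from the question:
d = Discriminator([0.5, 0.125, 0.03125])  # stored sorted as [0.03125, 0.125, 0.5]
print(d.findslot(0.25))  # 2 (below 0.5 but not below 0.125)
print(d.findslot(0.5))   # 3 (not below any cutoff, so it gets maxslot)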
grouper to put items into slots
from collections import defaultdict

def grouper(cutoffs, items, key=None):
    if not key:
        key = lambda itm: itm
    discr = Discriminator(cutoffs)
    result = defaultdict(list)
    for item in items:
        num = key(item)
        result[discr.findslot(num)].append(item)
    return result
def split_into_groups(cutoffs, numbers, key=None):
    groups = grouper(cutoffs, numbers, key)
    slot_ids = sorted(groups.keys())
    return [groups[slot_id] for slot_id in slot_ids]
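Applied to the tuple case from the question (a quick sketch; note the groups come out smallest-first here, because slot ids follow the sorted cutoffs):
from operator import itemgetter

twos = [2**(-(i+1)) for i in range(0, 10)]
cutoffs = [0.5, 0.125, 0.03125]
items = list(zip(range(0, 10), twos))

print(split_into_groups(cutoffs, items, key=itemgetter(1)))
# [[(5, 0.015625), (6, 0.0078125), (7, 0.00390625), (8, 0.001953125), (9, 0.0009765625)],
#  [(3, 0.0625), (4, 0.03125)],
#  [(1, 0.25), (2, 0.125)],
#  [(0, 0.5)]]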
Conclusions about Discriminator and grouper
The proposed Discriminator works even for unsorted items.
Conclusions about key
In fact, providing the key function is easier than it originally looked.
It is just a function passed in via a parameter; it becomes an alias for the transformation function to call to get the value we want to use for comparing, grouping etc.
There is a special case of None; for such a situation we have to use some identity function.
The simplest one is:
func = lambda itm: itm
Note: all the functions above were tested by a test suite (incl. use of the key function), but I removed it from this answer as it was becoming far too long.

Related

Sum of duplicate values in 2d array

So, I'm sure similar questions have been asked before but I couldn't find quite what I need.
I have a program that outputs a 2D array like the one below:
arr = [[0.2, 3], [0.3, "End"], ...]
There may be more or less elements, but each is a 2-element array, where the first value is a float and the second can be a float or a string.
Both of those values may repeat. In each of those arrays, the second element takes on only a few possible values.
What I want to do is sum the first elements across the inner arrays that share the same second element, and output a similar array without those duplicated values.
For example:
input = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
output = [[0.5, 1.5], [0.4, 3.5], [0.85, "End"]]
I'd appreciate if the output array was sorted by this second element (floats ascending, strings at the end), although it's not necessary.
EDIT: Thanks for both answers; I've decided to use the one by Chris, because the code was more comprehensible to me, although groupby seems like a function designed to solve this very problem, so I'll try to read up on that, too.
UPDATE: The float values were always positive, by nature of the task at hand, so I used negative values in place of the strings - now a few if statements check for those "encoded" negative values and swap them back to strings just before they're printed out, which makes sorting easier.
You could use a dictionary to accumulate the sum of the first values, keyed by the second item.
To get the 'string' items at the end of the list, their sort key can be set to positive infinity, float('inf').
input_ = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
d = dict()
for pair in input_:
    d[pair[1]] = d.get(pair[1], 0) + pair[0]
L = []
for k, v in d.items():
    L.append([v, k])
L.sort(key=lambda x: x[1] if type(x[1]) == float else float('inf'))
print(L)
This prints:
[[0.5, 1.5], [0.4, 3.5], [0.8500000000000001, 'End']]
You can try to play with itertools.groupby (here a is the input list of pairs):
import itertools

out = [[sum(elt[0] for elt in val), key] for key, val in itertools.groupby(a, key=lambda elt: elt[1])]
>>> [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
Explanation:
Group the 2D list according to the 2nd element of each sublist using itertools.groupby and its key parameter. We define key=lambda elt: elt[1] to group on the 2nd element:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(key, val)
# 1.5 <itertools._grouper object at 0x0000026AD1F6E160>
# End <itertools._grouper object at 0x0000026AD2104EF0>
# 3.5 <itertools._grouper object at 0x0000026AD1F6E160>
For each group, compute the sum of the first elements using the builtin function sum:
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    print(sum(elt[0] for elt in val))
# 0.5
# 0.8500000000000001
# 0.4
Compute the desired output:
out = []
for key, val in itertools.groupby(a, key=lambda elt: elt[1]):
    out.append([sum(elt[0] for elt in val), key])
print(out)
# [[0.5, 1.5], [0.8500000000000001, 'End'], [0.4, 3.5]]
You also asked about sorting on the 2nd value, but that column mixes strings and numbers, which is a problem: Python can't order a number against a string, so the objects must be made comparable first.
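One way around that, a sketch reusing the float('inf') idea from the answer above: sort with the strings pushed to the end, then group; this also guarantees equal keys are adjacent, which groupby requires.
import itertools

a = [[0.4, 1.5], [0.1, 1.5], [0.8, "End"], [0.05, "End"], [0.2, 3.5], [0.2, 3.5]]
# strings sort after floats because (True, ...) > (False, ...)
a.sort(key=lambda elt: (isinstance(elt[1], str), elt[1]))
out = [[sum(elt[0] for elt in val), key]
       for key, val in itertools.groupby(a, key=lambda elt: elt[1])]
print(out)  # [[0.5, 1.5], [0.4, 3.5], [0.8500000000000001, 'End']]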

Python: Removing list duplicates based on first 2 inner list values

Question:
I have a list in the following format:
x = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
The algorithm:
Combine all inner lists with the same starting 2 values, the third value doesn't have to be the same to combine them
e.g. "hello",0,5 is combined with "hello",0,8
But not combined with "hello",1,1
The 3rd value becomes the average of the third values: sum(all 3rd vals) / len(all 3rd vals)
Note: by all 3rd vals I am referring to the 3rd value of each inner list of duplicates
e.g. "hello",0,5 and "hello",0,8 becomes hello,0,6.5
Desired output: (Order of list doesn't matter)
x = [["hello",0,6.5], ["hi",0,6], ["hello",1,1]]
Question:
How can I implement this algorithm in Python?
Ideally it would be efficient as this will be used on very large lists.
If anything is unclear let me know and I will explain.
Edit: I have tried to change the list to a set to remove duplicates, however this doesn't account for the third variable in the inner lists and therefore doesn't work.
Solution Performance:
Thanks to everyone who has provided a solution to this problem! Here are the results based on a speed test of all the functions:
Update using running sum and count
I figured out how to improve my previous code (see original below). You can keep running totals and counts, then compute the averages at the end, which avoids recording all the individual numbers.
from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

def func(lst):
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].add(sub[2])
    lst_out = [[*k, v.calculate()] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
Original answer
This probably won't be very efficient since it has to accumulate all the values to average them. I think you could get around that by having a running average with a weighting factored in, but I'm not quite sure how to do that.
from collections import defaultdict

def avg(nums):
    return sum(nums) / len(nums)

def func(lst):
    thirds = defaultdict(list)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].append(sub[2])
    lst_out = [[*k, avg(v)] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
You can try using groupby.
m = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
from itertools import groupby
m.sort(key=lambda x:x[0]+str(x[1]))
for i,j in groupby(m, lambda x:x[0]+str(x[1])):
ss=0
c=0.0
for k in j:
ss+=k[2]
c+=1.0
print [k[0], k[1], ss/c]
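The same idea in Python 3 syntax (a sketch; sorting on the (word, number) tuple directly avoids concatenating a string key):
from itertools import groupby

m = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
m.sort(key=lambda x: (x[0], x[1]))
for (word, num), grp in groupby(m, key=lambda x: (x[0], x[1])):
    vals = [sub[2] for sub in grp]
    print([word, num, sum(vals) / len(vals)])
# ['hello', 0, 6.5]
# ['hello', 1, 1.0]
# ['hi', 0, 6.0]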
This should be O(N), someone correct me if I'm wrong:
def my_algorithm(input_list):
    """
    :param input_list: list of lists in format [string, int, int]
    :return: list
    """
    # Dict in format (string, int): [int, count_int].
    # Our list is in this format, for example:
    # [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
    # so for our dict we make the keys tuples of the first 2 values of each sublist
    # (since that needs to be unique), while the values are a running sum of the third
    # elements plus a counter (which counts every time we hit a duplicate key, so we
    # can divide by it and get the average).
    my_dict = {}
    for element in input_list:
        # key is a tuple of the first 2 values of each sublist
        key = (element[0], element[1])
        if key not in my_dict:
            # If the key does not exist, add it.
            # The value is the third element of the sublist plus a counter.
            # Since this is the first occurrence, set the counter to 1.
            my_dict[key] = [element[2], 1]
        else:
            # If the key does exist, add to the running sum and increment the counter by 1
            my_dict[key][0] += element[2]
            my_dict[key][1] += 1
    # we have a dict, so we need to convert it back to a list (computing averages on the way)
    return _convert_my_dict_to_list(my_dict)
def _convert_my_dict_to_list(my_dict):
    """
    :param my_dict: dict, key is in form of tuple (string, int) and values are in form of list [int, int_counter]
    :return: list
    """
    my_list = []
    for key, value in my_dict.items():
        sublist = [key[0], key[1], value[0] / value[1]]
        my_list.append(sublist)
    return my_list

my_algorithm(x)
This will return:
[['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
While your expected return is:
[["hello", 0, 6.5], ["hi", 0, 6], ["hello", 1, 1]]
If you really need ints then you can modify the _convert_my_dict_to_list function, as sketched below.
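For example, a small sketch (my variant, not from the original answer) that keeps whole-number averages as ints:
def _convert_my_dict_to_list(my_dict):
    my_list = []
    for key, value in my_dict.items():
        average = value[0] / value[1]
        if average.is_integer():  # e.g. 6.0 -> 6
            average = int(average)
        my_list.append([key[0], key[1], average])
    return my_list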
Here's my variation on this theme: a groupby sans the expensive sort. I also changed the problem to make the input and output lists of tuples, as these are fixed-size records:
from itertools import groupby
from operator import itemgetter
from collections import defaultdict

data = [("hello", 0, 5), ("hi", 0, 6), ("hello", 0, 8), ("hello", 1, 1)]

dictionary = defaultdict(complex)
for key, group in groupby(data, itemgetter(slice(2))):
    values = [value for (string, number, value) in group]
    # real part accumulates the total, imaginary part the count
    dictionary[key] += sum(values) + len(values) * 1j

array = [(*key, value.real / value.imag) for key, value in dictionary.items()]
print(array)
OUTPUT
> python3 test.py
[('hello', 0, 6.5), ('hi', 0, 6.0), ('hello', 1, 1.0)]
>
Thanks to @wjandrea for the itemgetter replacement for lambda. (And yes, I am using complex numbers in passing for the average to track the total and count.)

More concise way of writing a loop inside a loop in python

I have a list of objects that are a specific class within my code like so,
[object1, object2, object3, object4, object5, object6]
Namely this class has two attributes: class.score and class.id
I might have objects with the same id. Eg.:
[object1.id, object2.id, object3.id, object4.id, object5.id, object6.id] = [1, 2, 3, 4, 2, 3]
But with different scores. Eg.:
[object1.score, object2.score, object3.score, object4.score, object5.score, object6.score] = [0.25, 0.55, 0.6, 0.4, 0.30, .33]
What I want is a list of these objects with no duplicates id-wise, but with the scores added together. So for the previous example the output would be:
[object1.id, object2.id, object3.id, object4.id] = [1, 2, 3, 4]
[object1.score, object2.score, object3.score, object4.score] = [.25, .85, .93, .4]
I have managed to do that with two for loops:
k = 1
for object in list_of_objects:
    j = 1
    for object2 in list_of_objects:
        if object.id == object2.id and j > k:
            object.score = object.score + object2.score
            list_of_objects.remove(object2)
        j += 1
    k += 1
But I'm looking to do it in a more Pythonic way, something along the lines of:
newlist[:] = [ x for x in list_of_objects if certain_condition(x)]
Thanks.
itertools.groupby was made exactly for this situation
https://docs.python.org/2/library/itertools.html#itertools.groupby
from itertools import groupby

# object.id is our key:
keyfunc = lambda obj: obj.id
list_of_objects = sorted(list_of_objects, key=keyfunc)
# each group is an iterator over the objects themselves,
# so sum their .score attributes rather than the objects
scores = [sum(obj.score for obj in group) for key, group in groupby(list_of_objects, keyfunc)]
ids = [key for key, group in groupby(list_of_objects, keyfunc)]
Normally you do this using a dictionary to detect already seen objects:
seen = {}
for x in my_objects:
    if x.id in seen:
        seen[x.id].score += x.score
    else:
        seen[x.id] = x
my_objects[:] = seen.values()
Using a dictionary makes the computation O(n) instead of O(n²)
You can do it with the Python built-in functions in a single line, by supplying an additional custom function:
from functools import reduce  # reduce is a builtin in Python 2
from operator import attrgetter

def r(l, o):
    if len(l) > 0 and l[-1].id == o.id:
        l[-1].score += o.score
    else:
        l.append(o)
    return l

key = attrgetter('id')
And then simply use the reduce function in combination with sorted and the above custom function:
list_of_objects = reduce(r, sorted(list_of_objects, key=key), [])
Then you will have what you need (shown here as id: score pairs):
[1: 0.25, 2: 0.85, 3: 0.93, 4: 0.4]

Merging python lists based on a 'similar' float value

I have a list (containing tuples) and I want to merge the list based on whether the first element is within a maximum distance of the other elements (i.e. if the delta value is < 0.05). I have the following list as an example:
[(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
This should yield something like:
[(0.0, 1.883659017),(1.00422, 0.9998252466431066),(2.00425,0.9951777494430947)]
I am thinking that I can use something similar to this question (Merge nested list items based on a repeating value), although a lot of other questions yield a similar answer. The only problem that I see there is that they use collections.defaultdict or itertools.groupby, which require exact matching of the elements. An important addition here is that I want the first element of a merged tuple to be the weighted mixture of the elements. For example:
if (1.001, 80) and (0.99, 20) are matched, then the result should be (0.9988, 100).
Is something similar possible but with the matching based on value difference and not exact match?
What I was trying myself (but don't really like the look of it) is:
Res = 0.05
combinations = itertools.combinations(list, 2)
for i in combinations:
    if i[0][0] > i[1][0] - Res and i[0][0] < i[1][0] + Res:
        newValue = ...
-- UPDATE --
Based on some comments and Dawgs answer I tried the following approach:
for fv, v in total:
    k = round(fv, 2)
    data[k] = data.get(k, 0) + v
using the following list (actual data example, instead of short example list):
total = [(0.0, 0.11630591852564721), (1.00335, 0.25158664272201053), (2.0067, 0.2707487305913156), (3.0100499999999997, 0.19327075057473678), (4.0134, 0.10295042331357719), (5.01675, 0.04364856520231155), (6.020099999999999, 0.015342958201863783), (0.0, 0.9811758192941256), (1.00422, 0.018649427348981), (0.0, 0.9024831978342827), (2.00425, 0.09269455160881204), (0.0, 0.6944298762418107), (0.99703, 0.2536959281304138), (1.99406, 0.045877927988415786)]
which then yields problems with values such as 2.0067 (rounded to 2.01) and 1.99406 (rounded to 1.99), where the total difference is 0.01264, far below 0.05, the value I had in mind as a 'limit' for now (though it should stay adjustable). Rounding the values to 1 decimal place is also not an option, since that would result in a window of ~0.09: values such as 2.04999 and 1.95001 would both yield 2.0 in that case.
The exact output was:
{0.0: 2.694394811895866, 1.0: 0.5239319982014053, 4.01: 0.10295042331357719, 5.02: 0.04364856520231155, 2.0: 0.09269455160881204, 1.99: 0.045877927988415786, 3.01: 0.19327075057473678, 6.02: 0.015342958201863783, 2.01: 0.2707487305913156}
accum = list()
data = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
EPSILON = 0.05

newdata = {d: True for d in data}
for k, v in data:
    if not newdata[(k, v)]:
        continue
    # use each piece of data only once
    newdata[(k, v)] = False
    keys, values = [k * v], [v]
    for kk, vv in [d for d in data if newdata[d]]:
        if abs(k - kk) < EPSILON:
            keys.append(kk * vv)
            values.append(vv)
            newdata[(kk, vv)] = False
    accum.append((sum(keys) / sum(values), sum(values)))
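Running this on the sample data should reproduce the merged list from the question, with the first element of each tuple the weighted mixture (my expected output, not quoted from the original post):
print(accum)
# [(0.0, 1.8836590171284082), (1.00422, 0.9998252466431066), (2.00425, 0.9951777494430947)]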
You can round the float values then use setdefault:
li = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]

data = {}
for fv, v in li:
    k = round(fv, 5)
    data.setdefault(k, 0)
    data[k] += v
print data
# {0.0: 1.8836590171284082, 2.00425: 0.9951777494430947, 1.00422: 0.9998252466431066}
If you want a more complex comparison (other than fixed rounding), you can create a hashable key based on the epsilon value you want and use the same method from there.
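For instance, a minimal sketch (my interpretation, not code from the original answer) that bins each float into an epsilon-wide bucket so that nearby values share a dictionary key:
EPS = 0.05

def bucket(value, eps=EPS):
    # map a float onto the centre of its eps-wide bin;
    # values in the same bin collide on purpose
    return round(value / eps) * eps

data = {}
for fv, v in li:
    data[bucket(fv)] = data.get(bucket(fv), 0) + v
Note this merges by bin rather than by pairwise distance, so two values just either side of a bin edge can still land in different buckets.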
As pointed out in the comments, this works too:
data = {}
for fv, v in li:
    k = round(fv, 5)
    data[k] = data.get(k, 0) + v

assertAlmostEqual in Python unit-test for collections of floats

The assertAlmostEqual(x, y) method in Python's unit testing framework tests whether x and y are approximately equal assuming they are floats.
The problem with assertAlmostEqual() is that it only works on floats. I'm looking for a method like assertAlmostEqual() which works on lists of floats, sets of floats, dictionaries of floats, tuples of floats, lists of tuples of floats, sets of lists of floats, etc.
For instance, let x = 0.1234567890, y = 0.1234567891. x and y are almost equal because they agree on each and every digit except for the last one. Therefore self.assertAlmostEqual(x, y) passes, because assertAlmostEqual() works for floats.
I'm looking for a more generic assertAlmostEqual() which would also pass the following calls:
self.assertAlmostEqual_generic([x, x, x], [y, y, y]).
self.assertAlmostEqual_generic({1: x, 2: x, 3: x}, {1: y, 2: y, 3: y}).
self.assertAlmostEqual_generic([(x,x)], [(y,y)]).
Is there such a method or do I have to implement it myself?
Clarifications:
assertAlmostEqual() has an optional parameter named places, and the numbers are compared by computing the difference rounded to that number of decimal places. By default places=7, hence self.assertAlmostEqual(0.5, 0.4) is False while self.assertAlmostEqual(0.12345678, 0.12345679) is True. My speculative assertAlmostEqual_generic() should have the same functionality.
Two lists are considered almost equal if they have almost equal numbers in exactly the same order. Formally, for i in range(n): self.assertAlmostEqual(list1[i], list2[i]).
Similarly, two sets are considered almost equal if they can be converted to almost equal lists (by assigning an order to each set).
Similarly, two dictionaries are considered almost equal if the key set of each dictionary is almost equal to the key set of the other dictionary, and for each such almost equal key pair there's a corresponding almost equal value.
In general: I consider two collections almost equal if they're equal except for some corresponding floats which are just almost equal to each other. In other words, I would like to really compare objects but with a low (customized) precision when comparing floats along the way.
If you don't mind using NumPy (which comes with your Python(x,y)), you may want to look at the np.testing module, which defines, among others, an assert_almost_equal function.
The signature is np.testing.assert_almost_equal(actual, desired, decimal=7, err_msg='', verbose=True).
>>> x = 1.000001
>>> y = 1.000002
>>> np.testing.assert_almost_equal(x, y)
AssertionError:
Arrays are not almost equal to 7 decimals
ACTUAL: 1.000001
DESIRED: 1.000002
>>> np.testing.assert_almost_equal(x, y, 5)
>>> np.testing.assert_almost_equal([x, x, x], [y, y, y], 5)
>>> np.testing.assert_almost_equal((x, x, x), (y, y, y), 5)
As of Python 3.5 you may compare using
math.isclose(a, b, rel_tol=1e-9, abs_tol=0.0)
as described in PEP 485.
The implementation should be equivalent to
abs(a-b) <= max( rel_tol * max(abs(a), abs(b)), abs_tol )
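Building on that, a minimal sketch (my own, not part of the original answer) applying math.isclose element-wise to lists of floats:
import math

def almost_equal_lists(xs, ys, rel_tol=1e-9, abs_tol=0.0):
    # element-wise comparison of two equal-length lists of floats
    return len(xs) == len(ys) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(xs, ys))

print(almost_equal_lists([0.1234567890] * 3, [0.1234567891] * 3, rel_tol=1e-8))  # True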
Here's how I've implemented a generic is_almost_equal(first, second) function:
First, duplicate the objects you need to compare (first and second), but don't make an exact copy: cut the insignificant decimal digits of any float you encounter inside the object.
Now that you have copies of first and second for which the insignificant decimal digits are gone, just compare first and second using the == operator.
Let's assume we have a cut_insignificant_digits_recursively(obj, places) function which duplicates obj but leaves only the places most significant decimal digits of each float in the original obj. Here's a working implementation of is_almost_equal(first, second, places):
from insignificant_digit_cutter import cut_insignificant_digits_recursively

def is_almost_equal(first, second, places):
    '''Returns True if first and second are equal, or if they aren't equal
    but have exactly the same structure and values except for a bunch of
    floats which are just almost equal (floats are almost equal if they're
    equal when we consider only the [places] most significant digits of
    each).'''
    if first == second:
        return True
    cut_first = cut_insignificant_digits_recursively(first, places)
    cut_second = cut_insignificant_digits_recursively(second, places)
    return cut_first == cut_second
And here's a working implementation of cut_insignificant_digits_recursively(obj, places):
def cut_insignificant_digits(number, places):
    '''cut the least significant decimal digits of a number,
    leave only [places] decimal digits'''
    if type(number) != float:
        return number
    number_as_str = str(number)
    end_of_number = number_as_str.find('.') + places + 1
    if end_of_number > len(number_as_str):
        return number
    return float(number_as_str[:end_of_number])

def cut_insignificant_digits_lazy(iterable, places):
    for obj in iterable:
        yield cut_insignificant_digits_recursively(obj, places)

def cut_insignificant_digits_recursively(obj, places):
    '''return a copy of obj except that every float loses its least
    significant decimal digits, keeping only [places] decimal digits'''
    t = type(obj)
    if t == float:
        return cut_insignificant_digits(obj, places)
    if t in (list, tuple, set):
        return t(cut_insignificant_digits_lazy(obj, places))
    if t == dict:
        return {cut_insignificant_digits_recursively(key, places):
                cut_insignificant_digits_recursively(val, places)
                for key, val in obj.items()}
    return obj
The code and its unit tests are available here: https://github.com/snakile/approximate_comparator. I welcome any improvement and bug fix.
If you don't mind using the numpy package, then numpy.testing has the assert_array_almost_equal method.
This works for array_like objects, so it is fine for arrays, lists and tuples of floats, but it does not work for sets and dictionaries.
The documentation is here.
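A quick sketch of how it behaves on the question's example values (decimal plays the role of places in assertAlmostEqual):
import numpy as np

x, y = 0.1234567890, 0.1234567891
np.testing.assert_array_almost_equal([x, x, x], [y, y, y])           # passes: diff ~1e-10
np.testing.assert_array_almost_equal([(x, x)], [(y, y)], decimal=9)  # nested tuples become a 2-D array
# np.testing.assert_array_almost_equal([x], [y + 1e-3])              # would raise AssertionError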
There is no such method, you'd have to do it yourself.
For lists and tuples the definition is obvious, but note that the other cases you mention aren't obvious, so it's no wonder such a function isn't provided. For instance, is {1.00001: 1.00002} almost equal to {1.00002: 1.00001}? Handling such cases requires making a choice about whether closeness depends on keys or values or both. For sets you are unlikely to find a meaningful definition, since sets are unordered, so there is no notion of "corresponding" elements.
You may have to implement it yourself. While it's true that lists and sets can be iterated the same way, dictionaries are a different story: you iterate their keys, not values. And the third example seems a bit ambiguous to me: do you mean to compare each value within the set, or each value from each set?
Here's a simple code snippet:
def almost_equal(value_1, value_2, accuracy=10**-8):
    return abs(value_1 - value_2) < accuracy

x = [1, 2, 3, 4]
y = [1, 2, 4, 5]
assert all(almost_equal(*values) for values in zip(x, y))
None of these answers work for me. The following code should work for python collections, classes, dataclasses, and namedtuples. I might have forgotten something, but so far this works for me.
import unittest
from collections import namedtuple, OrderedDict
from dataclasses import dataclass
from typing import Any

def are_almost_equal(o1: Any, o2: Any, max_abs_ratio_diff: float, max_abs_diff: float) -> bool:
    """
    Compares two objects by recursively walking them through. Equality is as usual except for floats.
    Floats are compared according to the two measures defined below.

    :param o1: The first object.
    :param o2: The second object.
    :param max_abs_ratio_diff: The maximum allowed absolute value of the ratio difference
        `abs(1 - (o1 / o2))` (or `abs(1 - (o2 / o1))` if o2 == 0.0). Ignored if < 0.
    :param max_abs_diff: The maximum allowed absolute difference `abs(o1 - o2)`. Ignored if < 0.
    :return: Whether the two objects are almost equal.
    """
    if type(o1) != type(o2):
        return False

    composite_type_passed = False

    if hasattr(o1, '__slots__'):
        if len(o1.__slots__) != len(o2.__slots__):
            return False
        if any(not are_almost_equal(getattr(o1, s1), getattr(o2, s2),
                                    max_abs_ratio_diff, max_abs_diff)
               for s1, s2 in zip(sorted(o1.__slots__), sorted(o2.__slots__))):
            return False
        else:
            composite_type_passed = True

    if hasattr(o1, '__dict__'):
        if len(o1.__dict__) != len(o2.__dict__):
            return False
        if any(not are_almost_equal(k1, k2, max_abs_ratio_diff, max_abs_diff)
               or not are_almost_equal(v1, v2, max_abs_ratio_diff, max_abs_diff)
               for ((k1, v1), (k2, v2))
               in zip(sorted(o1.__dict__.items()), sorted(o2.__dict__.items()))
               if not k1.startswith('__')):  # avoid infinite loops
            return False
        else:
            composite_type_passed = True

    if isinstance(o1, dict):
        if len(o1) != len(o2):
            return False
        if any(not are_almost_equal(k1, k2, max_abs_ratio_diff, max_abs_diff)
               or not are_almost_equal(v1, v2, max_abs_ratio_diff, max_abs_diff)
               for ((k1, v1), (k2, v2)) in zip(sorted(o1.items()), sorted(o2.items()))):
            return False
    elif any(issubclass(o1.__class__, c) for c in (list, tuple, set)):
        if len(o1) != len(o2):
            return False
        if any(not are_almost_equal(v1, v2, max_abs_ratio_diff, max_abs_diff)
               for v1, v2 in zip(o1, o2)):
            return False
    elif isinstance(o1, float):
        if o1 == o2:
            return True
        else:
            if max_abs_ratio_diff > 0:  # if max_abs_ratio_diff < 0, it is ignored
                if o2 != 0:
                    if abs(1.0 - (o1 / o2)) > max_abs_ratio_diff:
                        return False
                else:  # if both were 0, we already returned True above
                    if abs(1.0 - (o2 / o1)) > max_abs_ratio_diff:
                        return False
            if 0 < max_abs_diff < abs(o1 - o2):  # if max_abs_diff < 0, it is ignored
                return False
            return True
    else:
        if not composite_type_passed:
            return o1 == o2

    return True
class EqualityTest(unittest.TestCase):
    def test_floats(self) -> None:
        o1 = ('hi', 3, 3.4)
        o2 = ('hi', 3, 3.400001)
        self.assertTrue(are_almost_equal(o1, o2, 0.0001, 0.0001))
        self.assertFalse(are_almost_equal(o1, o2, 0.00000001, 0.00000001))

    def test_ratio_only(self):
        o1 = ['hey', 10000, 123.12]
        o2 = ['hey', 10000, 123.80]
        self.assertTrue(are_almost_equal(o1, o2, 0.01, -1))
        self.assertFalse(are_almost_equal(o1, o2, 0.001, -1))

    def test_diff_only(self):
        o1 = ['hey', 10000, 1234567890.12]
        o2 = ['hey', 10000, 1234567890.80]
        self.assertTrue(are_almost_equal(o1, o2, -1, 1))
        self.assertFalse(are_almost_equal(o1, o2, -1, 0.1))

    def test_both_ignored(self):
        o1 = ['hey', 10000, 1234567890.12]
        o2 = ['hey', 10000, 0.80]
        o3 = ['hi', 10000, 0.80]
        self.assertTrue(are_almost_equal(o1, o2, -1, -1))
        self.assertFalse(are_almost_equal(o1, o3, -1, -1))

    def test_different_lengths(self):
        o1 = ['hey', 1234567890.12, 10000]
        o2 = ['hey', 1234567890.80]
        self.assertFalse(are_almost_equal(o1, o2, 1, 1))

    def test_classes(self):
        class A:
            d = 12.3
            def __init__(self, a, b, c):
                self.a = a
                self.b = b
                self.c = c
        o1 = A(2.34, 'str', {1: 'hey', 345.23: [123, 'hi', 890.12]})
        o2 = A(2.34, 'str', {1: 'hey', 345.231: [123, 'hi', 890.121]})
        self.assertTrue(are_almost_equal(o1, o2, 0.1, 0.1))
        self.assertFalse(are_almost_equal(o1, o2, 0.0001, 0.0001))
        o2.hello = 'hello'
        self.assertFalse(are_almost_equal(o1, o2, -1, -1))

    def test_namedtuples(self):
        B = namedtuple('B', ['x', 'y'])
        o1 = B(3.3, 4.4)
        o2 = B(3.4, 4.5)
        self.assertTrue(are_almost_equal(o1, o2, 0.2, 0.2))
        self.assertFalse(are_almost_equal(o1, o2, 0.001, 0.001))

    def test_classes_with_slots(self):
        class C(object):
            __slots__ = ['a', 'b']
            def __init__(self, a, b):
                self.a = a
                self.b = b
        o1 = C(3.3, 4.4)
        o2 = C(3.4, 4.5)
        self.assertTrue(are_almost_equal(o1, o2, 0.3, 0.3))
        self.assertFalse(are_almost_equal(o1, o2, -1, 0.01))

    def test_dataclasses(self):
        @dataclass
        class D:
            s: str
            i: int
            f: float

        @dataclass
        class E:
            f2: float
            f4: str
            d: D

        o1 = E(12.3, 'hi', D('hello', 34, 20.01))
        o2 = E(12.1, 'hi', D('hello', 34, 20.0))
        self.assertTrue(are_almost_equal(o1, o2, -1, 0.4))
        self.assertFalse(are_almost_equal(o1, o2, -1, 0.001))
        o3 = E(12.1, 'hi', D('ciao', 34, 20.0))
        self.assertFalse(are_almost_equal(o2, o3, -1, -1))

    def test_ordereddict(self):
        o1 = OrderedDict({1: 'hey', 345.23: [123, 'hi', 890.12]})
        o2 = OrderedDict({1: 'hey', 345.23: [123, 'hi', 890.0]})
        self.assertTrue(are_almost_equal(o1, o2, 0.01, -1))
        self.assertFalse(are_almost_equal(o1, o2, 0.0001, -1))
Use Pandas
Another way is to convert each of the two dicts etc into pandas dataframes and then use pd.testing.assert_frame_equal() to compare the two. I have used this successfully to compare lists of dicts.
Previous answers often don't work on structures involving dictionaries, but this one should. I haven't exhaustively tested this on highly nested structures, but I imagine pandas would handle them correctly.
Example 1: compare two dicts
To illustrate this I will use your example data of a dict, since the other methods don't work with dicts. Your dict was:
x, y = 0.1234567890, 0.1234567891
{1: x, 2: x, 3: x}, {1: y, 2: y, 3: y}
Then we can do:
pd.testing.assert_frame_equal(
    pd.DataFrame.from_dict({1: x, 2: x, 3: x}, orient='index'),
    pd.DataFrame.from_dict({1: y, 2: y, 3: y}, orient='index'))
This doesn't raise an error, meaning that they are equal to a certain degree of precision.
However if we were to do
pd.testing.assert_frame_equal(
    pd.DataFrame.from_dict({1: x, 2: x, 3: x}, orient='index'),
    pd.DataFrame.from_dict({1: y, 2: y, 3: y + 1}, orient='index'))  # add 1 to last value
then we are rewarded with the following informative message:
AssertionError: DataFrame.iloc[:, 0] (column name="0") are different
DataFrame.iloc[:, 0] (column name="0") values are different (33.33333 %)
[index]: [1, 2, 3]
[left]: [0.123456789, 0.123456789, 0.123456789]
[right]: [0.1234567891, 0.1234567891, 1.1234567891]
For further details see pd.testing.assert_frame_equal documentation , particularly parameters check_exact, rtol, atol for info about how to specify required degree of precision either relative or actual.
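For instance, a sketch (assuming pandas >= 1.1, where rtol and atol are available):
import pandas as pd

pd.testing.assert_frame_equal(
    pd.DataFrame.from_dict({1: x, 2: x, 3: x}, orient='index'),
    pd.DataFrame.from_dict({1: y, 2: y, 3: y}, orient='index'),
    check_exact=False, rtol=1e-3)  # tolerate ~0.1% relative difference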
Example 2: Nested dict of dicts
a = {i*10 : {1:1.1,2:2.1} for i in range(4)}
b = {i*10 : {1:1.1000001,2:2.100001} for i in range(4)}
# a = {0: {1: 1.1, 2: 2.1}, 10: {1: 1.1, 2: 2.1}, 20: {1: 1.1, 2: 2.1}, 30: {1: 1.1, 2: 2.1}}
# b = {0: {1: 1.1000001, 2: 2.100001}, 10: {1: 1.1000001, 2: 2.100001}, 20: {1: 1.1000001, 2: 2.100001}, 30: {1: 1.1000001, 2: 2.100001}}
and then do
pd.testing.assert_frame_equal(pd.DataFrame(a), pd.DataFrame(b))
This doesn't raise an error: all values are fairly similar.
However, if we change a value e.g.
b[30][2] += 1
# b = {0: {1: 1.1000001, 2: 2.1000001}, 10: {1: 1.1000001, 2: 2.1000001}, 20: {1: 1.1000001, 2: 2.1000001}, 30: {1: 1.1000001, 2: 3.1000001}}
and then run the same test, we get the following clear error message:
AssertionError: DataFrame.iloc[:, 3] (column name="30") are different
DataFrame.iloc[:, 3] (column name="30") values are different (50.0 %)
[index]: [1, 2]
[left]: [1.1, 2.1]
[right]: [1.1000001, 3.1000001]
Looking at this myself, I used the addTypeEqualityFunc method of the unittest library in combination with math.isclose.
Sample setup:
import math
from unittest import TestCase

class SomeFixtures(TestCase):
    @classmethod
    def float_comparer(cls, a, b, msg=None):
        if len(a) != len(b):
            raise cls.failureException(msg)
        if not all(map(lambda args: math.isclose(*args), zip(a, b))):
            raise cls.failureException(msg)

    def some_test(self):
        self.addTypeEqualityFunc(list, self.float_comparer)
        self.assertEqual([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
I would still use self.assertEqual(), for it stays the most informative when things go wrong. You can do that by rounding, e.g.
self.assertEqual(round_tuple((13.949999999999999, 1.121212), 2), (13.95, 1.12))
where round_tuple is
def round_tuple(t: tuple, ndigits: int) -> tuple:
    return tuple(round(e, ndigits=ndigits) for e in t)

def round_list(l: list, ndigits: int) -> list:
    return [round(e, ndigits=ndigits) for e in l]
According to the Python docs (see https://stackoverflow.com/a/41407651/1031191) you can get away with rounding issues like 13.949999999999999, because round(13.949999999999999, 2) == 13.95 is True.
You can also recursively call the already present unittest.assertAlmostEquals() and keep track of what element you are comparing, by adding a method to your unittest.
E.g. for lists of lists and list of tuples of floats:
def assertListAlmostEqual(self, first, second, delta=None, context=None):
    """Asserts lists of lists or tuples to check if they compare and
    shows which element is wrong when comparing two lists
    """
    self.assertEqual(len(first), len(second), msg="Lists have different length")
    context = [first, second] if context is None else context
    for i in range(0, len(first)):
        if isinstance(first[0], tuple):
            context.append(i)
            self.assertListAlmostEqual(first[i], second[i], delta, context=context)
        if isinstance(first[0], list):
            context.append(i)
            self.assertListAlmostEqual(first[i], second[i], delta, context=context)
        elif isinstance(first[0], float):
            msg = "Difference in \n{} and \n{}\nFaulty element index={}".format(context[0], context[1], context[2:] + [i]) \
                if context is not None else None
            self.assertAlmostEqual(first[i], second[i], delta, msg=msg)
Outputs something like:
line 23, in assertListAlmostEqual
self.assertAlmostEqual(first[i], second[i], delta, msg=msg)
AssertionError: 5.0 != 6.0 within 7 places (1.0 difference) : Difference in
[(0.0, 5.0), (8.0, 2.0), (10.0, 1.999999), (11.0, 1.9999989090909092)] and
[(0.0, 6.0), (8.0, 2.0), (10.0, 1.999999), (11.0, 1.9999989)]
Faulty element index=[0, 1]
An alternative approach is to convert your data into a comparable form by e.g turning each float into a string with fixed precision.
def comparable(data):
    """Converts `data` to a comparable structure by converting any floats to a string with fixed precision."""
    if isinstance(data, (int, str)):
        return data
    if isinstance(data, float):
        return '{:.4f}'.format(data)
    if isinstance(data, list):
        return [comparable(el) for el in data]
    if isinstance(data, tuple):
        return tuple(comparable(el) for el in data)
    if isinstance(data, dict):
        return {k: comparable(v) for k, v in data.items()}
    return data  # fall through for any other type instead of returning None
Then you can:
self.assertEquals(comparable(value1), comparable(value2))
