Merging python lists based on a 'similar' float value - python

I have a list (containing tuples) and I want to merge entries whose first elements lie within a maximum distance of each other (i.e. if the delta between the values is < 0.05). I have the following list as an example:
[(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
This should yield something like:
[(0.0, 1.883659017),(1.00422, 0.9998252466431066),(2.00425,0.9951777494430947)]
I am thinking that I can use something similar to this question (Merge nested list items based on a repeating value), although a lot of other questions yield a similar answer. The only problem I see there is that they use collections.defaultdict or itertools.groupby, which require an exact match of the element. An important addition here is that I want the first element of a merged tuple to be the weighted mean of the merged elements, for example:
if (1.001, 80) and (0.99, 20) are matched, the result should be ((1.001*80 + 0.99*20)/100, 100) = (0.9988, 100).
Is something similar possible, but with matching based on a value difference rather than an exact match?
What I was trying myself (though I don't really like the look of it) is:
import itertools

Res = 0.05
combinations = itertools.combinations(data, 2)  # note: combinations, plural; "data" is the input list
for i in combinations:
    if i[0][0] > i[1][0] - Res and i[0][0] < i[1][0] + Res:
        newValue = ...
-- UPDATE --
Based on some comments and Dawg's answer I tried the following approach:
data = {}
for fv, v in total:
    k = round(fv, 2)
    data[k] = data.get(k, 0) + v
using the following list (actual data example, instead of short example list):
total = [(0.0, 0.11630591852564721), (1.00335, 0.25158664272201053), (2.0067, 0.2707487305913156), (3.0100499999999997, 0.19327075057473678), (4.0134, 0.10295042331357719), (5.01675, 0.04364856520231155), (6.020099999999999, 0.015342958201863783), (0.0, 0.9811758192941256), (1.00422, 0.018649427348981), (0.0, 0.9024831978342827), (2.00425, 0.09269455160881204), (0.0, 0.6944298762418107), (0.99703, 0.2536959281304138), (1.99406, 0.045877927988415786)]
which then yields problems with values such as 2.0067 (rounded to 2.01) and 1.99406 (rounded to 1.99), where the actual difference is 0.01264 (far below 0.05, the value I had in mind as a 'limit' for now, though it should be changeable). Rounding the values to 1 decimal place is also not an option, since that gives a window of almost 0.1: values such as 2.04999 and 1.95001 would both yield 2.0 in that case.
The exact output was:
{0.0: 2.694394811895866, 1.0: 0.5239319982014053, 4.01: 0.10295042331357719, 5.02: 0.04364856520231155, 2.0: 0.09269455160881204, 1.99: 0.045877927988415786, 3.01: 0.19327075057473678, 6.02: 0.015342958201863783, 2.01: 0.2707487305913156}

accum = list()
data = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]
EPSILON = 0.05

newdata = {d: True for d in data}
for k, v in data:
    if not newdata[(k, v)]:
        continue
    newdata[(k, v)] = False
    # use each piece of data only once
    keys, values = [k*v], [v]
    for kk, vv in [d for d in data if newdata[d]]:
        if abs(k - kk) < EPSILON:
            keys.append(kk*vv)
            values.append(vv)
            newdata[(kk, vv)] = False
    accum.append((sum(keys)/sum(values), sum(values)))
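Printing accum for this example data should give the merged list the question asks for; the totals below line up with the rounding-based answer that follows (the last float digits may vary):

print(accum)
# [(0.0, 1.8836590171284082), (1.00422, 0.9998252466431066), (2.00425, 0.9951777494430947)]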

You can round the float values then use setdefault:
li = [(0.0, 0.9811758192941256), (1.00422, 0.9998252466431066), (0.0, 0.9024831978342827), (2.00425, 0.9951777494430947)]

data = {}
for fv, v in li:
    k = round(fv, 5)
    data.setdefault(k, 0)
    data[k] += v

print(data)
# {0.0: 1.8836590171284082, 2.00425: 0.9951777494430947, 1.00422: 0.9998252466431066}
If you want some more complex comparison (other than fixed rounding) you can create a hashable object based on the epsilon value you want and use the same method from there.
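One way to read that suggestion is to bucket each float by the epsilon before using it as a dict key. This is only a minimal sketch of that idea (the bucket helper and the EPSILON name are mine, not part of the original answer); note that two values sitting just either side of a bucket edge can still end up under different keys, which is the same issue the question's update runs into with plain rounding:

EPSILON = 0.05

def bucket(value, eps=EPSILON):
    # snap the float to the centre of its eps-wide bucket so it can serve as a dict key
    return round(value / eps) * eps

data = {}
for fv, v in li:
    k = bucket(fv)
    data[k] = data.get(k, 0) + v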
As pointed out in the comments, this works too:
data = {}
for fv, v in li:
    k = round(fv, 5)
    data[k] = data.get(k, 0) + v

Related

Python: Removing list duplicates based on first 2 inner list values

Question:
I have a list in the following format:
x = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
The algorithm:
Combine all inner lists with the same first two values; the third value doesn't have to be the same for them to be combined
e.g. "hello",0,5 is combined with "hello",0,8
But not combined with "hello",1,1
The 3rd value becomes the average of the third values: sum(all 3rd vals) / len(all 3rd vals)
Note: by all 3rd vals I am referring to the 3rd value of each inner list of duplicates
e.g. "hello",0,5 and "hello",0,8 becomes hello,0,6.5
Desired output: (Order of list doesn't matter)
x = [["hello",0,6.5], ["hi",0,6], ["hello",1,1]]
Question:
How can I implement this algorithm in Python?
Ideally it would be efficient as this will be used on very large lists.
If anything is unclear let me know and I will explain.
Edit: I have tried to change the list to a set to remove duplicates, however this doesn't account for the third variable in the inner lists and therefore doesn't work.
Solution Performance:
Thanks to everyone who has provided a solution to this problem! Here are the results based on a speed test of all the functions:
Update using running sum and count
I figured out how to improve my previous code (see original below). You can keep running totals and counts, then compute the averages at the end, which avoids recording all the individual numbers.
from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

def func(lst):
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].add(sub[2])
    lst_out = [[*k, v.calculate()] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
Original answer
This probably won't be very efficient since it has to accumulate all the values to average them. I think you could get around that by having a running average with a weighting factored in, but I'm not quite sure how to do that.
from collections import defaultdict

def avg(nums):
    return sum(nums) / len(nums)

def func(lst):
    thirds = defaultdict(list)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].append(sub[2])
    lst_out = [[*k, avg(v)] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
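For what it's worth, the "running average with a weighting factored in" mentioned above can be done with the standard incremental-mean update mean += (x - mean) / n, so nothing has to be stored per key except the current mean and a count. A minimal sketch (the function name is mine):

def func_incremental(lst):
    means = {}  # key -> (current mean, count)
    for sub in lst:
        k = tuple(sub[:2])
        mean, count = means.get(k, (0.0, 0))
        count += 1
        mean += (sub[2] - mean) / count  # incremental mean update
        means[k] = (mean, count)
    return [[*k, mean] for k, (mean, count) in means.items()]

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
print(func_incremental(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]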
You can try using groupby.
m = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
from itertools import groupby
m.sort(key=lambda x:x[0]+str(x[1]))
for i,j in groupby(m, lambda x:x[0]+str(x[1])):
ss=0
c=0.0
for k in j:
ss+=k[2]
c+=1.0
print [k[0], k[1], ss/c]
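A small variant of the same idea (mine, not part of the original answer) uses a tuple of the first two fields as the sort/group key instead of concatenating them into a string, which avoids collisions such as 'a' + '11' versus 'a1' + '1', and collects the result into a list:

from itertools import groupby

m = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]

keyfunc = lambda row: (row[0], row[1])
m.sort(key=keyfunc)
result = []
for key, rows in groupby(m, keyfunc):
    thirds = [row[2] for row in rows]
    result.append([key[0], key[1], sum(thirds) / len(thirds)])

print(result)  # -> [['hello', 0, 6.5], ['hello', 1, 1.0], ['hi', 0, 6.0]]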
This should be O(N), someone correct me if I'm wrong:
def my_algorithm(input_list):
    """
    :param input_list: list of lists in format [string, int, int]
    :return: list
    """
    # Dict in format (string, int): [int, count_int]
    # Our list is in this format, example:
    #     [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
    # so for our dict we will make the key a tuple of the first 2 values of each sublist (since that needs to be
    # unique), while the value is the running sum of the third element from our sublist plus a counter (which counts
    # every time we see a duplicate key, so we can divide by it and get the average).
    my_dict = {}
    for element in input_list:
        # key is a tuple of the first 2 values of each sublist
        key = (element[0], element[1])
        if key not in my_dict:
            # If the key does not exist, add it.
            # The value is the third element from our sublist plus a counter. Since this is the first value, set the counter to 1
            my_dict[key] = [element[2], 1]
        else:
            # If the key does exist, then increment our value and increment the counter by 1
            my_dict[key][0] += element[2]
            my_dict[key][1] += 1
    # we have a dict, so we need to convert it to a list (and calculate the averages on the way)
    return _convert_my_dict_to_list(my_dict)

def _convert_my_dict_to_list(my_dict):
    """
    :param my_dict: dict, key is in form of tuple (string, int) and values are in form of list [int, int_counter]
    :return: list
    """
    my_list = []
    for key, value in my_dict.items():
        sublist = [key[0], key[1], value[0]/value[1]]
        my_list.append(sublist)
    return my_list

my_algorithm(x)
This will return:
[['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
While your expected return is:
[["hello", 0, 6.5], ["hi", 0, 6], ["hello", 1, 1]]
If you really need ints then you can modify the _convert_my_dict_to_list function.
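For example, one way to do that modification (a sketch of my own, assuming Python 3 where / is true division) is to convert an average back to an int only when it is a whole number:

def _convert_my_dict_to_list(my_dict):
    my_list = []
    for key, value in my_dict.items():
        average = value[0] / value[1]
        if average.is_integer():
            # keep whole-number averages as ints, matching the expected output [..., 6] and [..., 1]
            average = int(average)
        my_list.append([key[0], key[1], average])
    return my_list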
Here's my variation on this theme: a groupby sans the expensive sort. I also changed the problem to make the input and output a list of tuples as these are fixed-size records:
from itertools import groupby
from operator import itemgetter
from collections import defaultdict

data = [("hello", 0, 5), ("hi", 0, 6), ("hello", 0, 8), ("hello", 1, 1)]

dictionary = defaultdict(complex)

for key, group in groupby(data, itemgetter(slice(2))):
    values = [value for (string, number, value) in group]
    # real part accumulates the running total, imaginary part the number of records seen
    dictionary[key] += sum(values) + len(values) * 1j

array = [(*key, value.real / value.imag) for key, value in dictionary.items()]

print(array)
OUTPUT
> python3 test.py
[('hello', 0, 6.5), ('hi', 0, 6.0), ('hello', 1, 1.0)]
>
Thanks to @wjandrea for the itemgetter replacement for the lambda. (And yes, I am using complex numbers in passing for the average to track the total and count.)

Extract array values that are nearly identical

I have this numpy array:
a = np.array([[8.04,9], [2.02,3], [8,10], [2,3], [8.12,18], [8.04,18],[2,8],[11,14]])
From this array, I would like to find nearly identical row values (not more than 0.05 for the first index AND not more than 1 for the second index) and create new sub-arrays.
For this example, this would give 6 different arrays (which could be part of a larger array).
a1 = [[8.04,9],[8,10]]
a2 = [[2.02,3],[2,3]]
a3 = [8.12,18]
a4 = [8.04,18]
a5 = [2,8]
a6 = [11,14]
Is there a way to do that?
Best
Here's a simple method:
for pair in a:
    cond1 = np.isclose(a[:, 0], pair[0], atol=0.05)
    cond2 = np.isclose(a[:, 1], pair[1], atol=1)
    print(a[cond1 & cond2])
With deduplication:
done = np.zeros(len(a), bool)
for ii, pair in enumerate(a):
    if done[ii]:
        continue
    cond = np.isclose(a[:, 0], pair[0], atol=0.05)
    cond &= np.isclose(a[:, 1], pair[1], atol=1)
    print(a[cond])
    done |= cond
The OP asks for grouping pairs, not simply printing pairs, so the solution proposed by John Zwink is incomplete. To get the complete answer, the idea is to convert the ndarrays into a hashable equivalent (e.g. tuple) and combine them all into a set to avoid duplication. Here:
import numpy as np

a = np.array([[8.04,9], [2.02,3], [8,10], [2,3], [8.12,18], [8.04,18], [2,8], [11,14]])

groups = set()
for pair in a:
    cond1 = np.isclose(a[:, 0], pair[0], atol=0.05)
    cond2 = np.isclose(a[:, 1], pair[1], atol=1.000000001)
    groups.add(tuple(map(tuple, a[cond1 & cond2])))

print(groups)
Result:
{((8.12, 18.0),),
((8.04, 18.0),),
((2.02, 3.0), (2.0, 3.0)),
((11.0, 14.0),),
((8.04, 9.0), (8.0, 10.0)),
((2.0, 8.0),)}
Note: I added an arbitrary epsilon to the second condition, to get the same grouping as the OP wanted.

Python list of tuple pairs: extract the first element based on nearest second element

Really a beginner at this:
I have a list: ((0.1, 5.4), (0.2, 5.6), (0.3, 6.0)) etc...
With a user input: 5.7
I would like to extract the nearest 'first element', in this case it should be 0.2.
How can I go about this?
EDIT: I guess this is called a tuple containing pairs.
The min function accepts a key parameter, which can be used to search for the minimum value:
value = 5.7
result = min(((0.1, 5.4), (0.2, 5.6), (0.3, 6.0)), key=lambda x: abs(value - x[1]))
print(result[0])
Output
0.2

python float precision: will this increment work reliably?

I use the following code to dynamically generate a list of dictionaries of every combination of incremental probabilities associated with a given list of items, such that the probabilities sum to 1. For example, if the increment_divisor were 2 (leading to increment of 1/2 or 0.5), and the list contained 3 items ['a', 'b', 'c'], then the function should return
[{'a': 0.5, 'b': 0.5, 'c': 0.0},
{'a': 0.5, 'b': 0.0, 'c': 0.5},
{'a': 0.0, 'b': 0.5, 'c': 0.5},
{'a': 1.0, 'b': 0.0, 'c': 0.0},
{'a': 0.0, 'b': 1.0, 'c': 0.0},
{'a': 0.0, 'b': 0.0, 'c': 1.0}]
The code is as follows. The script generates the incrementer by calculating 1/x and then iteratively adds the incrementer to increments until the value is >= 1.0. I already know that python floats are imprecise, but I want to be sure that the last value in increments will be something very close to 1.0.
import sys
from collections import OrderedDict
from itertools import permutations

def generate_hyp_space(list_of_items, increment_divisor):
    """Generate list of OrderedDicts filling the hypothesis space.

    Each OrderedDict is of the form ...
        { i1: 0.0, i2: 0.1, i3: 0.0, ...}
    ... where .values() sums to 1.

    Arguments:
    list_of_items -- items that receive prior weights
    increment_divisor -- Increment by 1/increment_divisor. For example,
        4 yields (0.0, 0.25, 0.5, 0.75, 1.0).
    """
    _LEN = len(list_of_items)
    if increment_divisor < _LEN:  # permutations() returns None if this is True
        print('WARN: increment_divisor too small, so was reset to '
              'len(list_of_items).', file=sys.stderr)
        increment_divisor = _LEN
    increment_size = 1/increment_divisor
    h_space = []
    increments = []
    incremental = 0.0
    while incremental <= 1.0:
        increments.append(incremental)
        incremental += increment_size
    for p in permutations(increments, _LEN):
        if sum(p) == 1.0:
            h_space.append(OrderedDict([(list_of_items[i], p[i])
                                        for i in range(_LEN)]))
    return h_space
How large can the increment_divisor be before the imprecision of float breaks the reliability of the script? (specifically, while incremental <= 1.0 and if sum(p) == 1.0)
This is a small example, but real use will involve much larger permutation space. Is there a more efficient/effective way to achieve this goal? (I already plan to implement a cache.) Would numpy datatypes be useful here for speed or precision?
The script generates the incrementer by calculating 1/x and then iteratively adds the incrementer to increments until the value is >= 1.0.
No, no, no. Just make a list of [0/x, 1/x, ..., (x-1)/x, x/x] by dividing each integer from 0 to x by x:
increments = [i/increment_divisor for i in range(increment_divisor+1)]
# or for Python 2
increments = [1.0*i/increment_divisor for i in xrange(increment_divisor+1)]
The list will always have exactly the right number of elements, no matter what rounding errors occur.
With NumPy, this would be numpy.linspace:
increments = numpy.linspace(start=0, stop=1, num=increment_divisor+1)
As for your overall problem, working in floats at all is probably a bad idea. You should be able to do the whole thing with integers and only divide by increment_divisor at the end, so you don't have to deal with floating-point precision issues in sum(p) == 1.0. Also, itertools.permutations doesn't do what you want, since it doesn't allow repeated items in the same permutation.
Instead of filtering permutations at all, you should use an algorithm based on the stars and bars idea to generate all possible ways to place len(list_of_items) - 1 separators between increment_divisor outcomes, and convert separator placements to probability dicts.
Thanks to @user2357112 for...
...pointing out the approach of working with ints until the last step.
...directing me to the stars and bars approach.
I implemented stars_and_bars as a generator as follows:
def stars_and_bars(n, k, the_list=[]):
    """Distribute n probability tokens among k endings.

    Generator implementation of the stars-and-bars algorithm.

    Arguments:
    n -- number of probability tokens (stars)
    k -- number of endings/bins (bars+1)
    """
    if n == 0:
        yield the_list + [0]*k
    elif k == 1:
        yield the_list + [n]
    else:
        for i in range(n+1):
            yield from stars_and_bars(n-i, k-1, the_list+[i])
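One possible way to plug this generator back into the original goal, i.e. turning each integer token distribution into a probability dict and dividing by the divisor only at the very end, could look like this (a sketch of mine; it returns plain dicts rather than OrderedDicts):

def generate_hyp_space(list_of_items, increment_divisor):
    # each `counts` is a list of non-negative ints summing to increment_divisor
    return [
        {item: count / increment_divisor for item, count in zip(list_of_items, counts)}
        for counts in stars_and_bars(increment_divisor, len(list_of_items))
    ]

print(generate_hyp_space(['a', 'b', 'c'], 2))
# the same 6 dicts as in the question, in a different order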

Extending grouping code to handle more general inputs

I need to group either a list of floats, or a list of (named)tuples of varying length, into groups based on whether or not the key is bigger or smaller than a given value.
For example given a list of powers of 2 less than 1, and a list of cutoffs:
twos = [2**(-(i+1)) for i in range(0,10)]
cutoffs = [0.5, 0.125, 0.03125]
Then the function
split_into_groups(twos, cutoffs)
should return
[[0.5], [0.25, 0.125], [0.0625, 0.03125], [0.015625, 0.0078125, 0.00390625, 0.001953125, 0.0009765625]]
I've implemented the function like this:
def split_by_prob(items, cutoff, groups, key=None):
    for k, g in groupby(enumerate(items), lambda (j, x): x < cutoff):
        groups.append((map(itemgetter(1), g)))
    return groups
def split_into_groups(items, cutoffs, key=None):
    groups = items
    final = []
    for i in cutoffs:
        groups = split_by_prob(groups, i, [], key)
        if len(groups) > 1:
            final.append(groups[0])
            groups = groups.pop()
        else:
            final.append(groups[0])
            return final
    final.append(groups)
    return final
The tests that these currently pass are:
>>> split_by_prob(twos, 0.5, [])
[[0.5], [0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625, 0.001953125, 0.0009765625]]
>>> split_into_groups(twos, cutoffs)
[[0.5], [0.25, 0.125], [0.0625, 0.03125], [0.015625, 0.0078125, 0.00390625, 0.001953125, 0.0009765625]]
>>> split_into_groups(twos, cutoffs_p10)
[[0.5, 0.25, 0.125], [0.0625, 0.03125, 0.015625], [0.0078125, 0.00390625, 0.001953125], [0.0009765625]]
Where cutoffs_p10 = [10**(-(i+1)) for i in range(0,5)]
I can straightforwardly extend this to a list of tuples of the form
items = zip(range(0,10), twos)
by changing
def split_by_prob(items, cutoff, groups, key=None):
    for k, g in groupby(enumerate(items), lambda (j, x): x < cutoff):
        groups.append((map(itemgetter(1), g)))
    return groups
to
def split_by_prob(items, cutoff, groups, key=None):
    for k, g in groupby(enumerate(items), lambda (j, x): x[1] < cutoff):
        groups.append((map(itemgetter(1), g)))
    return groups
How do I go about extending the original method by adding a key that defaults to a list of floats (or ints etc) but one that could handle tuples and namedtuples?
For example something like:
split_into_groups(items, cutoffs, key=items[0])
would return
[[(0,0.5)], [(1,0.25), (2,0.125)], [(3,0.0625), (4,0.03125)], [(5,0.015625), (6,0.0078125), (7,0.00390625), (8,0.001953125), (9,0.0009765625)]]
In my answer I assume the cutoffs are in increasing order, just to simplify the situation.
Discriminator detecting a slot
class Discriminator(object):
    def __init__(self, cutoffs):
        self.cutoffs = sorted(cutoffs)
        self.maxslot = len(cutoffs)

    def findslot(self, num):
        for slot, edge in enumerate(self.cutoffs):
            if num < edge:
                return slot
        return self.maxslot
grouper to put items into slots
from collections import defaultdict

def grouper(cutoffs, items, key=None):
    if not key:
        key = lambda itm: itm
    discr = Discriminator(cutoffs)
    result = defaultdict(list)
    for item in items:
        num = key(item)
        result[discr.findslot(num)].append(item)
    return result

def split_into_groups(cutoffs, numbers, key=None):
    groups = grouper(cutoffs, numbers, key)
    slot_ids = sorted(groups.keys())
    return [groups[slot_id] for slot_id in slot_ids]
Conclusions about Discriminator and grouper
The proposed Discriminator works even for unsorted items.
Conclusions about key
In fact, providing the key function is easier than it originally looked.
It is just a function passed in as a parameter; it becomes the transformation to call to get the value we want to use for comparing, grouping etc.
There is the special case of None; for that situation we have to use an identity function.
The simplest one is:
func = lambda itm: itm
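To illustrate how the key parameter covers the tuple case from the question, here is a small usage sketch (mine, not part of the original answer); for namedtuples a field accessor such as operator.attrgetter would work the same way:

twos = [2**(-(i+1)) for i in range(0, 10)]
cutoffs = [0.5, 0.125, 0.03125]

# plain floats: the identity key is used by default
print(split_into_groups(cutoffs, twos))

# (index, value) tuples: tell grouper which field holds the value to compare
items = list(zip(range(0, 10), twos))
print(split_into_groups(cutoffs, items, key=lambda t: t[1]))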
Note: all the functions above were tested by a test suite (incl. use of the key function), but I removed it from this answer as it was becoming far too long.
