Sum and aggregate over a list with equal values - python

I have a pair lists of the same length, the first containing int values and the second contains float values. I wish to replace these with another pair of lists which may be shorter, but still have the same length, in which the first list will contain only unique values, and the second list will contain the sums for each matching value. That is, if the i'th element of the first list in the new pair is x, and the indices in the first list of the original pair in which x has appeared are i_1,...,i_k, then the i'th element of the second list in the new pair should contain the sum of the values in indices i_1,...,i_k in the second list of the original pair.
An example will clarify.
Input:
([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
Ourput:
([1, 2, 3], [1.0, 0.5, 1.0])
I was trying to do some list comprehension trick here but failed. I can write a silly loop function for that, but I believe there should be something much nicer here.

Not a one-liner, but since you've not posted your solution I'll suggest this solution that is using collections.OrderedDict:
>>> from collections import OrderedDict
>>> a, b = ([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
>>> d = OrderedDict()
>>> for k, v in zip(a, b):
... d[k] = d.get(k, 0) + v
...
>>> d.keys(), d.values()
([1, 2, 3], [1.0, 0.5, 1.0])
Of course if order doesn't matter then it's better to use collections.defaultdict:
>>> from collections import defaultdict
>>> a, b = ([1, 'foo', 'foo', 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
>>> d = defaultdict(int)
>>> for k, v in zip(a, b):
d[k] += + v
...
>>> d.keys(), d.values()
([3, 1, 'foo'], [1.0, 1.0, 0.5])

one way to go is using pandas:
>>> import pandas as pd
>>> df = pd.DataFrame({'tag':[1, 2, 2, 1, 1, 3],
'val':[0.1, 0.2, 0.3, 0.4, 0.5, 1.0]})
>>> df
tag val
0 1 0.1
1 2 0.2
2 2 0.3
3 1 0.4
4 1 0.5
5 3 1.0
>>> df.groupby('tag')['val'].aggregate('sum')
tag
1 1.0
2 0.5
3 1.0
Name: val, dtype: float64

Build a map with the keys:
la,lb = ([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
m = {k:0.0 for k in la}
And fill it with the summations:
for i in xrange(len(lb)):
m[la[i]] += lb[i]
Finally, from your map:
zip(*[(k,m[k]) for k in m]*1)

Related

Python: Group same integer values, then average

I have a huge array of data and I would like to do subgroups for the values for same integers and then take their average.
For example:
a = [0, 0.5, 1, 1.5, 2, 2.5]
I want to take sub groups as follows:
[0, 0.5] [1, 1.5] [2, 2.5]
... and then take the average and put all the averages in a new array.
Assuming you want to group by the number's integer value (so the number rounded down), something like this could work:
>>> a = [0, 0.5, 1, 1.5, 2, 2.5]
>>> groups = [list(g) for _, g in itertools.groupby(a, int)]
>>> groups
[[0, 0.5], [1, 1.5], [2, 2.5]]
Then averaging becomes:
>>> [sum(grp) / len(grp) for grp in groups]
[0.25, 1.25, 2.25]
This assumes a is already sorted, as in your example.
Ref: itertools.groupby, list comprehensions.
If you have no problem using additional libraries:
import pandas as pd
import numpy as np
a = [0, 0.5, 1, 1.5, 2, 2.5]
print(pd.Series(a).groupby(np.array(a, dtype=np.int32)).mean())
Gives:
0 0.25
1 1.25
2 2.25
dtype: float64
If you want an approach with dictionary, you can go ahead like this:
dic={}
a = [0, 0.5, 1, 1.5, 2, 2.5]
for items in a:
if int(items) not in dic:
dic[int(items)]=[]
dic[int(items)].append(items)
print(dic)
for items in dic:
dic[items]=sum(dic[items])/len(dic[items])
print(dic)
You can use groupby to easily get that (you might need to sort the list first):
from itertools import groupby
from statistics import mean
a = [0, 0.5, 1, 1.5, 2, 2.5]
for k, group in groupby(a, key=int):
print(mean(group))
Will give:
0.25
1.25
2.25

Python convert a vector into inverse frequencies

So given this numpy array:
import numpy as np
vector = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
# len(vector) == 12
# 2 x ones, 4 x two, 6 x three
How can I convert this into a vector of inverse frequencies?
Such that for each value, the output contains 1 divided by the frequency of that value:
array([0.16, 0.33, 0.33, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.33, 0.33, 0.16])
[Update to a general one]
How about this one using np.histogram:
import numpy as np
l = np.array([1,2,2,3,3,3,3,3,3,2,2,1])
_u, _l = np.unique(l, return_inverse=True)
np.histogram(_l, bins=np.arange(_u.size+1))[0][_l] / _l.size
This essentially requires a grouping operation, which numpy isn't great at... but pandas is. You can do this with groupby + transform + count, and divide the result by the length of vector.
import pandas as pd
s = pd.Series(vector)
vector = (s.groupby(s).transform('count') / len(s)).values
vector
array([ 0.16666667, 0.33333333, 0.33333333, 0.5 , 0.5 ,
0.5 , 0.5 , 0.5 , 0.5 , 0.33333333,
0.33333333, 0.16666667])
You can use collections.Counter to first determine the frequency of each element. Then create an intermediate mapping dict which will contain key as the element and value as the frequency. Finally using numpy.vectorize to transform the array to your desired format
>>> import numpy as np
>>> from collections import Counter
>>> v = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
>>> freq_dict = Counter(v)
At this point the freq_dict will contains frequency of each element like
>>> freq_dict
>>> Counter({3: 6, 2: 4, 1: 2})
Next build a probability dict of the format element: probability, using dict comprehension
>>> prob_dict = dict((k,round(val/len(v),3)) for k, val in freq_dict.items())
>>> prob_dict
>>> {1: 0.167, 2: 0.333, 3: 0.5}
Finally using numpy.vectorize to get your desired output
>>> out = np.vectorize(prob_dict.get)(v)
This will produce:
>>> out
>>> array([ 0.167, 0.333, 0.333, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.333, 0.333, 0.167])

iterate through an array looking at multiple values

Test Array = [1, 2, 3, 1, 0.4, 1, 0.1, 0.4, 0.3, 1, 2]
I need to iterate through an array in order to find the first time that 3 consecutive entries are <0.5, and to return the index of this occurence.
Test Array = [1, 2, 3, 1, 0.4, 1, 0.1, 0.4, 0.3, 1, 2]
^ ^ ^ ^
(indices) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
^
So within this test array the index/value that is being looked for is 6
Alongside a proposed solution, it would be good to know what value gets returned if the '3 consecutive values <0.5' condition is not met - would it simply return nothing? or the last index number?
(I would like the returned value to be 0 if the condition is not met)
You can use zip and enumerate:
def solve(lis, num):
for i, (x,y,z) in enumerate(zip(lis, lis[1:], lis[2:])):
if all(k < num for k in (x,y,z)):
return i
#default return value if none of the items matched the condition
return -1 #or use None
...
>>> lis = [1, 2, 3, 1, 0.4, 1, 0.1, 0.4, 0.3, 1, 2]
>>> solve(lis, 0.5)
6
>>> solve(lis, 4) # for values >3 the answer is index 0,
0 # so 0 shouldn't be the default return value.
>>> solve(lis, .1)
-1
Use itertools.izip for memory efficient solution.
from itertools import groupby
items = [1, 2, 3, 1, 0.4, 1, 0.1, 0.4, 0.3, 1, 2]
def F(items, num, k):
# find first consecutive group < num of length k
groups = (list(g) for k, g in groupby(items, key=num.__gt__) if k)
return next((g[0] for g in groups if len(g) >= k), 0)
>>> F(items, 0.5, 3)
0.1

Sort a list based on a given distribution

Answering one Question, I ended up with a problem that I believe was a circumlocution way of solving which could have been done in a better way, but I was clueless
There are two list
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
optimal_partition, is one of the integer partition of the number 8 into 4 parts
I would like to sort optimal_partition, in a manner which matches the percentage distribution to as closest as possible which would mean, the individual partition should match the percent magnitude as closest as possible
So 3 -> 0.4, 2 -> 0.27 and 0.23 and 1 -> 0.1
So the final result should be
[2, 2, 3, 1]
The way I ended up solving this was
>>> percent = [0.23, 0.27, 0.4, 0.1]
>>> optimal_partition = [3, 2, 2, 1]
>>> optimal_partition_percent = zip(sorted(optimal_partition),
sorted(enumerate(percent),
key = itemgetter(1)))
>>> optimal_partition = [e for e, _ in sorted(optimal_partition_percent,
key = lambda e: e[1][0])]
>>> optimal_partition
[2, 2, 3, 1]
Can you suggest an easier way to solve this?
By easier I mean, without the need to implement multiple sorting, and storing and later rearranging based on index.
Couple of more examples:
percent = [0.25, 0.25, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
result = [2, 2, 3, 1]
percent = [0.2, 0.2, 0.4, 0.2]
optimal_partition = [3, 2, 2, 1]
result = [1, 2, 3, 2]
from numpy import take,argsort
take(opt,argsort(argsort(perc)[::-1]))
or without imports:
zip(*sorted(zip(sorted(range(len(perc)), key=perc.__getitem__)[::-1],opt)))[1]
#Test
l=[([0.23, 0.27, 0.4, 0.1],[3, 2, 2, 1]),
([0.25, 0.25, 0.4, 0.1],[3, 2, 2, 1]),
([0.2, 0.2, 0.4, 0.2],[3, 2, 2, 1])]
def f1(perc,opt):
return take(opt,argsort(argsort(perc)[::-1]))
def f2(perc,opt):
return zip(*sorted(zip(sorted(range(len(perc)),
key=perc.__getitem__)[::-1],opt)))[1]
for i in l:
perc, opt = i
print f1(perc,opt), f2(perc,opt)
# output:
# [2 2 3 1] (2, 2, 3, 1)
# [2 2 3 1] (2, 2, 3, 1)
# [1 2 3 2] (1, 2, 3, 2)
Use the fact that the percentages sum to 1:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
total = sum(optimal_partition)
output = [total*i for i in percent]
Now you need to figure out a way to redistribute the fractional components somehow. Thinking out loud:
from operator import itemgetter
intermediate = [(i[0], int(i[1]), i[1] - int(i[1])) for i in enumerate(output)]
# Sort the list by the fractional component
s = sorted(intermediate, key=itemgetter(2))
# Now, distribute the first item's fractional component to the rest, starting at the top:
for i, tup in enumerate(s):
fraction = tup[2]
# Go through the remaining items in reverse order
for index in range(len(s)-1, i, -1):
this_fraction = s[index][2]
if fraction + this_fraction >= 1:
# increment this item by 1, clear the fraction, carry the remainder
new_fraction = fraction + this_fraction -1
s[index][1] = s[index][1] + 1
s[index][2] = 0
fraction = new_fraction
else:
#just add the fraction to this element, clear the original element
s[index][2] = s[index][2] + fraction
Now, I'm not sure I'd say that's "easier". I haven't tested it, and I'm sure I got the logic wrong in that last section. In fact, I'm attempting assignment to tuples, so I know there's at least one error. But it's a different approach.

Numpy bincount() with floats

I am trying to get a bincount of a numpy array which is of the float type:
w = np.array([0.1, 0.2, 0.1, 0.3, 0.5])
print np.bincount(w)
How can you use bincount() with float values and not int?
You need to use numpy.unique before you use bincount. Otherwise it's ambiguous what you're counting. unique should be much faster than Counter for numpy arrays.
>>> w = np.array([0.1, 0.2, 0.1, 0.3, 0.5])
>>> uniqw, inverse = np.unique(w, return_inverse=True)
>>> uniqw
array([ 0.1, 0.2, 0.3, 0.5])
>>> np.bincount(inverse)
array([2, 1, 1, 1])
Since version 1.9.0, you can use np.unique directly:
w = np.array([0.1, 0.2, 0.1, 0.3, 0.5])
values, counts = np.unique(w, return_counts=True)
You want something like this?
>>> from collections import Counter
>>> w = np.array([0.1, 0.2, 0.1, 0.3, 0.5])
>>> c = Counter(w)
Counter({0.10000000000000001: 2, 0.5: 1, 0.29999999999999999: 1, 0.20000000000000001: 1})
or, more nicely output:
Counter({0.1: 2, 0.5: 1, 0.3: 1, 0.2: 1})
You can then sort it and get your values:
>>> np.array([v for k,v in sorted(c.iteritems())])
array([2, 1, 1, 1])
The output of bincount wouldn't make sense with floats:
>>> np.bincount([10,11])
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
as there is no defined sequence of floats.

Categories