Python: Group same integer values, then average - python

I have a huge array of data and I would like to group together the values that share the same integer part, then take the average of each group.
For example:
a = [0, 0.5, 1, 1.5, 2, 2.5]
I want to take sub groups as follows:
[0, 0.5] [1, 1.5] [2, 2.5]
... and then take the average and put all the averages in a new array.

Assuming you want to group by the number's integer value (so the number rounded down, which is what int does for non-negative values), something like this could work:
>>> import itertools
>>> a = [0, 0.5, 1, 1.5, 2, 2.5]
>>> groups = [list(g) for _, g in itertools.groupby(a, int)]
>>> groups
[[0, 0.5], [1, 1.5], [2, 2.5]]
Then averaging becomes:
>>> [sum(grp) / len(grp) for grp in groups]
[0.25, 1.25, 2.25]
This assumes a is already sorted, as in your example.
Ref: itertools.groupby, list comprehensions.
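If the input is not sorted, or contains negative values (where int truncates toward zero, so -0.5 and 0.5 would both map to 0), a minimal sketch along the following lines could be used instead. It sorts first and groups by math.floor; the helper name group_means is made up for illustration.

```python
import itertools
import math


def group_means(values):
    """Average the values that share the same floor (hypothetical helper)."""
    # groupby only merges adjacent items, so sort by the grouping key first
    ordered = sorted(values, key=math.floor)
    means = []
    for _, grp in itertools.groupby(ordered, key=math.floor):
        grp = list(grp)
        means.append(sum(grp) / len(grp))
    return means


result = group_means([2.5, 0, 1.5, 0.5, 2, 1])
print(result)
```

The sort costs O(n log n), so for already sorted data the plain groupby call above remains the cheaper option.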

If you have no problem using additional libraries:
import pandas as pd
import numpy as np
a = [0, 0.5, 1, 1.5, 2, 2.5]
print(pd.Series(a).groupby(np.array(a, dtype=np.int32)).mean())
Gives:
0 0.25
1 1.25
2 2.25
dtype: float64
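For a very large array, the same grouping can also be sketched in plain NumPy. This is an alternative to the pandas call above rather than part of the original answer, and it assumes that flooring is the intended grouping; the variable names are illustrative.

```python
import numpy as np

a = np.array([0, 0.5, 1, 1.5, 2, 2.5])

# The integer part of each value serves as the group key
keys = np.floor(a).astype(int)

# inverse[i] is the position of keys[i] within the sorted unique keys
unique_keys, inverse = np.unique(keys, return_inverse=True)

# Per-group sum and count, then the mean of each group
sums = np.bincount(inverse, weights=a)
counts = np.bincount(inverse)
means = sums / counts
print(unique_keys, means)
```

Unlike itertools.groupby, this does not require the input to be sorted.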

If you want an approach with a dictionary, you can go ahead like this:
dic = {}
a = [0, 0.5, 1, 1.5, 2, 2.5]
# collect the values under their integer part
for items in a:
    if int(items) not in dic:
        dic[int(items)] = []
    dic[int(items)].append(items)
print(dic)
# replace each list of values with its average
for items in dic:
    dic[items] = sum(dic[items]) / len(dic[items])
print(dic)
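The same dictionary idea can be written a little more compactly with collections.defaultdict, which removes the explicit membership check. This is only a sketch of that variation; the names groups and averages are illustrative.

```python
from collections import defaultdict

a = [0, 0.5, 1, 1.5, 2, 2.5]

# Missing keys start out as empty lists automatically
groups = defaultdict(list)
for value in a:
    groups[int(value)].append(value)

# Build a separate dict of averages instead of overwriting the groups
averages = {key: sum(vals) / len(vals) for key, vals in groups.items()}
print(averages)
```

Keeping the groups and the averages in separate dictionaries avoids changing the value type of a dictionary halfway through the script.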

You can use groupby to easily get that (you might need to sort the list first):
from itertools import groupby
from statistics import mean
a = [0, 0.5, 1, 1.5, 2, 2.5]
for k, group in groupby(a, key=int):
    print(mean(group))
Will give:
0.25
1.25
2.25

Related

Adding a column to a pandas dataframe based on other columns

Problem description
Introductory remark: For the code have a look below
Let's say we have a pandas dataframe consisting of 3 columns and 2 rows.
I'd like to add a 4th column called 'Max_LF' that will consist of an array. The value of the cell is retrieved by looking at the column 'Max_WD'. For the first row that would be 0.35, which is then compared to the values in the column 'WD', where 0.35 can be found at the third position. Therefore, the third value of the column 'LF' should be written into the column 'Max_LF'. If the value of 'Max_WD' occurs multiple times in 'WD', then all corresponding items of 'LF' should be written into 'Max_LF'.
Failed attempt
So far I have made various attempts at first retrieving the index of the 'Max_WD' item in 'WD'. The idea was then to get the items of 'LF' via that index:
df4['temp_indices'] = [i for i, x in enumerate(df4['WD']) if x == df4['Max_WD']]
However, a ValueError occurred:
raise ValueError('Lengths must match to compare')
ValueError: Lengths must match to compare
This is what the example dataframe looks like
df = pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41]})
The expected outcome should look like
df=pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41], 'Max_LF': [[3] ,[2,3], [3,4]]})
You could get it by simply using lambda as follows
df['Max_LF'] = df.apply(lambda x : [i + 1 for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
output is
LF Max_WD WD Max_LF
0 [1, 2, 3] 0.35 [0.28, 0.34, 0.35, 0.18] [3]
1 [1, 2, 3] 0.45 [0.42, 0.45, 0.45, 0.18] [2, 3]
2 [1, 2, 3] 0.41 [0.31, 0.21, 0.41, 0.41] [3, 4]
Thanks guys! With your help I was able to solve my problem.
Like Prince Francis suggested I first did
df['temp'] = df.apply(lambda x : [i for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
to get the indices of the matching 'WD' values. In a second step I could then add the actual column 'Max_LF' by doing
df['Max_LF'] = df.apply(lambda x: [x['LF'][e] for e in x['temp']], axis=1)
Thanks a lot guys!
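The two steps above can also be folded into one by pairing each 'LF' item with its 'WD' partner. The sketch below uses plain dicts as stand-ins for DataFrame rows so that it runs without pandas; matching_lf is a hypothetical helper name. With a real dataframe, the same function could presumably be passed to df.apply(matching_lf, axis=1).

```python
def matching_lf(row):
    """Return the 'LF' items whose 'WD' partner equals 'Max_WD'."""
    return [lf for lf, wd in zip(row['LF'], row['WD']) if wd == row['Max_WD']]


# Plain dicts stand in for the rows of the example dataframe
rows = [
    {'LF': [1, 2, 3, 4], 'WD': [0.28, 0.34, 0.35, 0.18], 'Max_WD': 0.35},
    {'LF': [1, 2, 3, 4], 'WD': [0.42, 0.45, 0.45, 0.18], 'Max_WD': 0.45},
    {'LF': [1, 2, 3, 4], 'WD': [0.31, 0.21, 0.41, 0.41], 'Max_WD': 0.41},
]

max_lf = [matching_lf(row) for row in rows]
print(max_lf)
```

Because this reads the 'LF' items directly, it does not rely on 'LF' happening to hold the 1-based positions. The exact == comparison on floats is safe here only because 'Max_WD' is copied from 'WD'; for computed values a tolerance-based comparison may be needed.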
You can achieve it by applying a function over axis 1.
For this, I recommend first converting the WD list into a pd.Series (or a numpy.ndarray) and then comparing all the values at once.
Assuming that you want a list of all the values at or above the threshold, you could use this:
>>> def get_max_wd(x):
...     wd = pd.Series(x.WD)
...     return list(wd[wd >= x.Max_WD])
...
>>> df.apply(get_max_wd, axis=1)
0 [0.35]
1 [0.45, 0.45]
2 [0.41, 0.41]
dtype: object
The result of the apply can then be assigned as a new column into the dataframe:
df['Max_LF'] = df.apply(get_max_wd, axis=1)
If what you are after is only the maximum value (see my comment above), you can use the max() method within the function.

Faster method for iterating through a numpy array of numpy arrays

I have a numpy array of numpy arrays like the following example:
data = [[0.4, 1.5, 2.6],
[3.4, 0.2, 0.0],
[null, 3.2, 1.0],
[1.0, 4.6, null]]
I would like an efficient way of returning the row index, column index and value if the value meets a condition.
I need the row and column values because I feed them into func_which_returns_lat_long_based_on_row_and_column(column, row) which is applied if the value meets a condition.
Finally I would like to append the value, and outputs of the function to my_list.
I have solved my problem with the nested for loop solution shown below but it is slow. I believe I should be using np.where() however I cannot figure that out.
my_list = []
for ii, array in enumerate(data):
    for jj, value in enumerate(array):
        if value > 1:
            lon, lat = func_which_returns_lat_long_based_on_row_and_column(jj, ii)
            my_list.append([value, lon, lat])
I'm hoping there is a more efficient solution than the one I'm using above.
import numpy as np
import warnings
warnings.filterwarnings('ignore')
data = [[0.4, 1.5, 2.6],
[3.4, 0.2, 0.0],
[np.nan, 3.2, 1.0],
[1.0, 4.6, np.nan]]
x = np.array(data)
i, j = np.where(x > 1 )
for a, b in zip(i, j):
    print('lon: {} lat: {} value: {}'.format(a, b, x[a,b]))
Output is
lon: 0 lat: 1 value: 1.5
lon: 0 lat: 2 value: 2.6
lon: 1 lat: 0 value: 3.4
lon: 2 lat: 1 value: 3.2
lon: 3 lat: 1 value: 4.6
Because the comparison involves np.nan, some NumPy versions emit a RuntimeWarning, which is why warnings are silenced above.
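To get from the index arrays to the my_list structure from the question, a sketch like the following might work. The conversion function to_lon_lat is a made-up placeholder with arbitrary arithmetic, since the real func_which_returns_lat_long_based_on_row_and_column is not shown.

```python
import numpy as np


def to_lon_lat(column, row):
    """Hypothetical stand-in for the question's conversion function."""
    return column * 0.5, row * 0.25


data = np.array([[0.4, 1.5, 2.6],
                 [3.4, 0.2, 0.0],
                 [np.nan, 3.2, 1.0],
                 [1.0, 4.6, np.nan]])

# Row and column indices of every element greater than 1 (NaN compares False)
rows, cols = np.where(data > 1)

# Only the matching elements are visited in Python
my_list = [
    [float(data[r, c]), *to_lon_lat(int(c), int(r))]
    for r, c in zip(rows, cols)
]
print(my_list)
```

The Python-level loop now runs once per match instead of once per element, which is where most of the saving would come from when matches are rare.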
You can use
result = np.where(arr == 15)
It returns a tuple of index arrays, one per dimension, locating the elements of arr that satisfy the condition.
Try to build a function that works on whole arrays. For instance, a function that adds the corresponding row and column index to every element of the data could look like:
import numpy as np
def func_which_returns_lat_long_based_on_row_and_column(data, indices):
    # returns each element of data plus its row and column index
    return data + indices[:,:,0] + indices[:,:,1]
data = np.array([[0.4, 1.5, 2.6],
                 [3.4, 0.2, 0.0],
                 [np.nan, 3.2, 1.0],
                 [1.0, 4.6, np.nan]])
# create a matrix of the same shape as data (plus an additional dim because they are two indices)
# with the corresponding indices of the element in it
x_range = np.arange(0,data.shape[0])
y_range = np.arange(0,data.shape[1])
grid = np.meshgrid(x_range,y_range, indexing = 'ij')
indice_matrix = np.concatenate((grid[0][:,:,None],grid[1][:,:,None]),axis=2)
# for instance:
# indice_matrix[0,0] = np.array([0,0])
# indice_matrix[1,0] = np.array([1,0])
# indice_matrix[3,1] = np.array([3,1])
# calculate the output
out = func_which_returns_lat_long_based_on_row_and_column(data,indice_matrix)
data.shape
>> (4,3)
indice_matrix.shape
>> (4, 3, 2)
indice_matrix
>>> array([[[0, 0],
[0, 1],
[0, 2]],
[[1, 0],
[1, 1],
[1, 2]],
[[2, 0],
[2, 1],
[2, 2]],
[[3, 0],
[3, 1],
[3, 2]]])
indice_matrix[2,1]
>> array([2, 1])
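As a side note, the index matrix built above with meshgrid and concatenate can probably be obtained more briefly with np.indices. The sketch below builds it both ways and checks that they agree; the names via_meshgrid and via_indices are illustrative.

```python
import numpy as np

shape = (4, 3)

# Approach from the answer: meshgrid plus concatenate
x_range = np.arange(shape[0])
y_range = np.arange(shape[1])
grid = np.meshgrid(x_range, y_range, indexing='ij')
via_meshgrid = np.concatenate((grid[0][:, :, None], grid[1][:, :, None]), axis=2)

# Shorter alternative: np.indices returns shape (2, 4, 3); move axis 0 to the end
via_indices = np.moveaxis(np.indices(shape), 0, -1)

print(via_indices.shape)
print(np.array_equal(via_meshgrid, via_indices))
```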

Python convert a vector into inverse frequencies

So given this numpy array:
import numpy as np
vector = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
# len(vector) == 12
# 2 x ones, 4 x two, 6 x three
How can I convert this into a vector of inverse frequencies?
Such that for each value, the output contains the number of occurrences of that value divided by the length of the vector:
array([0.16, 0.33, 0.33, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.33, 0.33, 0.16])
[Update to a general one]
How about this one using np.histogram:
import numpy as np
l = np.array([1,2,2,3,3,3,3,3,3,2,2,1])
_u, _l = np.unique(l, return_inverse=True)
np.histogram(_l, bins=np.arange(_u.size+1))[0][_l] / _l.size
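A variation on the same idea, sketched here as an assumption rather than as the answerer's code, is to let np.unique return the counts directly through return_counts, which makes the histogram call unnecessary.

```python
import numpy as np

vector = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])

# counts[k] is how often the k-th unique value occurs;
# inverse maps every element back to its unique value's position
_, inverse, counts = np.unique(vector, return_inverse=True, return_counts=True)

# Look up each element's count and normalise by the vector length
freqs = counts[inverse] / vector.size
print(freqs)
```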
This essentially requires a grouping operation, which numpy isn't great at... but pandas is. You can do this with groupby + transform + count, and divide the result by the length of vector.
import pandas as pd
s = pd.Series(vector)
vector = (s.groupby(s).transform('count') / len(s)).values
vector
array([ 0.16666667, 0.33333333, 0.33333333, 0.5 , 0.5 ,
0.5 , 0.5 , 0.5 , 0.5 , 0.33333333,
0.33333333, 0.16666667])
You can use collections.Counter to first determine the frequency of each element. Then create an intermediate mapping dict whose keys are the elements and whose values are the relative frequencies. Finally, use numpy.vectorize to transform the array into your desired format:
>>> import numpy as np
>>> from collections import Counter
>>> v = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1])
>>> freq_dict = Counter(v)
At this point freq_dict contains the frequency of each element:
>>> freq_dict
Counter({3: 6, 2: 4, 1: 2})
Next, build a probability dict of the form element: probability, using a dict comprehension:
>>> prob_dict = {k: round(val / len(v), 3) for k, val in freq_dict.items()}
>>> prob_dict
{1: 0.167, 2: 0.333, 3: 0.5}
Finally, use numpy.vectorize to get your desired output:
>>> out = np.vectorize(prob_dict.get)(v)
This will produce:
>>> out
array([ 0.167, 0.333, 0.333, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.333, 0.333, 0.167])
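Since np.vectorize is a convenience wrapper rather than a performance tool, the Counter approach can also be sketched with a plain list comprehension and no NumPy at all. Rounding is left out here, so the values are the exact floating-point ratios.

```python
from collections import Counter

v = [1, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1]

# Number of occurrences of each distinct value
counts = Counter(v)

# Relative frequency of each element, in the original order
out = [counts[x] / len(v) for x in v]
print(out)
```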

Sum and aggregate over a list with equal values

I have a pair of lists of the same length, the first containing int values and the second containing float values. I wish to replace these with another pair of lists which may be shorter, but still have the same length as each other, in which the first list contains only unique values and the second list contains the sums for each matching value. That is, if the i'th element of the first list in the new pair is x, and the indices in the first list of the original pair at which x appears are i_1,...,i_k, then the i'th element of the second list in the new pair should contain the sum of the values at indices i_1,...,i_k in the second list of the original pair.
An example will clarify.
Input:
([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
Output:
([1, 2, 3], [1.0, 0.5, 1.0])
I was trying to do some list comprehension trick here but failed. I can write a silly loop function for that, but I believe there should be something much nicer here.
Not a one-liner, but since you've not posted your solution I'll suggest this solution that is using collections.OrderedDict:
>>> from collections import OrderedDict
>>> a, b = ([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
>>> d = OrderedDict()
>>> for k, v in zip(a, b):
...     d[k] = d.get(k, 0) + v
...
>>> list(d.keys()), list(d.values())
([1, 2, 3], [1.0, 0.5, 1.0])
Of course if order doesn't matter then it's better to use collections.defaultdict:
>>> from collections import defaultdict
>>> a, b = ([1, 'foo', 'foo', 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
>>> d = defaultdict(int)
>>> for k, v in zip(a, b):
...     d[k] += v
...
>>> d.keys(), d.values()
([3, 1, 'foo'], [1.0, 1.0, 0.5])
One way to go is using pandas:
>>> import pandas as pd
>>> df = pd.DataFrame({'tag': [1, 2, 2, 1, 1, 3],
...                    'val': [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]})
>>> df
tag val
0 1 0.1
1 2 0.2
2 2 0.3
3 1 0.4
4 1 0.5
5 3 1.0
>>> df.groupby('tag')['val'].aggregate('sum')
tag
1 1.0
2 0.5
3 1.0
Name: val, dtype: float64
Build a map with the keys:
la,lb = ([1, 2, 2, 1, 1, 3], [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
m = {k:0.0 for k in la}
And fill it with the summations:
for i in range(len(lb)):
    m[la[i]] += lb[i]
Finally, from your map:
list(zip(*m.items()))
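Putting the dictionary-based answers together, a small helper along these lines might cover the whole task. It is a sketch that assumes Python 3.7 or later, where a regular dict keeps insertion order, so no OrderedDict is needed; sum_by_key is a made-up name.

```python
def sum_by_key(keys, values):
    """Sum the values per key, keeping keys in first-seen order (hypothetical helper)."""
    totals = {}
    for k, v in zip(keys, values):
        totals[k] = totals.get(k, 0.0) + v
    # Split the mapping back into the two parallel lists
    return list(totals.keys()), list(totals.values())


unique_keys, sums = sum_by_key([1, 2, 2, 1, 1, 3],
                               [0.1, 0.2, 0.3, 0.4, 0.5, 1.0])
print(unique_keys, sums)
```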

Sort a list based on a given distribution

While answering another question, I ended up with a roundabout solution to a sub-problem that I believe could be solved in a better way, but I could not see how.
There are two lists:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
optimal_partition is one of the integer partitions of the number 8 into 4 parts.
I would like to reorder optimal_partition so that it matches the percentage distribution as closely as possible, meaning that the largest part should sit at the position of the largest percentage, the second largest part at the second largest percentage, and so on.
So 3 -> 0.4, 2 -> 0.27 and 0.23 and 1 -> 0.1
So the final result should be
[2, 2, 3, 1]
The way I ended up solving this was
>>> from operator import itemgetter
>>> percent = [0.23, 0.27, 0.4, 0.1]
>>> optimal_partition = [3, 2, 2, 1]
>>> optimal_partition_percent = zip(sorted(optimal_partition),
...                                 sorted(enumerate(percent),
...                                        key=itemgetter(1)))
>>> optimal_partition = [e for e, _ in sorted(optimal_partition_percent,
...                                           key=lambda e: e[1][0])]
>>> optimal_partition
[2, 2, 3, 1]
Can you suggest an easier way to solve this?
By easier I mean, without the need to implement multiple sorting, and storing and later rearranging based on index.
A couple more examples:
percent = [0.25, 0.25, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
result = [2, 2, 3, 1]
percent = [0.2, 0.2, 0.4, 0.2]
optimal_partition = [3, 2, 2, 1]
result = [1, 2, 3, 2]
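Here perc and opt stand for percent and optimal_partition, with opt sorted in descending order as in the question:
from numpy import take, argsort
take(opt, argsort(argsort(perc)[::-1]))
or without imports:
list(zip(*sorted(zip(sorted(range(len(perc)), key=perc.__getitem__)[::-1], opt))))[1]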
#Test
l=[([0.23, 0.27, 0.4, 0.1],[3, 2, 2, 1]),
([0.25, 0.25, 0.4, 0.1],[3, 2, 2, 1]),
([0.2, 0.2, 0.4, 0.2],[3, 2, 2, 1])]
def f1(perc, opt):
    return take(opt, argsort(argsort(perc)[::-1]))
def f2(perc, opt):
    return list(zip(*sorted(zip(sorted(range(len(perc)),
                                       key=perc.__getitem__)[::-1], opt))))[1]
for perc, opt in l:
    print(f1(perc, opt), f2(perc, opt))
# output:
# [2 2 3 1] (2, 2, 3, 1)
# [2 2 3 1] (2, 2, 3, 1)
# [1 2 3 2] (1, 2, 3, 2)
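The one-liners above are compact but hard to read. The same rank-matching idea can be spelled out as a sketch: sort the positions by percentage, sort the parts, and hand the smallest part to the position with the smallest percentage. The name match_distribution is made up for illustration.

```python
def match_distribution(percent, partition):
    """Reorder partition so that larger parts sit at larger percentages."""
    # Positions ordered from smallest to largest percentage (stable for ties)
    order = sorted(range(len(percent)), key=percent.__getitem__)
    result = [None] * len(percent)
    # The smallest part goes to the position with the smallest percentage, etc.
    for position, part in zip(order, sorted(partition)):
        result[position] = part
    return result


print(match_distribution([0.23, 0.27, 0.4, 0.1], [3, 2, 2, 1]))
print(match_distribution([0.2, 0.2, 0.4, 0.2], [3, 2, 2, 1]))
```

Unlike the one-liners, this version does not require the partition to arrive in descending order, because it sorts the parts itself.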
Use the fact that the percentages sum to 1:
percent = [0.23, 0.27, 0.4, 0.1]
optimal_partition = [3, 2, 2, 1]
total = sum(optimal_partition)
output = [total*i for i in percent]
Now you need to figure out a way to redistribute the fractional components somehow. Thinking out loud:
from operator import itemgetter
intermediate = [(i[0], int(i[1]), i[1] - int(i[1])) for i in enumerate(output)]
# Sort the list by the fractional component
s = sorted(intermediate, key=itemgetter(2))
# Now, distribute the first item's fractional component to the rest, starting at the top:
for i, tup in enumerate(s):
    fraction = tup[2]
    # Go through the remaining items in reverse order
    for index in range(len(s)-1, i, -1):
        this_fraction = s[index][2]
        if fraction + this_fraction >= 1:
            # increment this item by 1, clear the fraction, carry the remainder
            new_fraction = fraction + this_fraction - 1
            s[index][1] = s[index][1] + 1
            s[index][2] = 0
            fraction = new_fraction
        else:
            # just add the fraction to this element, clear the original element
            s[index][2] = s[index][2] + fraction
Now, I'm not sure I'd say that's "easier". I haven't tested it, and I'm sure I got the logic wrong in that last section. In fact, I'm attempting assignment to tuples, so I know there's at least one error. But it's a different approach.
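A related technique that does run, substituted here for the untested code above rather than derived from it, is commonly known as the largest remainder method: floor every scaled percentage, then hand the leftover units to the entries with the largest fractional parts. Note that it computes an allocation directly from percent and the total instead of reordering an existing partition, so it will not always reproduce a given optimal_partition. The function name is illustrative.

```python
import math


def largest_remainder(percent, total):
    """Split total into integer parts roughly proportional to percent."""
    quotas = [total * p for p in percent]
    parts = [math.floor(q) for q in quotas]
    leftover = total - sum(parts)
    # Indices ordered by descending fractional remainder
    by_remainder = sorted(range(len(quotas)),
                          key=lambda i: quotas[i] - parts[i],
                          reverse=True)
    # Give one extra unit to each of the largest remainders
    for i in by_remainder[:leftover]:
        parts[i] += 1
    return parts


result = largest_remainder([0.23, 0.27, 0.4, 0.1], 8)
print(result)
```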
