I have a dataframe (df) with three columns (user, vector, and group); the vector column holds a list of several values in each row.
df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'], 'vector': [[1, 0, 2, 0], [1, 8, 0, 2],[6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0],[6, 0, 0, 2]], 'group': ['A', 'B', 'C', 'B', 'A', 'A']})
I would like to calculate for each group, the sum of dimensions in all rows divided by the total number of rows for this group.
For example:
For group A: [(1+3+6)/3, (0+8+0)/3, (2+0+0)/3, (0+0+2)/3] ≈ [3.33, 2.67, 0.67, 0.67].
For group B: [(1+5)/2, (8+0)/2, (0+2)/2, (2+2)/2] = [3, 4, 1, 2].
For group C: [6, 2, 0, 0].
So, the expected result is an array:
group A: [3.33, 2.67, 0.67, 0.67]
group B: [3, 4, 1, 2]
group C: [6, 2, 0, 0]
I'm not sure if you were looking for the results stored in a single array/dataframe, or if you're just looking to get the results as separate arrays.
If the latter, something like this should work for you:
for group in df.group.unique():
    print(f'Group {group} results:')
    tmp_df = pd.DataFrame(df[df.group == group]['vector'].tolist())
    print(tmp_df.mean().values)
Output:
Group A results:
[3.33333333 2.66666667 0.66666667 0.66666667]
Group B results:
[3. 4. 1. 2.]
Group C results:
[6. 2. 0. 0.]
It's a little clunky, but it gets the job done if you're just looking to get the results.
The loop filters the dataframe by group, turns that group's vectors into their own tmp_df, and takes the mean of each column.
You could easily save those arrays for further manipulation or whatever else you need.
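For instance, a minimal sketch that collects each group's result into a dict (the results name is just illustrative):
results = {}
for group in df.group.unique():
    tmp_df = pd.DataFrame(df[df.group == group]['vector'].tolist())
    results[group] = tmp_df.mean().values
# results['A'] -> array([3.33333333, 2.66666667, 0.66666667, 0.66666667])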
Hope that helps!
Take advantage of numpy:
import numpy as np
out = (df.groupby('group')['vector']
         .agg(lambda x: np.vstack(x).mean(0).round(2)))
print(out)
Output:
group
A [3.33, 2.67, 0.67, 0.67]
B [3.0, 4.0, 1.0, 2.0]
C [6.0, 2.0, 0.0, 0.0]
Name: vector, dtype: object
as DataFrame
out = (df.groupby('group', as_index=False)['vector']
         .agg(lambda x: np.vstack(x).mean(0).round(2)))
Output:
group vector
0 A [3.33, 2.67, 0.67, 0.67]
1 B [3.0, 4.0, 1.0, 2.0]
2 C [6.0, 2.0, 0.0, 0.0]
as array
out = np.vstack(df.groupby('group')['vector']
                  .agg(lambda x: np.vstack(x).mean(0).round(2)))
Output:
[[3.33 2.67 0.67 0.67]
[3. 4. 1. 2. ]
[6. 2. 0. 0. ]]
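If a plain dict keyed by group is more convenient downstream, the Series result converts directly (a small convenience sketch, not part of the original outputs):
result = (df.groupby('group')['vector']
            .agg(lambda x: np.vstack(x).mean(0).round(2))
            .to_dict())
# {'A': array([3.33, 2.67, 0.67, 0.67]), 'B': array([3., 4., 1., 2.]), 'C': array([6., 2., 0., 0.])}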
I am curious whether there are more optimal ways to compute this "rolling weighted sum" (I'm unsure of the actual terminology, so I'll clarify with an example). I'm asking because I'm certain my current code snippet is not optimal with respect to memory usage, and there is an opportunity to improve its performance by using numpy's more advanced functions.
Example:
import numpy as np
A = np.append(np.linspace(0, 1, 10), np.linspace(1.1, 2, 30))
np.random.seed(0)
B = np.random.randint(3, size=40) + 1
# list of [(weight, (lower, upper))]
d = [(1, (-0.25, -0.20)), (0.5, (-0.20, -0.10)), (2, (-0.10, 0.15))]
In Python 3.7:
## A
array([0. , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
0.55555556, 0.66666667, 0.77777778, 0.88888889, 1. ,
1.1 , 1.13103448, 1.16206897, 1.19310345, 1.22413793,
1.25517241, 1.2862069 , 1.31724138, 1.34827586, 1.37931034,
1.41034483, 1.44137931, 1.47241379, 1.50344828, 1.53448276,
1.56551724, 1.59655172, 1.62758621, 1.65862069, 1.68965517,
1.72068966, 1.75172414, 1.78275862, 1.8137931 , 1.84482759,
1.87586207, 1.90689655, 1.93793103, 1.96896552, 2. ])
## B
array([1, 2, 1, 2, 2, 3, 1, 3, 1, 1, 1, 3, 2, 3, 3, 1, 2, 2, 2, 2, 1, 2,
1, 1, 2, 3, 1, 3, 1, 2, 2, 3, 1, 2, 2, 2, 1, 3, 1, 3])
Expected Solution:
array([ 6. , 6.5, 8. , 10.5, 12. , 11. , 11.5, 11.5, 6.5, 13.5, 25. ,
27.5, 30.5, 34.5, 37.5, 36. , 35. , 35. , 34. , 34.5, 34. , 36.5,
33. , 34. , 34.5, 34.5, 36. , 39. , 37. , 36. , 37. , 36.5, 37.5,
39. , 36.5, 37.5, 34. , 31. , 27.5, 23. ])
The logic I want to translate into code:
Let's look at how 10.5 (the fourth element in the expected solution) is computed. d is a collection of tuples whose first element is a weight (a float) and whose second element is a pair of bounds in the form (lower, upper).
We look at the fourth element of A (0.33333333) and apply bounds for each tuple in d. For the first tuple in d:
0.33333333 + (-0.25) = 0.08333333
0.33333333 + (-0.20) = 0.13333333
We go back to A to see if there are any elements between the bounds (0.08333333, 0.13333333). Because the second element of A (0.11111111) falls in this range, we pull the second element of B (2), multiply it by its weight from d (1), and add the product to the fourth element of the expected output.
After iterating across all tuples in d, the fourth element of the expected output is computed as:
1 * 2 + 0.5 * 1 + 2 * (2 + 2) = 10.5
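For reference, a direct unvectorized translation of this logic would look something like the sketch below (D_ref is my name for the output, not from the original):
D_ref = np.zeros(len(A))
for i, a in enumerate(A):
    for weight, (lower, upper) in d:
        # elements of A strictly inside the shifted bounds (a + lower, a + upper)
        mask = (A > a + lower) & (A < a + upper)
        # weighted sum of the corresponding elements of B
        D_ref[i] += weight * B[mask].sum()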
Here is my attempted code:
D = np.zeros(len(A))
for v in d:
    weight, (_lower, _upper) = v
    # shifted bounds for every element of A
    lower, upper = A + _lower, A + _upper
    # (len(A), len(A)) tiling of A; this is the memory-heavy part
    _A = np.tile(A, (len(A), 1))
    # mask[i, j] is True if A[j] lies strictly between A[i] + _lower and A[i] + _upper
    __A = np.bitwise_and(_A > lower.reshape(-1, 1), _A < upper.reshape(-1, 1))
    D += weight * (__A @ B)
D
Hopefully this makes sense. Please feel free to ask clarifying questions. Thanks!
Since the intervals (-0.25, -0.20), (-0.20, -0.10) and (-0.10, 0.15) together form a partition of the interval (-0.25, 0.15), you can find the indices where the shifted bounds would be inserted into A to maintain order. Those indices specify the slices of B to perform the additions on. In short:
partition = np.array([-0.25, -0.20, -0.10, 0.15])
weights = np.array([1, 0.5, 2])
out = []
for n in A:
    idx = np.searchsorted(A, n + partition)
    results = np.add.reduceat(B[:idx[-1]], idx[:-1])
    out.append(np.dot(results, weights))
>>> print(out)
[7.5, 7.5, 8.0, 10.5, 12.0, 11.0, 11.5, 11.5, 6.5, 13.5, 27.5, 27.5, 31.5, 35.5, 37.5, 37.0, 36.0, 35.0, 34.0, 34.5, 34.0, 36.5, 33.0, 34.0, 34.5, 34.5, 36.0, 39.0, 37.0, 36.0, 37.0, 36.5, 37.5, 39.0, 36.5, 37.5, 34.0, 31.0, 27.5, 23.0]
Note that the results are wrong if there are empty slices of B: for an empty slice, np.add.reduceat returns the single element at the start index rather than 0.
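One way to guard against that (a sketch, not part of the original answer; it still assumes every start index lies within the truncated array) is to zero out the positions where consecutive insertion indices coincide:
out = []
for n in A:
    idx = np.searchsorted(A, n + partition)
    results = np.add.reduceat(B[:idx[-1]], idx[:-1])
    results[idx[:-1] == idx[1:]] = 0  # empty slices would otherwise yield B[idx[k]]
    out.append(np.dot(results, weights))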
Credit to @mathfux for providing me enough guidance. Here's the final code solution that I developed based on the conversation here:
partition = np.array([-0.25, -0.20, -0.10, 0.15])
weights = np.array([1, 0.5, 2])
idx = np.searchsorted(A, partition + A[:, None])
# pairs of (start, stop) slice boundaries for each subinterval
_idx = np.lib.stride_tricks.sliding_window_view(idx, 2, axis=1)
values = np.apply_along_axis(lambda x: B[slice(*x)].sum(), 2, _idx)
values @ weights
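As a possible further refinement (my own sketch, not from the discussion): slice sums can also be taken from prefix sums, which avoids both np.apply_along_axis and the empty-slice pitfall of reduceat. Under the same partition and weights setup:
idx = np.searchsorted(A, partition + A[:, None])  # (len(A), 4) slice boundaries
csum = np.concatenate(([0.0], np.cumsum(B)))      # prefix sums of B
# sum of B[start:stop] equals csum[stop] - csum[start]; empty slices give 0
values = csum[idx[:, 1:]] - csum[idx[:, :-1]]
result = values @ weights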
I have a pytorch tensor
t = torch.tensor(
[[1.0, 1.5, 0.5, 2.0],
[5.0, 3.0, 4.5, 5.5],
[0.5, 1.0, 3.0, 2.0]]
)
t[:, [-1]] gives me the last column's value for each row:
tensor([[2.0000],
[5.5000],
[2.0000]])
However, I want to slice values at a different column per row. For example, for the 1st, 2nd and 3rd rows of t, I want to slice at indices 2, -1, and 0 respectively, to get the following tensor:
tensor([[0.5],
[5.5],
[0.5]])
How can I do it in torch?
t[[i for i in range(3)], [2, -1, 0]]
The list comprehension creates a list filled with row indexes, then you specify the column index for every row.
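If you'd rather avoid building a Python list, the same indexing works with torch.arange for the rows (negative column indices are allowed in this numpy-style indexing):
t[torch.arange(t.shape[0]), torch.tensor([2, -1, 0])]
# tensor([0.5000, 5.5000, 0.5000])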
You can use the following:
t = torch.tensor(
[[1.0, 1.5, 0.5, 2.0],
[5.0, 3.0, 4.5, 5.5],
[0.5, 1.0, 3.0, 2.0]]
)
t
>tensor([[1.0000, 1.5000, 0.5000, 2.0000],
[5.0000, 3.0000, 4.5000, 5.5000],
[0.5000, 1.0000, 3.0000, 2.0000]])
rows = [0, 1, 2]
cols = [2, -1, 0]
t[rows, cols]
>tensor([0.5000, 5.5000, 0.5000])
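If you want to keep the column dimension, as in the desired output shown in the question, torch.gather is one option (a sketch; gather requires non-negative indices, hence the modulo):
idx = torch.tensor([2, -1, 0]) % t.shape[1]  # normalize negative indices for gather
t.gather(1, idx.unsqueeze(1))
# tensor([[0.5000],
#         [5.5000],
#         [0.5000]])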
I would like to reduce a variable number of elements (or slices) of an array multiple times and put the results into a new array. Kind of like a masked np.apply_along_axis, but staying in numpy.
For example, to reduce by mean:
to_reduce = np.array([
[0, 1, 1, 0, 0],
[0, 0, 0, 1, 1],
[1, 0, 1, 0, 1],
[1, 1, 1, 1, 0]]).astype(bool)
arr = np.array([
[1.0, 2.0, 3.0],
[1.0, 2.0, 4.0],
[2.0, 2.0, 3.0],
[2.0, 2.0, 4.0],
[1.0, 0.0, 3.0]])
I want:
np.array([
[1.5, 2.0, 3.5],
[1.5, 1.0, 3.5],
[1.33333, 1.33333, 3.0],
[1.5, 2.0, 3.5]])
The slow way would be:
out = np.empty((4, 3))
for j, mask in enumerate(to_reduce):
    out[j] = np.mean(arr[mask], axis=0)
Here's one simple and efficient way with matrix-multiplication -
In [56]: to_reduce.dot(arr)/to_reduce.sum(1)[:,None]
Out[56]:
array([[1.5 , 2. , 3.5 ],
[1.5 , 1. , 3.5 ],
[1.33333333, 1.33333333, 3. ],
[1.5 , 2. , 3.5 ]])
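The boolean matrix product computes, for each mask row, the column-wise sums of the selected rows of arr, and dividing by the per-row counts turns those sums into means. A quick sanity check against the loop version (my own sketch):
sums = to_reduce @ arr                         # (4, 3): per-mask column sums
counts = to_reduce.sum(axis=1, keepdims=True)  # number of rows each mask selects
fast = sums / counts
slow = np.array([arr[mask].mean(axis=0) for mask in to_reduce])
assert np.allclose(fast, slow)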
Notice that when you apply pandas.cut to a dataframe column, the output shows the bin of each element along with Name:, Length:, dtype:, and Categories lines. I just want the Categories array printed so I can obtain just the ranges of the bins I asked for. For example, with bins=4 applied to a dataframe of the numbers 1, 2, 3, 4, 5, I would want the output to print solely the ranges of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? It can be anything, even if it doesn't require printing "Categories".
I'm guessing that you would just like to get the 'bins' from pd.cut().
If so, you can simply set retbins=True; see the docs for pd.cut.
For example:
In[01]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)
Out[01]:
cats:
0 (0.996, 2.0]
1 (0.996, 2.0]
2 (2.0, 3.0]
3 (3.0, 4.0]
4 (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]
bins:
array([0.996, 2. , 3. , 4. , 5. ])
Then you can reuse the bins as you please, e.g.:
lst = [1, 2, 3]
category = pd.cut(lst, bins)
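And if the goal is literally just the Categories part of the printout, the intervals are also exposed on the categorical result itself; a sketch using cats from above (the exact repr varies by pandas version):
print(cats.cat.categories)
# IntervalIndex([(0.996, 2.0], (2.0, 3.0], (3.0, 4.0], (4.0, 5.0]], ...)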
For anyone who has come here to see how to select a particular bin from the result of pd.cut: you can use pd.Interval:
df['bin'] = pd.cut(df['y'], [0.1, .2,.3,.4,.5, .6,.7,.8 ,.9])
print(df["bin"].value_counts())
Output:
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7, 0.8)])
I have the following problem with sorting a 2D array using the function argsort.
More precisely, let's assume I have 5 points and have calculated the euclidean distances between them, which are stored in the 2D array D:
D=np.array([[0,0.3,0.4,0.2,0.5],[0.3,0,0.2,0.6,0.1],
[0.4,0.2,0,0.5,0],[0.2,0.6,0.5,0,0.7],[0.5,0.1,0,0.7,0]])
D
array([[ 0. , 0.3, 0.4, 0.2, 0.5],
[ 0.3, 0. , 0.2, 0.6, 0.1],
[ 0.4, 0.2, 0. , 0.5, 0. ],
[ 0.2, 0.6, 0.5, 0. , 0.7],
[ 0.5, 0.1, 0. , 0.7, 0. ]])
Each element D[i,j] (i,j=0,...,4) shows the distance between point i and point j. The diagonal entries are of course equal to zero, as they show the distance of a point to itself. However, two or more points can overlap. For instance, in this particular case, point 4 is located at the same position as point 2, so the distances D[2,4] and D[4,2] are equal to zero.
Now, I want to sort this array D: for each point i I want to know the indices of its neighbouring points, from the closest to the furthest. Of course, for a given point i, the first point/index in the sorted array should be i itself, i.e. the closest point to point i is i. I used the function argsort:
N = np.argsort(D)
N
array([[0, 3, 1, 2, 4],
[1, 4, 2, 0, 3],
[2, 4, 1, 0, 3],
[3, 0, 2, 1, 4],
[2, 4, 1, 0, 3]])
This function sorts the distances properly until it gets to point 4: the first entry of the 4th row (counting from zero) is not 4 (D[4,4]=0) as I would like; I would like the 4th row to be [4, 2, 1, 0, 3]. The first entry is 2 because points 2 and 4 overlap, so D[4,2]=D[4,4]=0, and among entries with the same value argsort always selects the first one.
Is there a way to fix this so that the sorted array N[i,j] of D[i,j] always starts with the indices corresponding to the diagonal entries D[i,i]=0?
Thank you for your help,
MarcoC
One way would be to fill the diagonal elements with something less than the global minimum and then use argsort -
In [286]: np.fill_diagonal(D,D.min()-1) # Or use -1 for filling
# if we know beforehand that the global minimum is 0
In [287]: np.argsort(D)
Out[287]:
array([[0, 3, 1, 2, 4],
[1, 4, 2, 0, 3],
[2, 4, 1, 0, 3],
[3, 0, 2, 1, 4],
[4, 2, 1, 0, 3]])
If you don't want the input array to be changed, make a copy and then do the diagonal filling.
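A minimal sketch of that copy-first variant:
D2 = D.copy()
np.fill_diagonal(D2, D2.min() - 1)  # the original D stays untouched
N = np.argsort(D2)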
How about this:
import numpy as np
D = np.array([[ 0. , 0.3, 0.4, 0.2, 0.5],
[ 0.3, 0. , 0.2, 0.6, 0.1],
[ 0.4, 0.2, 0. , 0.5, 0. ],
[ 0.2, 0.6, 0.5, 0. , 0.7],
[ 0.5, 0.1, 0. , 0.7, 0. ]])
s = np.argsort(D)
line = np.argwhere(s[:,0] != np.arange(D.shape[0]))[0,0]
column = np.argwhere(s[line,:] == line)[0,0]
s[line,0], s[line, column] = s[line, column], s[line,0]
Just find the rows that don't have the diagonal element in front using numpy.argwhere, then find the column to swap with, and swap the elements. Afterwards, s contains what you want.
This works for your example. In the general case, where numpy.argwhere can return several elements, one would have to run a loop over those elements instead of just taking [0,0] at the end of the two lines above; see the sketch below.
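A sketch of that generalization, looping over every row whose first entry is not its own index:
rows = np.argwhere(s[:, 0] != np.arange(D.shape[0])).ravel()
for line in rows:
    column = np.argwhere(s[line, :] == line)[0, 0]
    s[line, 0], s[line, column] = s[line, column], s[line, 0]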
Hope I could help.