Numpy aggregate into bins, then calculate sum? - python

I have a matrix that looks like this:
M = [[1,   200],
     [1.8, 100],
     [2,   500],
     [2.5, 300],
     [3,   400],
     [3.5, 200],
     [5,   200],
     [8,   100]]
I want to group the rows by a bin size (applied to the left column), e.g. for a bin size of 2 (the first bin holds values from 0-2, the second from 2-4, the third from 4-6, etc.):
[[1,   200],
 [1.8, 100],
 ----
 [2,   500],
 [2.5, 300],
 [3,   400],
 [3.5, 200],
 ----
 [5,   200],
 ----
 [8,   100]]
Then output a new matrix with the sum of the right columns for each group:
[200+100, 500+300+400+200, 200, 100]
What is an efficient way to sum each value based on the bin_size boundaries?

With pandas:
Make a DataFrame and then use integer division to define your bins:
import pandas as pd
df = pd.DataFrame(M)
df.groupby(df[0]//2)[1].sum()
#0
#0.0     300
#1.0    1400
#2.0     200
#4.0     100
#Name: 1, dtype: int64
Use .tolist() to get your desired output:
df.groupby(df[0]//2)[1].sum().tolist()
#[300, 1400, 200, 100]
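Note that groupby silently drops the empty 6-8 bin (there is no index 3.0 above), whereas the bincount approach below keeps it as a 0. If you prefer explicit bin edges over integer division, pandas.cut does the same grouping and also keeps empty bins; a sketch, reusing the df from above:
df.groupby(pd.cut(df[0], bins=[0, 2, 4, 6, 8, 10], right=False))[1].sum()
#0
#[0, 2)      300
#[2, 4)     1400
#[4, 6)      200
#[6, 8)        0
#[8, 10)     100
#Name: 1, dtype: int64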
With numpy.bincount:
import numpy as np
gp, vals = np.transpose(M)    # split into bin keys and values
gp = (gp//2).astype(int)      # integer division maps each key to its bin
np.bincount(gp, vals)         # the second argument is per-item weights
#array([ 300., 1400.,  200.,    0.,  100.])
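Here bincount's optional second argument is a weights array: each vals[i] is accumulated into bucket gp[i]. A minimal equivalent with np.add.at, as a sketch reusing gp and vals from above:
out = np.zeros(gp.max() + 1)
np.add.at(out, gp, vals)   # unbuffered in-place add: out[gp[i]] += vals[i]
out
#array([ 300., 1400.,  200.,    0.,  100.])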

You can make use of np.digitize and a scipy.sparse.csr_matrix here:
import numpy as np
from scipy import sparse
M = np.asarray(M)                 # the question's M as a numpy array
bins = [2, 4, 6, 8, 10]
b = np.digitize(M[:, 0], bins)    # bin index for each key
v = M[:, 1]
Now do a vectorized groupby using the csr_matrix:
sparse.csr_matrix(
    (v, b, np.arange(v.shape[0] + 1)), (v.shape[0], b.max() + 1)
).sum(0)
matrix([[ 300., 1400.,  200.,    0.,  100.]])
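The indptr argument np.arange(v.shape[0] + 1) gives the sparse matrix exactly one stored value per row, placed in column b[i], so summing over axis 0 totals each bin. To get a flat array instead of a np.matrix, one option (a sketch) is to wrap the result:
np.asarray(sparse.csr_matrix(
    (v, b, np.arange(v.shape[0] + 1)), (v.shape[0], b.max() + 1)
).sum(0)).ravel()
#array([ 300., 1400.,  200.,    0.,  100.])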

Related

Find index of a row in numpy array

Given an m x n numpy array
X = np.array([
    [  1,   2],
    [ 10,  20],
    [100, 200]
])
how do I find the index of a given row, i.e. [10, 20] -> 1?
n could be anything - 2, 3, ..., so I can also have m x 3 arrays such as
Y = np.array([
    [  1,   2,   3],
    [ 10,  20,  30],
    [100, 200, 300]
])
So I need to pass a vector of size n (here n=3), e.g. [10, 20, 30], and get back its row index 1. Again, n could be any value, like 100 or 1000.
The arrays can be big, so I don't want to convert them to lists just to use .index().
In case the source array contains the row you are looking for more than once, the function below returns all matching indices:
def find_rows(source, target):
    return np.where((source == target).all(axis=1))[0]

looking = [10, 20, 30]
Y = np.array([[  1,   2,   3],
              [ 10,  20,  30],
              [100, 200, 300],
              [ 10,  20,  30]])
print(find_rows(source=Y, target=looking))  # [1 3]
You can use numpy.equal, which broadcasts and compares the row vector against each row of the original array; a row is a match exactly when all of its elements equal the target:
import numpy as np
np.flatnonzero(np.equal(X, [10, 20]).all(1))
# [1]
np.flatnonzero(np.equal(Y, [10, 20, 30]).all(1))
# [1]
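If the arrays hold floats, exact equality is brittle; a tolerant variant (a sketch, where Y_float is hypothetical noisy float data) swaps np.equal for np.isclose:
Y_float = Y.astype(float) + 1e-12   # hypothetical float data with rounding noise
np.flatnonzero(np.isclose(Y_float, [10.0, 20.0, 30.0]).all(1))
# [1]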
You can make a function as follows:
def get_index(seq, *arrays):
    for array in arrays:
        try:
            # compare whole rows, not individual elements
            return np.where((array == seq).all(axis=1))[0][0]
        except IndexError:
            pass
then:
>>> get_index([10, 20, 30], Y)
1
Or with just indexing:
>>> np.where((Y == [10, 20, 30]).all(axis=1))[0]
array([1])

Rolling sum for a window of 2 days

I am trying to compute, in Python, a rolling 2-day sum of the Amount column, grouped by ID and based on Trans_Date, for the table below.
ID  Trans_Date  Trans_Time  Amount
1   03/23/2019  06:51:03     100
1   03/24/2019  12:32:48     600
1   03/24/2019  14:15:35     250
1   06/05/2019  16:18:21      75
2   02/01/2019  18:02:52     200
2   02/02/2019  10:03:02     150
2   02/03/2019  23:47:51     800
3   01/18/2019  11:12:58    1000
3   01/23/2019  22:12:41      15
Ultimately, I am trying to achieve the result below:
ID  Trans_Date  Trans_Time  Amount  2d_Running_Total
1   03/23/2019  06:51:03     100     100
1   03/24/2019  12:32:48     600     700
1   03/24/2019  14:15:35     250     950
1   06/05/2019  16:18:21      75      75
2   02/01/2019  18:02:52     200     200
2   02/02/2019  10:03:02     150     350
2   02/03/2019  23:47:51     800     950
3   01/18/2019  11:12:58    1000    1000
3   01/23/2019  22:12:41      15      15
This thread came very close to solving it, but for records with multiple transactions on the same day it gives the same value for every row of that day:
https://python-forum.io/Thread-Rolling-sum-for-a-window-of-2-days-Pandas
This should do it:
import pandas as pd

# create dummy data
df = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount'],
    data=[
        [1, '03/23/2019', '06:51:03', 100],
        [1, '03/24/2019', '12:32:48', 600],
        [1, '03/24/2019', '14:15:35', 250],
        [1, '06/05/2019', '16:18:21', 75],
        [2, '02/01/2019', '18:02:52', 200],
        [2, '02/02/2019', '10:03:02', 150],
        [2, '02/03/2019', '23:47:51', 800],
        [3, '01/18/2019', '11:12:58', 1000],
        [3, '01/23/2019', '22:12:41', 15]
    ]
)
# the expected output, for reference
df_out = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount', '2d_Running_Total'],
    data=[
        [1, '03/23/2019', '06:51:03', 100, 100],
        [1, '03/24/2019', '12:32:48', 600, 700],
        [1, '03/24/2019', '14:15:35', 250, 950],
        [1, '06/05/2019', '16:18:21', 75, 75],
        [2, '02/01/2019', '18:02:52', 200, 200],
        [2, '02/02/2019', '10:03:02', 150, 350],
        [2, '02/03/2019', '23:47:51', 800, 950],
        [3, '01/18/2019', '11:12:58', 1000, 1000],
        [3, '01/23/2019', '22:12:41', 15, 15]
    ]
)
# combine date and time into a datetime object and set it as the index
df['Trans_DateTime'] = pd.to_datetime(df['Trans_Date'] + ' ' + df['Trans_Time'])
df = df.set_index('Trans_DateTime')
# group by ID and apply a 2-day rolling window to the Amount column
df['2d_Running_Total'] = df.groupby('ID')['Amount'].rolling('2d').sum().values.astype(int)
df = df.reset_index(drop=True)
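One caveat worth adding: assigning .values back onto df relies on the rows already being ordered by ID and timestamp, since groupby(...).rolling(...) returns a result with its own MultiIndex. If the input might arrive unsorted, a defensive sort (a sketch, to run right after creating Trans_DateTime) keeps the alignment safe:
df = df.sort_values(['ID', 'Trans_DateTime'])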

How to vectorise a list of matrix vector multiplications using pytorch/numpy

For example, I have a list of N B x H tensors (i.e. an N x B x H tensor) and a list of N vectors (i.e. an N x B tensor). I want to multiply each B x H tensor in the list with its corresponding B-dimensional vector, resulting in an N x H tensor.
I know how to implement the computation with a single for-loop in PyTorch, but is there a vectorised implementation (i.e. no for-loop, just PyTorch/numpy operations)?
You could achieve this with torch.bmm() and some torch.squeeze()/torch.unsqueeze().
I am personally rather fond of the more generic torch.einsum(), which I find more readable:
import torch
import numpy as np

A = torch.from_numpy(np.array([[[1, 10, 100], [2, 20, 200], [3, 30, 300]],
                               [[4, 40, 400], [5, 50, 500], [6, 60, 600]]]))
B = torch.from_numpy(np.array([[ 1,  2,  3],
                               [-1, -2, -3]]))
AB = torch.einsum("nbh,nb->nh", (A, B))
print(AB)
# tensor([[   14,   140,  1400],
#         [  -32,  -320, -3200]])
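For completeness, a sketch of the torch.bmm() variant mentioned above: treat B as a batch of 1 x B row vectors, batch-multiply against the N x B x H stack, then drop the singleton dimension.
AB_bmm = torch.bmm(B.unsqueeze(1), A).squeeze(1)   # (N,1,B) @ (N,B,H) -> (N,H)
print(torch.equal(AB, AB_bmm))
# True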

Tensorflow, how to multiply a 2D tensor (matrix) by corresponding elements in a 1D vector

I have a 2D matrix M of shape [batch x dim] and a vector V of shape [batch]. How can I multiply each of the columns in the matrix by the corresponding element in V? That is, out[i][j] = M[i][j] * V[i].
I know an inefficient numpy implementation would look like this:
import numpy as np
M = np.random.uniform(size=(4, 10))
V = np.random.randint(0, 10, size=4)   # one multiplier per row

def tst(M, V):
    rows = []
    for i in range(len(M)):
        col = []
        for j in range(len(M[i])):
            col.append(M[i][j] * V[i])
        rows.append(col)
    return np.array(rows)
In tensorflow, given two tensors, what is the most efficient way to achieve this?
import tensorflow as tf
sess = tf.InteractiveSession()
M = tf.constant(np.random.normal(size=(4,10)), dtype=tf.float32)
V = tf.constant([1,2,3,4], dtype=tf.float32)
In NumPy, we would need to make V 2D and then let broadcasting do the element-wise multiplication (i.e. the Hadamard product). I am guessing it should be the same in tensorflow. So, to expand dims in tensorflow, we can use tf.newaxis (on newer versions), tf.expand_dims, or a reshape with tf.reshape -
tf.multiply(M, V[:,tf.newaxis])
tf.multiply(M, tf.expand_dims(V,1))
tf.multiply(M, tf.reshape(V, (-1, 1)))
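For reference, the broadcasting equivalent of the loop in the question is a one-liner in plain NumPy (a sketch, reusing the NumPy M and V defined there):
out = M * V[:, np.newaxis]   # scales row i of M by V[i]
np.allclose(out, tst(M, V))  # True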
In addition to @Divakar's answer, I would like to note that the order of M and V doesn't matter: tf.multiply broadcasts during multiplication in either order.
Example:
In [55]: M.eval()
Out[55]:
array([[1, 2, 3, 4],
       [2, 3, 4, 5],
       [3, 4, 5, 6]], dtype=int32)

In [56]: V.eval()
Out[56]: array([10, 20, 30], dtype=int32)

In [57]: tf.multiply(M, V[:, tf.newaxis]).eval()
Out[57]:
array([[ 10,  20,  30,  40],
       [ 40,  60,  80, 100],
       [ 90, 120, 150, 180]], dtype=int32)

In [58]: tf.multiply(V[:, tf.newaxis], M).eval()
Out[58]:
array([[ 10,  20,  30,  40],
       [ 40,  60,  80, 100],
       [ 90, 120, 150, 180]], dtype=int32)
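The answers above use the TF1 session API. In TensorFlow 2.x the same broadcasting works eagerly; a minimal sketch, assuming TF 2.x:
import tensorflow as tf
M = tf.constant([[1., 2.], [3., 4.]])
V = tf.constant([10., 100.])
print((M * V[:, tf.newaxis]).numpy())
# [[ 10.  20.]
#  [300. 400.]]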

Get minimum x and y from 2D numpy array of points

Given a numpy 2D array of points, i.e. a 3D array whose third dimension has size 2, how do I get the minimum x and y coordinates over all points?
Examples:
First:
I edited my original example, since it was wrong.
data = np.array(
    [[[ 0,  1],
      [ 2,  3],
      [ 4,  5]],
     [[11, 12],
      [13, 14],
      [15, 16]]])
minx = 0  # data[0][0][0]
miny = 1  # data[0][0][1]
Second (a 4 x 4 x 2 array):
array([[[ 0, 77],
        [29, 12],
        [28, 71],
        [46, 17]],
       [[45, 76],
        [33, 82],
        [14, 17],
        [ 3, 18]],
       [[99, 40],
        [96,  3],
        [74, 60],
        [ 4, 57]],
       [[67, 57],
        [23, 81],
        [12, 12],
        [45, 98]]])
minx = 0  # data[0][0][0]
miny = 3  # data[2][1][1]
Is there an easy way to get the minimum x and y coordinates of all points in data? I played around with amin and different axis values, but nothing worked.
Clarification:
My array stores the positions of different robots over time. The first dimension is time, the second is the index of a robot. The third dimension is then either the x or the y coordinate of a robot at a given time.
Since I want to draw their paths as pixels, I need to normalise my data so that the points are as close as possible to the origin without becoming negative. I thought that subtracting [minx, miny] from every point would do that for me.
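For reference, once the per-coordinate minima are known, that normalisation is a single broadcast subtraction; a minimal sketch, assuming data is the time x robots x 2 array described above:
mins = data.min(axis=(0, 1))   # array([minx, miny])
normalised = data - mins       # shifts all points so the minima sit at the origin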
alko's answer didn't work for me, so here's what I did:
import numpy as np
array = np.arange(15).reshape(5, 3)
# (row, col) index of the overall minimum element
x, y = np.unravel_index(np.argmin(array), array.shape)
Seems you need consecutive mins along different axes. For your first example:
>>> np.min(np.min(data, axis=1), axis=0)
array([ 0, 1])
For the second:
>>> np.min(np.min(data, axis=1), axis=0)
array([0, 3])
The same expression can be stated more compactly (in numpy 1.7 or newer, where axis accepts a tuple), as pointed out by @Jamie:
>>> np.min(data, axis=(1, 0))
array([0, 3])
You should take the min of a slice of the array. Assuming the first coordinate is x and the second is y (i.e. for a plain 2D array of points):
minx = min(a[:, 0])
miny = min(a[:, 1])
>>> a = np.array([[1, 2], [3, 4], [5, 6]])
>>> a
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> min(a[:, 0])
1
>>> max(a[:, 0])
5
>>> max(a[:, 1])
6
minx = np.min(data[..., 0])   # min over all x coordinates
miny = np.min(data[..., 1])   # min over all y coordinates
You can use the function numpy.amin to find the minima along the desired axis.
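For example, as a sketch with the second data array from the question:
np.amin(data, axis=(0, 1))
# array([0, 3])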
