I have the following example array of x-y coordinate pairs:
A = np.array([[0.33703753, 3.],
[0.90115394, 5.],
[0.91172016, 5.],
[0.93230994, 3.],
[0.08084283, 3.],
[0.71531777, 2.],
[0.07880787, 3.],
[0.03501083, 4.],
[0.69253184, 4.],
[0.62214452, 3.],
[0.26953094, 1.],
[0.4617873 , 3.],
[0.6495549 , 0.],
[0.84531478, 4.],
[0.08493308, 5.]])
My goal is to reduce this to an array with six rows by taking the average of the x-values for each y-value, like so:
array([[0.6495549 , 0. ],
[0.26953094, 1. ],
[0.71531777, 2. ],
[0.41882167, 3. ],
[0.52428582, 4. ],
[0.63260239, 5. ]])
Currently I am achieving this by converting to a pandas dataframe, performing the calculation, and converting back to a numpy array:
>>> df = pd.DataFrame({'x':A[:, 0], 'y':A[:, 1]})
>>> df.groupby('y').mean().reset_index()
y x
0 0.0 0.649555
1 1.0 0.269531
2 2.0 0.715318
3 3.0 0.418822
4 4.0 0.524286
5 5.0 0.632602
Is there a way to perform this calculation using numpy, without having to resort to the pandas library?
Here's a completely vectorized solution that only uses numpy methods and no python iteration:
sort_indices = np.argsort(A[:, 1])
unique_y, unique_indices, group_count = np.unique(A[sort_indices, 1], return_index=True, return_counts=True)
Once we have the indices and counts of all the unique elements, we can use the np.ufunc.reduceat method to collect the results of np.add for each group, and then divide by their counts to get the mean:
group_sum = np.add.reduceat(A[sort_indices, :], unique_indices, axis=0)
group_mean = group_sum / group_count[:, None]
# array([[0.6495549 , 0. ],
# [0.26953094, 1. ],
# [0.71531777, 2. ],
# [0.41882167, 3. ],
# [0.52428582, 4. ],
# [0.63260239, 5. ]])
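For convenience, the whole thing can be wrapped in a small helper (a sketch; the name groupby_mean is arbitrary):
def groupby_mean(A):
    # sort rows by the grouping column (column 1) so each group is contiguous
    A_sorted = A[np.argsort(A[:, 1])]
    # start index and size of every group
    _, unique_indices, group_count = np.unique(A_sorted[:, 1],
                                               return_index=True,
                                               return_counts=True)
    # sum each contiguous block of rows, then divide by the group sizes
    group_sum = np.add.reduceat(A_sorted, unique_indices, axis=0)
    return group_sum / group_count[:, None]

groupby_mean(A)  # same 6x2 result as above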
Benchmarks:
Comparing this solution with the other answers here (Code at tio.run) for
A contains 10k rows, with A[:, 1] containing N groups, N varies from 1 to 10k
A contains N rows (N varies from 1 to 10k), with A[:, 1] containing min(N, 1000) groups
Observations:
The numpy-only solutions (Dani's and mine) win easily: they are significantly faster than the pandas approach, possibly because the time taken to create the dataframe is an overhead the numpy-only solutions don't incur.
The pandas solution is slower than the python+numpy solutions (Jaimu's, and the previous version of this answer shown below) for smaller arrays, since it's faster to just iterate in python and get it over with than to create a dataframe first; however, those solutions become much slower than pandas as the array size or number of groups increases.
Note: The previous version of this answer iterated over the groups as returned by the accepted answer to Is there any numpy group by function? and individually calculated the mean:
First, we need to sort the array on the column you want to group by
A_s = A[A[:, 1].argsort(), :]
Then, split the sorted array into groups: np.split splits its first argument at the indices given by the second argument.
unique_elems, unique_indices = np.unique(A_s[:, 1], return_index=True)
# (array([0., 1., 2., 3., 4., 5.]), array([ 0, 1, 2, 3, 9, 12]))
split_indices = unique_indices[1:] # No need to split at the first index
groups = np.split(A_s, split_indices)
# [array([[0.6495549, 0. ]]),
# array([[0.26953094, 1. ]]),
# array([[0.71531777, 2. ]]),
# array([[0.33703753, 3. ],
# [0.93230994, 3. ],
# [0.08084283, 3. ],
# [0.07880787, 3. ],
# [0.62214452, 3. ],
# [0.4617873 , 3. ]]),
# array([[0.03501083, 4. ],
# [0.69253184, 4. ],
# [0.84531478, 4. ]]),
# array([[0.90115394, 5. ],
# [0.91172016, 5. ],
# [0.08493308, 5. ]])]
Now, groups is a list containing multiple np.arrays. Iterate over the list and take the mean of each array:
means = np.zeros((len(groups), groups[0].shape[1]))
for i, grp in enumerate(groups):
    means[i, :] = grp.mean(axis=0)
# array([[0.6495549 , 0. ],
# [0.26953094, 1. ],
# [0.71531777, 2. ],
# [0.41882167, 3. ],
# [0.52428582, 4. ],
# [0.63260239, 5. ]])
Here is a workaround using numpy.
unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.empty((unique_ys.shape[0], 2))
for i, y in enumerate(unique_ys):
    result[i, 0] = np.mean(A[indices == i, 0])
    result[i, 1] = y
print(result)
Alternative:
To make the code more pythonic, you can use a list comprehension to create the result array, instead of using a for loop.
unique_ys, indices = np.unique(A[:, 1], return_inverse=True)
result = np.array([[np.mean(A[indices == i, 0]), y] for i, y in enumerate(unique_ys)])
print(result)
Output:
[[0.6495549 0. ]
[0.26953094 1. ]
[0.71531777 2. ]
[0.41882167 3. ]
[0.52428582 4. ]
[0.63260239 5. ]]
Use np.bincount + np.unique:
sums = np.bincount(A[:, 1].astype(np.int64), weights=A[:, 0])
values, counts = np.unique(A[:, 1], return_counts=True)
res = np.vstack((sums / counts, values)).T
print(res)
Output
[[0.6495549 0. ]
[0.26953094 1. ]
[0.71531777 2. ]
[0.41882167 3. ]
[0.52428582 4. ]
[0.63260239 5. ]]
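Note that np.bincount treats the grouping column as small non-negative integer bins (hence the A[:, 1].astype(np.int64) above). If the y values were arbitrary floats, a sketch of a more general variant would use the inverse indices returned by np.unique as the bin labels:
values, inverse, counts = np.unique(A[:, 1], return_inverse=True, return_counts=True)
sums = np.bincount(inverse, weights=A[:, 0])  # per-group sum of the x column
res = np.vstack((sums / counts, values)).T    # same output as above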
If you know the y values beforehand, you could select the matching rows for each one:
for example:
A[(A[:,1]==1),0] will give you all the x values where the y value is equal to 1.
So you could go through each value of y, sum the A[:,1]==y[n] to get the number of matches, sum the x values that match, divide to make the average, and place in a new array:
B = np.zeros([6, 2])
for i in range(6):
    nmatch = sum(A[:, 1] == i)        # number of rows with this y value
    nsum = sum(A[(A[:, 1] == i), 0])  # sum of the matching x values
    B[i, 0] = nsum / nmatch           # mean x, to match the desired output layout
    B[i, 1] = i                       # y value
There must be a more pythonic way of doing this ....
This is a two part question.
Part 1
Given the following Numpy array:
foo = array([[22.5, 20. , 0. , 20. ],
[24. , 40. , 0. , 8. ],
[ 0. , 0. , 50. , 9.9],
[ 0. , 0. , 0. , 9. ],
[ 0. , 0. , 0. , 2.5]])
what is the most efficient way to (i) find the two minimal possible sums of values across columns (taking into account cell values greater than zero only) where for every column only one row is used and (ii) keep track of the array index locations visited on that route?
For example, in the example above this would be: minimum_bar = 22.5 + 20 + 50 + 2.5 = 95 at indices [0,0], [0,1], [2,2], [4,3] and next_best_bar = 22.5 + 20 + 50 + 8 = 100.5 at indices [0,0], [0,1], [2,2], [1,3].
Part 2
Similar to Part 1, but now with the constraint that the row-wise sums of foo (for any row used in the solution) must be greater than the corresponding values in an array (for example np.array([10, 10, 10, 10, 10])). In other words, sum(row[0]) > array[0] is 62.5 > 10 = True, but sum(row[4]) > array[4] is 2.5 > 10 = False.
In which case the result is: minimum_bar = 22.5 + 20 + 50 + 9.9 = 102.4 at indices [0,0], [0,1], [2,2], [2,3] and next_best_bar = 22.5 + 20 + 50 + 20 = 112.5 at indices [0,0], [0,1], [2,2], [0,3].
My initial approach was to find all possible routes (combinations of indices using itertools) but this solution does not scale well for large matrix sizes (e.g., mxn=500x500).
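For completeness, a minimal sketch of that brute-force enumeration (only practical for small matrices):
import itertools
import numpy as np

def brute_force_two_best(foo):
    # one candidate (row, value) pair per positive cell, grouped by column
    candidates = [[(r, foo[r, c]) for r in range(foo.shape[0]) if foo[r, c] > 0]
                  for c in range(foo.shape[1])]
    routes = []
    for combo in itertools.product(*candidates):
        total = sum(value for _, value in combo)
        indices = [[row, col] for col, (row, _) in enumerate(combo)]
        routes.append((total, indices))
    routes.sort(key=lambda route: route[0])
    return routes[0], routes[1]  # minimum_bar and next_best_bar, with their indices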
Here's one solution that I came up with (hopefully I didn't misunderstand anything in your question)
def minimum_routes(foo):
    # need at least two rows to have both a best and a next-best option
    assert len(foo) >= 2
    # every column must contain at least one positive value
    assert np.all(np.any(foo > 0, axis=0))
    # work on a float copy so the caller's array is left untouched
    foo = foo.astype(float)
    # ignore non-positive cells by treating them as infinitely expensive
    foo[foo <= 0] = np.inf
    # sort each column independently; row 0 now holds the per-column minima
    foo.sort(0)
    minimum_bar = foo[0]
    next_best_bar = minimum_bar.copy()
    # the next-best route swaps in the second-smallest value in the column
    # where it is closest to the smallest one
    c = np.argmin(np.abs(foo[0] - foo[1]))
    next_best_bar[c] = foo[1, c]
    return minimum_bar, next_best_bar
Let's test it:
foo = np.array([[22.5, 20. , 0. , 20. ],
[24. , 40. , 0. , 8. ],
[ 0. , 0. , 50. , 9.9],
[ 0. , 0. , 0. , 9. ],
[ 0. , 0. , 0. , 2.5]])
# PART 1
minimum_bar, next_best_bar = minimum_routes(foo)
# (array([22.5, 20. , 50. , 2.5]), array([24. , 20. , 50. , 2.5]))
# PART 2
constraint = np.array([10, 10, 10, 10, 10])
minimum_bar, next_best_bar = minimum_routes(foo[foo.sum(1) > constraint])
# (array([22.5, 20. , 50. , 8. ]), array([24., 20., 50., 8.]))
To find the indices:
np.where(foo == minimum_bar)
np.where(foo == next_best_bar)
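If you need exactly one (row, column) pair per column (np.where can return extra matches when values are tied), a small sketch of an alternative for recovering the minimum_bar indices is to take the per-column argmin of the positive entries:
masked = np.where(foo > 0, foo, np.inf)   # ignore non-positive cells
rows = masked.argmin(axis=0)              # row of the smallest positive value per column
cols = np.arange(foo.shape[1])
list(zip(rows.tolist(), cols.tolist()))   # [(0, 0), (0, 1), (2, 2), (4, 3)] for the example foo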
I have this big series of length t (t = 200K rows):
prices = [200, 100, 500, 300 ..]
and I want to calculate a matrix (tXt) where a value is calculated as:
matrix[i][j] = prices[j]/prices[i] - 1
I tried this using a double for loop, but it's too slow. Any ideas how to perform it better?
for i, p0 in enumerate(prices):
    for j, p1 in enumerate(prices):
        matrix[i][j] = p1 / p0 - 1
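Note that even a fully vectorized version has to hold the complete t x t result in memory; a quick back-of-the-envelope check (assuming float64 entries) shows that for t = 200K this is roughly 320 GB, so the matrix may need to be computed and processed in chunks:
t = 200_000
bytes_needed = t * t * 8    # float64 takes 8 bytes per entry
print(bytes_needed / 1e9)   # 320.0 (GB)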
A vectorized solution is to use np.meshgrid, with prices and 1/prices as arguments (note that prices must be a numpy array), and to multiply the results and subtract 1 in order to compute matrix[i][j] = prices[j]/prices[i] - 1:
a, b = np.meshgrid(p, 1/p)
a * b - 1
As an example:
p = np.array([1,4,2])
Would give:
a, b = np.meshgrid(p, 1/p)
a * b - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
Quick check of some of the cells (using 1-based row/column indices):
(i,j) prices[j]/prices[i] - 1
--------------------------------
(1,1) 1/1 - 1 = 0
(1,2) 4/1 - 1 = 3
(1,3) 2/1 - 1 = 1
(2,1) 1/4 - 1 = -0.75
Another solution:
[p] / np.array([p]).T - 1
array([[ 0. , 3. , 1. ],
[-0.75, 0. , -0.5 ],
[-0.5 , 1. , 0. ]])
There are two idiomatic ways of doing an outer product-type operation. Either use the .outer method of universal functions, here np.divide:
In [2]: p = np.array([10, 20, 30, 40])
In [3]: np.divide.outer(p, p)
Out[3]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
Alternatively, use broadcasting:
In [4]: p[:, None] / p[None, :]
Out[4]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
This p[None, :] can itself be spelled as a reshape, p.reshape((1, len(p))), but the indexing form reads better.
Both are equivalent to a double for-loop:
In [6]: o = np.empty((len(p), len(p)))
In [7]: for i in range(len(p)):
   ...:     for j in range(len(p)):
   ...:         o[i, j] = p[i] / p[j]
   ...:
In [8]: o
Out[8]:
array([[ 1. , 0.5 , 0.33333333, 0.25 ],
[ 2. , 1. , 0.66666667, 0.5 ],
[ 3. , 1.5 , 1. , 0.75 ],
[ 4. , 2. , 1.33333333, 1. ]])
I guess it can be done in this way
import numpy
prices = [200., 300., 100., 500., 600.]
x = numpy.array(prices).reshape(1, len(prices))
matrix = (1/x.T) * x - 1
Let me explain in detail. This matrix is the outer product of a column vector of element-wise reciprocal price values and a row vector of the original price values; a matrix of ones of the same size then needs to be subtracted from the result.
First of all we create row-vector from prices list
x = numpy.array(prices).reshape(1, len(prices))
Reshaping is required here. Otherwise your vector will have shape (len(prices),) instead of the required (1, len(prices)).
Then we compute a column vector of element-wise reciprocal price values:
(1/x.T)
Finally, we compute the resulting matrix
matrix = (1/x.T) * x - 1
Here the trailing - 1 is broadcast to a matrix of the same shape as (1/x.T) * x.
I have the following numpy array:
foo = np.array([[0.0, 10.0], [0.13216, 12.11837], [0.25379, 42.05027], [0.30874, 13.11784]])
which yields:
[[ 0. 10. ]
[ 0.13216 12.11837]
[ 0.25379 42.05027]
[ 0.30874 13.11784]]
How can I normalize the Y component of this array so that it gives me something like:
[[ 0. 0. ]
[ 0.13216 0.06 ]
[ 0.25379 1 ]
[ 0.30874 0.097]]
Referring to this Cross Validated Link, How to normalize data to 0-1 range?, it looks like you can perform min-max normalisation on the last column of foo.
v = foo[:, 1] # foo[:, -1] for the last column
foo[:, 1] = (v - v.min()) / (v.max() - v.min())
foo
array([[ 0. , 0. ],
[ 0.13216 , 0.06609523],
[ 0.25379 , 1. ],
[ 0.30874 , 0.09727968]])
Another option for performing normalisation (as suggested by OP) is using sklearn.preprocessing.normalize, which yields slightly different results -
from sklearn.preprocessing import normalize
foo[:, [-1]] = normalize(foo[:, -1, None], norm='max', axis=0)
foo
array([[ 0. , 0.2378106 ],
[ 0.13216 , 0.28818769],
[ 0.25379 , 1. ],
[ 0.30874 , 0.31195614]])
sklearn.preprocessing.MinMaxScaler can also be used (feature_range=(0, 1) is default):
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
v = foo[:, 1].reshape(-1, 1)  # MinMaxScaler expects a 2D array
v_scaled = min_max_scaler.fit_transform(v)
foo[:, 1] = v_scaled.ravel()
print(foo)
Output:
[[ 0. 0. ]
[ 0.13216 0.06609523]
[ 0.25379 1. ]
[ 0.30874 0.09727968]]
Advantage is that scaling to any range can be done.
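For example, a minimal sketch scaling the same column to the range (-1, 1) instead:
scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
foo[:, 1] = scaler.fit_transform(foo[:, [1]]).ravel()  # column values now lie in [-1, 1]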
I think you want this:
foo[:,1] = (foo[:,1] - foo[:,1].min()) / (foo[:,1].max() - foo[:,1].min())
You are trying to min-max scale between 0 and 1 only the second column.
Using sklearn.preprocessing.minmax_scale should easily solve your problem.
e.g.:
from sklearn.preprocessing import minmax_scale
column_1 = foo[:,0] #first column you don't want to scale
column_2 = minmax_scale(foo[:,1], feature_range=(0,1)) #second column you want to scale
foo_norm = np.stack((column_1, column_2), axis=1) #stack both columns to get a 2d array
Should yield
array([[0. , 0. ],
[0.13216 , 0.06609523],
[0.25379 , 1. ],
[0.30874 , 0.09727968]])
Maybe you want to min-max scale between 0 and 1 both columns. In this case, use:
foo_norm = minmax_scale(foo, feature_range=(0,1), axis=0)
Which yields
array([[0. , 0. ],
[0.42806245, 0.06609523],
[0.82201853, 1. ],
[1. , 0.09727968]])
Note: this is not to be confused with the operation that scales the norm (length) of a vector to a certain value (usually 1), which is also commonly referred to as normalization.
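To illustrate the distinction (a sketch, assuming foo still holds the original values): scaling to unit norm divides the column by its length, while min-max scaling maps its values onto [0, 1].
y = foo[:, 1]
unit_norm = y / np.linalg.norm(y)              # the vector now has L2 norm 1
min_max = (y - y.min()) / (y.max() - y.min())  # the values now span exactly [0, 1]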
I have a matrix A with shape 1.6M rows and 400 columns.
One of the columns in A (call it the output column) has binary values (0,1) with a predominance of 0's.
I want to create a new matrix B (same shape as A) by sampling rows in A with replacement such that the distribution of 0's & 1's in the output column of B becomes 50/50.
What is the efficient way to do this using python/numpy?
You could do this by:
Creating a list of all rows with 0 in the "output column" (called outputZeros), and a list of all rows with 1 in the output column (called outputOnes); then,
Sampling with replacement from outputZeros and outputOnes 1.6M times.
Here's a small example. It's not clear to me if you want the rows in B to be in any particular order, so here they first include 0s, then include 1s.
In [1]: import numpy as np, random
In [2]: A = np.random.rand(10, 2)
In [3]: A
In [4]: A[:7, 1] = 0
In [5]: A[7:, 1] = 1
In [6]: A
Out[6]:
array([[ 0.70126052, 0. ],
[ 0.51161067, 0. ],
[ 0.76511966, 0. ],
[ 0.91257144, 0. ],
[ 0.97024895, 0. ],
[ 0.55817776, 0. ],
[ 0.55963466, 0. ],
[ 0.6318139 , 1. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ]])
In [7]: outputZeros = np.where(A[:, 1] == 0)[0]
In [8]: outputZeros
Out[8]: array([0, 1, 2, 3, 4, 5, 6])
In [9]: outputOnes = np.where(A[:, 1] == 1)[0]
In [10]: outputOnes
Out[10]: array([7, 8, 9])
In [11]: B = np.zeros((10, 2))
In [12]: for i in range(10):
   ....:     if i < 5:
   ....:         B[i, :] = A[random.choice(outputZeros), :]
   ....:     else:
   ....:         B[i, :] = A[random.choice(outputOnes), :]
   ....:
In [13]: B
Out[13]:
array([[ 0.97024895, 0. ],
[ 0.97024895, 0. ],
[ 0.76511966, 0. ],
[ 0.76511966, 0. ],
[ 0.51161067, 0. ],
[ 0.90176108, 1. ],
[ 0.76033151, 1. ],
[ 0.6318139 , 1. ],
[ 0.6318139 , 1. ],
[ 0.76033151, 1. ]])
I would create a new 1D numpy array filled with values from numpy.random.random_integers(low, high=None, size=None) (deprecated in newer numpy in favour of numpy.random.randint) and swap that new array with the old one.
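A fully vectorized sketch of that resampling idea (assuming the output column is the last column of A and that B should draw half of its rows from each class):
import numpy as np

rng = np.random.default_rng()
out = A[:, -1]                                   # the binary output column
zeros = np.flatnonzero(out == 0)                 # row indices of the 0 class
ones = np.flatnonzero(out == 1)                  # row indices of the 1 class
n = len(A)
picked = np.concatenate([rng.choice(zeros, n // 2, replace=True),
                         rng.choice(ones, n - n // 2, replace=True)])
B = A[picked]                                    # same shape as A, ~50/50 output column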