I have a numpy/pandas list of values:
a = np.random.randint(-100, 100, 10000)
b = a/100
I want to apply a custom cumsum function, but I haven't found a way to do it without loops. The custom function keeps the running sum between an upper limit of 1 and a lower limit of -1: if the sum already sits at a limit and the value to add would push it further out, the effective "add" becomes 0.
If the sum is between -1 and 1 but adding the value would cross a limit, the "added" value becomes the remainder to -1 or 1.
Here is the loop version:
def cumsum_with_limits(values):
    cumsum_values = []
    sum = 0
    for i in values:
        if sum+i <= 1 and sum+i >= -1:
            sum += i
            cumsum_values.append(sum)
        elif sum+i >= 1:
            d = 1-sum # Remainder to 1
            sum += d
            cumsum_values.append(sum)
        elif sum+i <= -1:
            d = -1-sum # Remainder to -1
            sum += d
            cumsum_values.append(sum)
    return cumsum_values
Is there any way to vectorize this? I need to run this function on large datasets and performance is my current issue. Appreciate any help!
Update: Fixed the code a bit, and a little clarification for the outputs:
Using np.random.seed(0), the first 6 values are:
b = [0.72, -0.53, 0.17, 0.92, -0.33, 0.95]
Expected output:
o = [0.72, 0.19, 0.36, 1, 0.67, 1]
Loops aren't necessarily undesirable. If performance is an issue, consider numba. There's a ~330x improvement without materially changing your logic:
import numpy as np
from numba import njit

np.random.seed(0)
a = np.random.randint(-100, 100, 10000)
b = a/100

@njit
def cumsum_with_limits_nb(values):
    n = len(values)
    res = np.empty(n)
    sum_val = 0
    for i in range(n):
        x = values[i]
        if (sum_val+x <= 1) and (sum_val+x >= -1):
            sum_val += x
        elif sum_val+x >= 1:
            d = 1-sum_val # Remainder to 1
            sum_val += d
        elif sum_val+x <= -1:
            d = -1-sum_val # Remainder to -1
            sum_val += d
        res[i] = sum_val # Store the running (clipped) sum, as the loop version does
    return res
assert np.isclose(cumsum_with_limits(b), cumsum_with_limits_nb(b)).all()
If you don't mind sacrificing some performance, you can rewrite this loop more succinctly:
@njit
def cumsum_with_limits_nb2(values):
    n = len(values)
    res = np.empty(n)
    sum_val = 0
    for i in range(n):
        x = values[i]
        next_sum = sum_val + x
        if np.abs(next_sum) >= 1:
            x = np.sign(next_sum) - sum_val
        sum_val += x
        res[i] = sum_val # Store the running (clipped) sum
    return res
With similar performance to nb2, here's an alternative (thanks to @jdehesa):
@njit
def cumsum_with_limits_nb3(values):
    n = len(values)
    res = np.empty(n)
    sum_val = 0
    for i in range(n):
        x = min(max(sum_val + values[i], -1) , 1) - sum_val
        sum_val += x
        res[i] = sum_val # Store the running (clipped) sum
    return res
Performance comparisons:
assert np.isclose(cumsum_with_limits(b), cumsum_with_limits_nb(b)).all()
assert np.isclose(cumsum_with_limits(b), cumsum_with_limits_nb2(b)).all()
assert np.isclose(cumsum_with_limits(b), cumsum_with_limits_nb3(b)).all()
%timeit cumsum_with_limits(b) # 12.5 ms per loop
%timeit cumsum_with_limits_nb(b) # 40.9 µs per loop
%timeit cumsum_with_limits_nb2(b) # 54.7 µs per loop
%timeit cumsum_with_limits_nb3(b) # 54 µs per loop
Start with a regular cumsum:
b = ...
s = np.cumsum(b)
Find the first clip point:
i = np.argmax((s[0:] > 1) | (s[0:] < -1))
Adjust everything that follows:
s[i:] += (np.sign(s[i]) - s[i])
Rinse and repeat. This still requires a loop, but only over the adjustment points, whose number is generally expected to be much smaller than the total array size.
b = ...
s = np.cumsum(b)
while True:
    i = np.argmax((s[0:] > 1) | (s[0:] < -1))
    if np.abs(s[i]) <= 1:
        break
    s[i:] += (np.sign(s[i]) - s[i])
I still haven't found a way to completely pre-compute the adjustment points up front, so I would have to guess that the numba solution will be faster than this, even if you compiled this with numba.
Starting with np.random.seed(0), your original example has 3090 adjustment points, which is approximately 1/3 of the array. Unfortunately, with all the temporary arrays and extra sums, that makes the algorithmic complexity of my solution tend to O(n^2). This is completely unacceptable.
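For reference, the adjustment loop above can be wrapped in a function and checked against a plain Python loop. This is only a sketch: the names `cumsum_clip_iterative` and `cumsum_clip_loop` are made up for illustration, the limits are assumed to be symmetric (`-limit` and `+limit`), and each pass only rescans the part of the array after the last adjustment point.

```python
import numpy as np

def cumsum_clip_iterative(values, limit=1.0):
    """Clipped cumsum via repeated adjustment of a plain cumsum (hypothetical helper)."""
    s = np.cumsum(np.asarray(values, dtype=float))
    start = 0
    while True:
        out_of_bounds = np.abs(s[start:]) > limit
        if not out_of_bounds.any():
            break
        i = start + int(np.argmax(out_of_bounds))  # first remaining violation
        target = np.sign(s[i]) * limit
        s[i + 1:] += target - s[i]                  # shift everything after the clip point
        s[i] = target                               # pin the clip point exactly on the limit
        start = i + 1
    return s

def cumsum_clip_loop(values, limit=1.0):
    """Straightforward reference loop (hypothetical helper)."""
    total, out = 0.0, []
    for v in values:
        total = min(max(total + v, -limit), limit)
        out.append(total)
    return np.array(out)

b6 = [0.72, -0.53, 0.17, 0.92, -0.33, 0.95]  # first six values quoted in the question
print(np.round(cumsum_clip_iterative(b6), 2))
```

Every adjustment still touches the whole tail of the array, so the quadratic worst case described above remains.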
I thought I had already answered the generic question of "cumulative sum with bounds" in the past, but I can't find it.
This solution also uses numba and is a bit more general (custom bounds) and concise than the ones given by @jpp.
It operates on the OP's problem (10K values, bounds at -1, 1) in 40 µs.
import numpy as np
from numba import njit

@njit
def cumsum_clip(a, xmin=-np.inf, xmax=np.inf):
    res = np.empty_like(a)
    c = 0
    for i in range(len(a)):
        c = min(max(c + a[i], xmin), xmax)
        res[i] = c
    return res
Example
np.random.seed(0)
x = np.random.randint(-100, 100, 10_000) / 100
>>> x[:6]
array([ 0.72, -0.53, 0.17, 0.92, -0.33, 0.95])
>>> cumsum_clip(x, -1, 1)[:6]
array([0.72, 0.19, 0.36, 1. , 0.67, 1. ])
%timeit cumsum_clip(x, -1, 1)
39.3 µs ± 31 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Note: you can specify other bounds, e.g.:
>>> cumsum_clip(x, 0, 1)[:10]
array([0.72, 0.19, 0.36, 1. , 0.67, 1. , 1. , 0.09, 0. , 0. ])
Or omit one of the bounds (for example here specifying only an upper bound):
>>> cumsum_clip(x, xmax=1)[:10]
array([ 0.72, 0.19, 0.36, 1. , 0.67, 1. , 1. , 0.09, -0.7 , -1.34])
Of course, it preserves the original dtype:
np.random.seed(0)
x = np.random.randint(-10, 10, 10)
>>> cumsum_clip(x, 0, 10)
array([ 2, 7, 0, 0, 0, 0, 0, 9, 10, 4])
>>> cumsum_clip(x, 0, 10).dtype
dtype('int64')
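If numba is not available, the same recurrence can be written with `itertools.accumulate` from the standard library. This is a rough sketch for checking results on small inputs, not a fast replacement: the name `cumsum_clip_py` is hypothetical, it returns a plain list rather than an array, and it assumes Python 3.8+ for the `initial` argument.

```python
from itertools import accumulate

def cumsum_clip_py(values, xmin=float("-inf"), xmax=float("inf")):
    """Running sum clamped to [xmin, xmax] after every step (hypothetical helper)."""
    def step(total, x):
        return min(max(total + x, xmin), xmax)
    # initial=0 seeds the running total; drop that seed from the result
    return list(accumulate(values, step, initial=0))[1:]

x = [0.72, -0.53, 0.17, 0.92, -0.33, 0.95]  # first six values quoted in the question
print([round(v, 2) for v in cumsum_clip_py(x, -1, 1)])
# [0.72, 0.19, 0.36, 1, 0.67, 1]
```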
This is an extension of the question posed here (quoted below)
I have a matrix (2d numpy ndarray, to be precise):
A = np.array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
And I want to roll each row of A independently, according to roll
values in another array:
r = np.array([2, 0, -1])
That is, I want to do this:
print(np.array([np.roll(row, x) for row,x in zip(A, r)]))
[[0 0 4]
[1 2 3]
[0 5 0]]
Is there a way to do this efficiently? Perhaps using fancy indexing
tricks?
The accepted solution was:
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Use always a negative shift, so that column_indices are valid.
# (could also use modulo operation)
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
result = A[rows, column_indices]
I would basically like to do the same thing, except that when an index gets rolled "past" the end of the row, I would like the vacated positions on the other side of the row to be padded with NaN, rather than having the values wrap around to the "front" of the row periodically.
Maybe using np.pad somehow? But I can't figure out how to get that to pad different rows by different amounts.
Inspired by Roll rows of a matrix independently's solution, here's a vectorized one based on np.lib.stride_tricks.as_strided -
import numpy as np
from skimage.util.shape import view_as_windows as viewW

def strided_indexing_roll(a, r):
    # Concatenate with NaN padding on both sides to cover all rolls
    p = np.full((a.shape[0],a.shape[1]-1),np.nan)
    a_ext = np.concatenate((p,a,p),axis=1)

    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext,(1,n))[np.arange(len(r)), -r + (n-1),0]
Sample run -
In [76]: a
Out[76]:
array([[4, 0, 0],
[1, 2, 3],
[0, 0, 5]])
In [77]: r
Out[77]: array([ 2, 0, -1])
In [78]: strided_indexing_roll(a, r)
Out[78]:
array([[nan, nan, 4.],
[ 1., 2., 3.],
[ 0., 5., nan]])
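The same windowing idea can be sketched without the scikit-image dependency by using `numpy.lib.stride_tricks.sliding_window_view`, which assumes NumPy 1.20 or newer. The function name `roll_rows_nan` is made up for this example, and, unlike the version above, it pads by a full row width on each side and clips the shifts, so a shift of `n` or more simply yields an all-NaN row.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def roll_rows_nan(a, r):
    """Shift each row of a by r[i] columns, filling vacated cells with NaN (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    r = np.asarray(r)
    n = a.shape[1]
    pad = np.full((a.shape[0], n), np.nan)
    a_ext = np.concatenate((pad, a, pad), axis=1)    # each row has width 3n
    windows = sliding_window_view(a_ext, n, axis=1)  # shape (rows, 2n + 1, n)
    start = n - np.clip(r, -n, n)                    # window start for each row
    return windows[np.arange(a.shape[0]), start]

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
print(roll_rows_nan(A, r))
```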
I was able to hack this together with linear indexing. It gets the right result but performs rather slowly on large arrays.
import numpy as np

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]]).astype(float)
r = np.array([2, 0, -1])
rows, column_indices = np.ogrid[:A.shape[0], :A.shape[1]]
# Always use a non-negative shift, so that column_indices are valid.
# (could also use modulo operation)
r_old = r.copy()
r[r < 0] += A.shape[1]
column_indices = column_indices - r[:,np.newaxis]
result = A[rows, column_indices]
# replace with NaNs
row_length = result.shape[-1]
pad_inds = []
for ind,i in enumerate(r_old):
    if i > 0:
        inds2pad = [np.ravel_multi_index((ind,) + (j,),result.shape) for j in range(i)]
        pad_inds.extend(inds2pad)
    if i < 0:
        inds2pad = [np.ravel_multi_index((ind,) + (j,),result.shape) for j in range(row_length+i,row_length)]
        pad_inds.extend(inds2pad)
result.ravel()[pad_inds] = np.nan
Gives the expected result:
print(result)
[[ nan nan 4.]
[ 1. 2. 3.]
[ 0. 5. nan]]
Based on @Seberg and @yann-dubois answers in the non-nan case, I've written a method that:
Is faster than the current answer
Works on ndarrays of any shape (specify the row-axis using the axis argument)
Allows for setting fill to either np.nan, any other "fill value" or False to allow regular rolling across the array edge.
Benchmarking
cols, rows = 1024, 2048
arr = np.stack(rows*(np.arange(cols,dtype=float),))
shifts = np.random.randint(-cols, cols, rows)
np.testing.assert_array_almost_equal(row_roll(arr, shifts), strided_indexing_roll(arr, shifts))
# True
%timeit row_roll(arr, shifts)
# 25.9 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit strided_indexing_roll(arr, shifts)
# 29.7 ms ± 446 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def row_roll(arr, shifts, axis=1, fill=np.nan):
    """Apply an independent roll for each dimension of a single axis.

    Parameters
    ----------
    arr : np.ndarray
        Array of any shape.
    shifts : np.ndarray, dtype int. Shape: `(arr.shape[:axis],)`.
        Amount to roll each row by. Positive shifts row right.
    axis : int
        Axis along which elements are shifted.
    fill : bool or float
        Value used to fill the positions vacated by the shift.
        If False, values just roll across the edges instead.
    """
    if np.issubdtype(arr.dtype, np.integer) and isinstance(fill, float):
        arr = arr.astype(float)

    shifts2 = shifts.copy()
    arr = np.swapaxes(arr,axis,-1)
    # list() so that the last index array can be replaced below
    all_idcs = list(np.ogrid[tuple(slice(0,n) for n in arr.shape)])

    # Convert to a positive shift
    shifts2[shifts2 < 0] += arr.shape[-1]
    all_idcs[-1] = all_idcs[-1] - shifts2[:, np.newaxis]

    result = arr[tuple(all_idcs)]

    if fill is not False:
        # Create mask of row positions above negative shifts
        # or below positive shifts. Then set them to the fill value.
        *_, nrows, ncols = arr.shape

        mask_neg = shifts < 0
        mask_pos = shifts >= 0

        shifts_pos = shifts.copy()
        shifts_pos[mask_neg] = 0
        shifts_neg = shifts.copy()
        shifts_neg[mask_pos] = ncols+1 # need to be bigger than the biggest positive shift
        shifts_neg[mask_neg] = shifts[mask_neg] % ncols

        indices = np.stack(nrows*(np.arange(ncols),))
        nanmask = (indices < shifts_pos[:, None]) | (indices >= shifts_neg[:, None])
        result[nanmask] = fill

    arr = np.swapaxes(result,-1,axis)

    return arr
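For the plain 2D case, the masking step can also be sketched by computing, for every output cell, the column it is read from and marking out-of-range sources as invalid. This is an illustrative alternative, not the method above: the name `shift_rows_fill` is hypothetical, and it always converts the input to float so that NaN can be stored.

```python
import numpy as np

def shift_rows_fill(a, r, fill=np.nan):
    """Shift row i of a by r[i] columns; cells with no source get `fill` (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    r = np.asarray(r)
    n = a.shape[1]
    src = np.arange(n)[None, :] - r[:, None]  # source column of every output cell
    valid = (src >= 0) & (src < n)            # False where the source falls off the row
    gathered = np.take_along_axis(a, np.clip(src, 0, n - 1), axis=1)
    return np.where(valid, gathered, fill)

A = np.array([[4, 0, 0],
              [1, 2, 3],
              [0, 0, 5]])
r = np.array([2, 0, -1])
print(shift_rows_fill(A, r))
```

Because the source indices are clipped before gathering, shifts larger than the row width need no special handling; those rows simply come out fully filled.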
I am stuck on a problem. I have two 2-D numpy arrays, filled with x and y coordinates. Those arrays might look like:
array1([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]],
array2([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]],
Now I have to find "duplicate nodes". However, I also have to consider nodes as equal within a given tolerance of the coordinates, so I can't use solutions like this. Since my arrays are quite big (~200,000 rows each), two simple for loops are not an option either. My final output should look like this:
output([[(1.23, 5.63)],
[(2.32, 7.65)]],
I would appreciate some hints.
Cheers,
To compare two nodes with a given tolerance, I recommend using numpy.isclose(), where you can set a relative and an absolute tolerance.
numpy.isclose(1.24, 1.25, atol=1e-1)
# True
numpy.isclose([1.24, 2.31], [1.25, 2.32], atol=1e-1)
# [True, True]
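As a small illustration of how the two tolerances combine: numpy.isclose(a, b) tests `|a - b| <= atol + rtol * |b|`, with defaults `rtol=1e-5` and `atol=1e-8`. The variable names below are just for this example; setting one tolerance to zero isolates the effect of the other.

```python
import numpy as np

# Absolute tolerance only: |a - b| <= atol
abs_small = np.isclose(1.24, 1.25, rtol=0, atol=1e-1)    # difference 0.01
abs_large = np.isclose(100.0, 100.5, rtol=0, atol=1e-1)  # difference 0.5

# Relative tolerance only: |a - b| <= rtol * |b|
rel_large = np.isclose(100.0, 100.5, rtol=1e-2, atol=0)  # 0.5 <= 0.01 * 100.5

print(abs_small, abs_large, rel_large)  # True False True
```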
Instead of using two for loops, you can use itertools.product() to go through all pairs. The following code does what you want:
import itertools
import numpy as np

array1 = np.array([[1.22, 5.64],
                   [2.31, 7.63],
                   [4.94, 4.15]])
array2 = np.array([[1.23, 5.63],
                   [6.31, 10.63],
                   [2.32, 7.64]])
output = np.empty((0,2))
for i0, i1 in itertools.product(np.arange(array1.shape[0]),
                                np.arange(array2.shape[0])):
    if np.all(np.isclose(array1[i0], array2[i1], atol=1e-1)):
        output = np.concatenate((output, [array2[i1]]), axis=0)
# output = [[ 1.23 5.63]
#           [ 2.32 7.64]]
Defining an isclose function similar to numpy.isclose, but a bit faster (mostly due to not checking any input and not supporting both relative and absolute tolerance):
import numpy as np
array1 = np.array([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]])
array2 = np.array([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]])
def isclose(x, y, atol):
    return np.abs(x - y) < atol
Now comes the hard part. We need to calculate if any two values are close within the inner most dimension. For this I reshape the arrays in such a way that the first array has its values along the second dimension, replicated across the first and the second array has its values along the first dimension, replicated along the second (note the 1, 3 and 3, 1):
In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]:
array([[[ True, True],
[False, False],
[False, False]],
[[False, False],
[False, False],
[False, False]],
[[False, False],
[ True, True],
[False, False]]], dtype=bool)
Now we want all entries where the value is close to any other value (along the same dimension):
In [93]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0)
Out[93]:
array([[ True, True],
[ True, True],
[False, False]], dtype=bool)
Then we want only those where both values of the tuple are close:
In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True, True, False], dtype=bool)
And finally, we can use this to index array1:
In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]:
array([[[ 1.22, 5.64]],
[[ 2.31, 7.63]]])
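If you want to, you can swap the any and all calls, i.e. use .all(axis=-1).any(axis=0). Note that the swapped form is slightly stricter: it requires both coordinates to be close to the same node of array2, which is usually what you want. One might be faster than the other in your case.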
The 3 in the reshape calls needs to be substituted for the actual length of your data.
This algorithm will have the same bad runtime of the other answer using itertools.product, but at least the actual looping is done implicitly by numpy and is implemented in C. This is visible in the timings:
In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Where the pares function is the code defined by @Ferran Parés in another answer, and array1_pares and array2_pares are the arrays already reshaped as in that answer.
And for larger arrays it becomes more obvious:
array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array1_pares = array1.reshape(1000, 2)
array2_pares = array2.reshape(1000, 2)
In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In the end this is limited by the available system memory. My machine (16GB RAM) can still handle arrays of length 20000, but that pushes it almost to 100%. It also takes about 12s:
In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
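Since the full broadcast comparison is limited by memory, one option is to process one of the arrays in blocks so that only a `(chunk, len(a), 2)` temporary exists at any time. The sketch below is an assumption-laden variant rather than the code above: the name `close_rows_chunked` and the `chunk` parameter are made up, it returns rows of the second array (as in the question's expected output), and it requires both coordinates to be close to the same node. The run time is still quadratic.

```python
import numpy as np

def close_rows_chunked(a, b, atol, chunk=1024):
    """Rows of b lying within atol (per coordinate) of some row of a (hypothetical helper)."""
    a = np.asarray(a, dtype=float).reshape(-1, 2)
    b = np.asarray(b, dtype=float).reshape(-1, 2)
    keep = np.zeros(len(b), dtype=bool)
    for start in range(0, len(b), chunk):
        block = b[start:start + chunk]
        # (len(block), len(a), 2) array of absolute coordinate differences
        diff = np.abs(block[:, None, :] - a[None, :, :])
        # both coordinates close to the same row of a, for at least one row of a
        keep[start:start + chunk] = (diff < atol).all(axis=-1).any(axis=1)
    return b[keep]

array1 = np.array([[1.22, 5.64], [2.31, 7.63], [4.94, 4.15]])
array2 = np.array([[1.23, 5.63], [6.31, 10.63], [2.32, 7.65]])
print(close_rows_chunked(array1, array2, 0.03, chunk=2))
```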
There are many possible ways to define that tolerance. Since we are talking about XY coordinates, we are most probably talking about Euclidean distances when setting that tolerance value. So we can use a Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient in both memory and run time. The implementation would look something like this -
from scipy.spatial import cKDTree

# Assuming a default tolerance value of 1 here
def intersect_close(a, b, tol=1):
    # Get closest distances for each pt in b
    dist = cKDTree(a).query(b, k=1)[0] # k=1 selects the single closest neighbor

    # Check the distances against the given tolerance value and
    # thus filter out rows of b for the final output
    return b[dist <= tol]
Sample step-by-step run -
# Input 2D arrays
In [68]: a
Out[68]:
array([[1.22, 5.64],
[2.31, 7.63],
[4.94, 4.15]])
In [69]: b
Out[69]:
array([[ 1.23, 5.63],
[ 6.31, 10.63],
[ 2.32, 7.65]])
# Get closest distances for each pt in b
In [70]: dist = cKDTree(a).query(b, k=1)[0]
In [71]: dist
Out[71]: array([0.01414214, 5. , 0.02236068])
# Mask of distances within the given tolerance
In [72]: tol = 1
In [73]: dist <= tol
Out[73]: array([ True, False, True])
# Finally filter out valid ones off b
In [74]: b[dist <= tol]
Out[74]:
array([[1.23, 5.63],
[2.32, 7.65]])
Timings on 200,000 pts -
In [20]: N = 200000
...: np.random.seed(0)
...: a = np.random.rand(N,2)
...: b = np.random.rand(N,2)
In [21]: %timeit intersect_close(a, b)
1 loop, best of 3: 1.37 s per loop
As commented, scaling and rounding your numbers might allow you to use intersect1d or the equivalent.
And if you have just 2 columns, it might work to turn it into a 1d array of complex dtype.
But you might also want to keep in mind what intersect1d does:
if not assume_unique:
    # Might be faster than unique( intersect1d( ar1, ar2 ) )?
    ar1 = unique(ar1)
    ar2 = unique(ar2)
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]
unique has been enhanced to handle rows (the axis parameter), but intersect1d has not. In any case it uses argsort to put similar elements next to each other, and then skips the duplicates.
Notice that intersect1d concatenates the unique arrays, sorts, and again finds the duplicates.
I know you didn't want a loop version, but to promote conceptualization of the problem here's one anyways:
In [581]: a = np.array([(1.22, 5.64),
...: (2.31, 7.63),
...: (4.94, 4.15)])
...:
...: b = np.array([(1.23, 5.63),
...: (6.31, 10.63),
...: (2.32, 7.65)])
...:
I removed a layer of nesting in your arrays.
In [582]: c = []
In [583]: for a1 in a:
...: for b1 in b:
...: if np.allclose(a1,b1, atol=0.5): c.append((a1,b1))
or as list comprehension
In [586]: [(a1,b1) for a1 in a for b1 in b if np.allclose(a1,b1,atol=0.5)]
Out[586]:
[(array([1.22, 5.64]), array([1.23, 5.63])),
(array([2.31, 7.63]), array([2.32, 7.65]))]
complex approximation
In [604]: aa = (a*10).astype(int)
In [605]: aa
Out[605]:
array([[12, 56],
[23, 76],
[49, 41]])
In [606]: ac=aa[:,0]+1j*aa[:,1]
In [607]: bb = (b*10).astype(int)
In [608]: bc=bb[:,0]+1j*bb[:,1]
In [609]: np.intersect1d(ac,bc)
Out[609]: array([12.+56.j, 23.+76.j])
intersect inspired
Concatenate the arrays, sort them, take difference, and find the small differences:
In [616]: ab = np.concatenate((a,b),axis=0)
In [618]: np.lexsort(ab.T)
Out[618]: array([2, 3, 0, 1, 5, 4], dtype=int32)
In [619]: ab1 = ab[_,:]
In [620]: ab1
Out[620]:
array([[ 4.94, 4.15],
[ 1.23, 5.63],
[ 1.22, 5.64],
[ 2.31, 7.63],
[ 2.32, 7.65],
[ 6.31, 10.63]])
In [621]: ab1[1:]-ab1[:-1]
Out[621]:
array([[-3.71, 1.48],
[-0.01, 0.01],
[ 1.09, 1.99],
[ 0.01, 0.02],
[ 3.99, 2.98]])
In [623]: ((ab1[1:]-ab1[:-1])<.1).all(axis=1) # refine with abs
Out[623]: array([False, True, False, True, False])
In [626]: np.where(Out[623])
Out[626]: (array([1, 3], dtype=int32),)
In [627]: ab[_]
Out[627]:
array([[2.31, 7.63],
[1.23, 5.63]])
Maybe you could try this using pure NumPy and a self-defined function:
import numpy as np
#Your Example
xDA=np.array([[1.22, 5.64],[2.31, 7.63],[4.94, 4.15],[6.1,6.2]])
yDA=np.array([[1.23, 5.63],[6.31, 10.63],[2.32, 7.65],[3.1,9.2]])
###Try this large sample###
#xDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
#yDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
print(xDA)
print(yDA)

#Match x to y
def np_matrix(myx,myy,calp=0.2):
    Xxx = np.transpose(np.repeat(myx[:, np.newaxis], myy.size, axis=1))
    Yyy = np.repeat(myy[:, np.newaxis], myx.size, axis=1)
    # define a caliper
    matches = {}
    dist = np.abs(Xxx - Yyy)
    for m in range(0, myx.size):
        if (np.min(dist[:, m]) <= calp) or not calp:
            matches[m] = np.argmin(dist[:, m])
    return matches

alwd_dist=0.1
xc1=xDA[:,1]
yc1=yDA[:,1]
m1=np_matrix(xc1,yc1,alwd_dist)
xc0=xDA[:,0]
yc0=yDA[:,0]
m0=np_matrix(xc0,yc0,alwd_dist)
shared_items = set(m1.items()) & set(m0.items())
if (int(len(shared_items))==0):
    print("No Matched Items based on given allowed distance:",alwd_dist)
else:
    print("Matched:")
    for ke in shared_items:
        print(xDA[ke[0]],yDA[ke[1]])
I have a data set in numpy with an x vector and a y vector. The y vector takes only two values, +1 or -1 (or 0 or 1), because it's a binary-valued function. I know I can just loop over the data set and map each +1 to 1 and each -1 to 0, one by one. However, I was hoping to map the whole vector y = [N x 1] to a vector y = [N x 2] in one step. Since the data can be quite large, I want to do it as quickly as possible (I also don't want to store a second copy of the data set).
Is there a vectorized way to do this transformation quickly in Python?
For reference, here is the looping code:
def transform_data_to_one_hot(X,Y):
    N = Y.shape[0]
    Y_new = np.zeros((N,2))
    for i in range(N):
        if Y[i] == -1:
            Y_new[i] = np.array([1,0])
        else:
            Y_new[i] = np.array([0,1])
    return Y_new
Let's take the parity function using Rademacher variables (i.e. +1, -1 instead of 0 and 1). In this case the parity function is just the product function:
>>> X = np.array([[-1,-1],[-1,1],[1,-1],[1,1]])
>>> X
array([[-1, -1],
[-1, 1],
[ 1, -1],
[ 1, 1]])
>>> Y = np.reshape(np.prod(X,axis=1),[4,1])
>>> Y
array([[ 1],
[-1],
[-1],
[ 1]])
The Y vector, when one-hot encoded, should be:
>>> Y
array([[ 0,1],
[1,0],
[1,0],
[ 0,1]])
Here's one initialization based -
def initialization_based(y):
    out = np.zeros((len(y),2),dtype=int)
    out[np.arange(out.shape[0]), (y==1).astype(int)] = 1
    return out
Sample run -
In [244]: y
Out[244]: array([ 1, -1, 1, 1, -1, 1, -1, 1])
In [245]: initialization_based(y)
Out[245]:
array([[0, 1],
[1, 0],
[0, 1],
[0, 1],
[1, 0],
[0, 1],
[1, 0],
[0, 1]])
Other ways to use initialization method -
def initialization_based_v2(y):
    out = np.zeros((len(y),2),dtype=int)
    out[np.arange(out.shape[0]), (y+1)//2] = 1
    return out

def initialization_based_v3(y):
    yc = y.copy()
    yc[yc==-1] = 0
    out = np.zeros((len(y),2),dtype=int)
    out[np.arange(out.shape[0]), yc] = 1
    return out
The two new additions only differ in the way we are setting up the column indices. For version 2, we have those computed with simply : (y+1)//2, while for the version 3 as : yc = y.copy(); yc[yc==-1] = 0.
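The column-index trick can also be combined with row lookup in an identity matrix, a common NumPy idiom for one-hot encoding. The sketch below is not one of the timed versions: `one_hot_eye` is a hypothetical name, and it assumes the labels are exactly -1 and +1.

```python
import numpy as np

def one_hot_eye(y):
    """Map labels -1/+1 to rows [1, 0] / [0, 1] by indexing an identity matrix (hypothetical helper)."""
    idx = (np.asarray(y).ravel() + 1) // 2  # -1 -> 0, +1 -> 1
    return np.eye(2, dtype=int)[idx]

Y = np.array([[1], [-1], [-1], [1]])
print(one_hot_eye(Y))
# [[0 1]
#  [1 0]
#  [1 0]
#  [0 1]]
```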
Another one that gets pretty close to @Eric's one, but uses boolean array -
def initialization_based_v4(y):
    out = np.empty((len(y),2),dtype=int)
    mask = y == 1
    out[:,0] = ~mask # column 0 marks -1, as in the question's expected output
    out[:,1] = mask # column 1 marks +1
    return out
Runtime test -
In [320]: y = 2*np.random.randint(0,2,(1000000))-1
In [321]: %timeit sign_to_one_hot(y, dtype=int)
...: %timeit initialization_based(y)
...: %timeit initialization_based_v2(y)
...: %timeit initialization_based_v3(y)
...: %timeit initialization_based_v4(y)
...:
100 loops, best of 3: 3.16 ms per loop
100 loops, best of 3: 8.39 ms per loop
10 loops, best of 3: 27.2 ms per loop
100 loops, best of 3: 13.8 ms per loop
100 loops, best of 3: 3.11 ms per loop
In [322]: from sklearn.preprocessing import OneHotEncoder
In [323]: enc = OneHotEncoder(sparse=False)
In [324]: %timeit enc.fit_transform(np.where(y>=0, y, 0))
10 loops, best of 3: 77.3 ms per loop
A few simple observations to making this efficient:
Preallocate the result, rather than using concatenate
empty is faster than zeros if you're just going to overwrite those zeros
Use the out argument, to avoid temporaries
def sign_to_one_hot(x, dtype=np.float64):
    out = np.empty(x.shape + (2,), dtype=dtype)
    # column 0 marks -1 and column 1 marks +1, as in the question's expected output
    minus_one = out[...,0]
    plus_one = out[...,1]
    np.equal(x, 1, out=plus_one)
    np.subtract(1, plus_one, out=minus_one)
    return out
Choose your dtype carefully - casting because you chose the wrong one will incur a copy
You can also use sklearn.preprocessing.OneHotEncoder method.
NOTE: it doesn't accept negative numbers, so we have to replace them.
Demo:
from sklearn.preprocessing import OneHotEncoder
# by default it generates a sparse matrix - which might be very useful for huge data sets
enc = OneHotEncoder(sparse=False)
rslt = enc.fit_transform(np.where(Y>=0, Y, 0))
Result:
In [140]: rslt
Out[140]:
array([[ 0., 1.],
[ 1., 0.],
[ 1., 0.],
[ 0., 1.]])
Source array:
In [141]: Y
Out[141]:
array([[ 1],
[-1],
[-1],
[ 1]])
Pandas solution:
In [148]: pd.get_dummies(Y.ravel())
Out[148]:
-1 1
0 0 1
1 1 0
2 1 0
3 0 1
I start with an array a containing N unique values (product(a.shape) >= N).
I need to find the array b that holds, at each position, the index (0 .. N-1) of the corresponding element of a within the sorted list of unique values of a.
As an example
import numpy as np
np.random.seed(42)
a = np.random.choice([0.1,1.3,7,9.4], size=(4,3))
print(a)
prints a as
[[ 7. 9.4 0.1]
[ 7. 7. 9.4]
[ 0.1 0.1 7. ]
[ 1.3 7. 7. ]]
The unique values are [0.1, 1.3, 7.0, 9.4], so the required outcome b would be
[[2 3 0]
[2 2 3]
[0 0 2]
[1 2 2]]
(e.g. the value at a[0,0] is 7.; 7. has the index 2; thus b[0,0] == 2.)
Since numpy does not have an index function, I could do this using a loop. Either looping over the input array, like this:
u = np.unique(a).tolist()
af = a.flatten()
b = np.empty(len(af), dtype=int)
for i in range(len(af)):
    b[i] = u.index(af[i])
b = b.reshape(a.shape)
print(b)
or looping over the unique values as follows:
u = np.unique(a)
b = np.empty(a.shape, dtype=int)
for i in range(len(u)):
    b[np.where(a == u[i])] = i
print(b)
I suppose that the second way, looping over the unique values, is already more efficient than the first in cases where not all values in a are distinct. Still, it involves a loop and is rather inefficient compared to vectorized operations.
So my question is: what is the most efficient way of obtaining the array b filled with the indices of the unique values of a?
You could use np.unique with its optional argument return_inverse -
np.unique(a, return_inverse=1)[1].reshape(a.shape)
Sample run -
In [308]: a
Out[308]:
array([[ 7. , 9.4, 0.1],
[ 7. , 7. , 9.4],
[ 0.1, 0.1, 7. ],
[ 1.3, 7. , 7. ]])
In [309]: np.unique(a, return_inverse=1)[1].reshape(a.shape)
Out[309]:
array([[2, 3, 0],
[2, 2, 3],
[0, 0, 2],
[1, 2, 2]])
The source code of np.unique looks pretty efficient to me, but by pruning out the unnecessary parts we end up with another solution, like so -
def unique_return_inverse(a):
    ar = a.flatten()
    perm = ar.argsort()
    aux = ar[perm]
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    iflag = np.cumsum(flag) - 1
    inv_idx = np.empty(ar.shape, dtype=np.intp)
    inv_idx[perm] = iflag
    return inv_idx
Timings -
In [444]: a= np.random.randint(0,1000,(1000,400))
In [445]: np.allclose( np.unique(a, return_inverse=1)[1],unique_return_inverse(a))
Out[445]: True
In [446]: %timeit np.unique(a, return_inverse=1)[1]
10 loops, best of 3: 30.4 ms per loop
In [447]: %timeit unique_return_inverse(a)
10 loops, best of 3: 29.5 ms per loop
Not a great deal of improvement there over the built-in.
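Another way to express the same lookup, shown here only as a sketch, is a binary search of each element in the sorted unique values with `np.searchsorted`. The name `unique_indices` is made up for this example. It relies on every element of `a` being present in `np.unique(a)`, which holds by construction, and it has not been timed against the versions above.

```python
import numpy as np

def unique_indices(a):
    """Index of each element of a within the sorted unique values of a (hypothetical helper)."""
    u = np.unique(a)              # sorted unique values
    return np.searchsorted(u, a)  # result has the same shape as a

a = np.array([[7.0, 9.4, 0.1],
              [7.0, 7.0, 9.4],
              [0.1, 0.1, 7.0],
              [1.3, 7.0, 7.0]])
print(unique_indices(a))
# [[2 3 0]
#  [2 2 3]
#  [0 0 2]
#  [1 2 2]]
```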