Is there a numpy equivalent to pandas.apply?

I have a call which adds some random values to a pandas Series:
series = series.apply(lambda x: int(math.ceil(x + x * rand_value(range))))
For performance reasons I can't use pandas.Series anymore and have to use numpy arrays instead.
Suppose my 1D array is stored in a. How would I translate the call above to numpy? I read about np.vectorize, but I don't understand how to use it with my lambda and my own function.
My idea:
func = np.vectorize(lambda x: int(math.ceil(x + x * rand_value(range))))
a = func(a)
At first glance it looks like both calls produce the same output, but I am not sure about that. Could you confirm this?
And is there a better way than using np.vectorize()?
Edit: rand_value(range) is defined like this:
def rand_value(range):
    # create a value in [-1, 1)
    rand = np.random.rand() * 2.0 - 1.0
    # scale it to +/- range percent
    rand = (rand * float(range)) / 100.0
    return rand
So I can't use np.ceil directly, because then my function would be called only once (?) and every element would get the same random value. What I need is for the function to be called for every value in my array.

You can get more than one random value by passing a shape to np.random.rand(). Once you have exactly as many random values as your input array has elements, you can use plain numpy functions:
import numpy as np
def rand_value(range, shape=None):
    if shape is None:
        shape = tuple()
    rand = np.random.rand(*shape) * 2.0 - 1.0
    rand = rand * range / 100.0
    return rand
data = np.arange(16)
# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
rand_value(100.0, shape=data.shape)
# array([-0.0083601 , 0.90346962, -0.70813122, -0.73467017, 0.87514163,
# -0.29496392, 0.63828971, -0.10086984, -0.60248423, 0.26550601,
# -0.17577315, -0.95178997, 0.64123385, -0.54732105, 0.28590572,
# 0.19727859])
np.ceil(data + data * rand_value(100.0, shape=data.shape)).astype(int)
# array([ 0, 1, 4, 6, 8, 4, 9, 3, 4, 17, 10, 18, 16, 12, 16, 30])
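The same idea can be wrapped into a single helper. This is only a sketch: the name perturb_ceil and its pct and rng parameters are made up for illustration, and it assumes the newer np.random.default_rng generator instead of np.random.rand. Each element gets its own independent draw, which is what the question asks for.

```python
import numpy as np

def perturb_ceil(a, pct, rng=None):
    """Scale every element by its own random factor in [-pct%, +pct%) and round up."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a, dtype=float)
    # one independent draw in [-1, 1) per element, scaled to a percentage
    factors = rng.uniform(-1.0, 1.0, size=a.shape) * pct / 100.0
    return np.ceil(a + a * factors).astype(int)

data = np.arange(16)
result = perturb_ceil(data, 100.0, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the result reproducible, which is handy when comparing against the pandas version.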

You can create an array of random values using np.random.rand(). After that, you can use np.vectorize in much the same way as pandas apply.
import numpy as np
np.random.rand(n_rows, n_cols) # array of shape (n_rows, n_cols) with values in [0, 1)
As an example, take the function you created:
def rand_value(range):
    # create a value in [-1, 1)
    rand = np.random.rand() * 2.0 - 1.0
    rand = (rand * float(range)) / 100.0
    return rand
Your function name is rand_value and its parameter is range. Now we can use np.vectorize with a constant value such as 5, or with random integers from randint(low, high, size), etc.:
result_5 = np.vectorize(rand_value)(5)
result_rand = np.vectorize(rand_value)(np.random.randint(5,100,1))
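To address the original worry directly, here is a small sketch that counts how often the random helper runs under np.vectorize. The call counter and the perturb wrapper are illustrative additions, not part of the question. One detail worth knowing: without otypes, np.vectorize makes one extra trial call on the first element to infer the output type, so otypes is passed explicitly here.

```python
import math
import numpy as np

calls = {"n": 0}  # illustrative counter, only here to show the per-element calls

def rand_value(pct):
    calls["n"] += 1
    # value in [-pct/100, pct/100)
    return (np.random.rand() * 2.0 - 1.0) * float(pct) / 100.0

def perturb(x):
    return int(math.ceil(x + x * rand_value(10)))

a = np.arange(1, 9) * 100
# otypes fixes the output dtype and avoids the extra type-inference call
vec = np.vectorize(perturb, otypes=[int])
out = vec(a)
```

So np.vectorize does call the function once per element, but it is still a Python-level loop and will not be faster than Series.apply by much.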


Numpy convolving along an axis for 2 2D-arrays

I have 2 2D-arrays. I am trying to convolve along the axis 1. np.convolve doesn't provide the axis argument. The answer here, convolves 1 2D-array with a 1D array using np.apply_along_axis. But it cannot be directly applied to my use case. The question here doesn't have an answer.
MWE is as follows.
import numpy as np
a = np.random.randint(0, 5, (2, 5))
array([[4, 2, 0, 4, 3],
[2, 2, 2, 3, 1]])
b = np.random.randint(0, 5, (2, 2))
array([[4, 3],
[4, 0]])
# What I want
c = np.convolve(a, b, axis=1) # axis is not supported as an argument
array([[16, 20, 6, 16, 24, 9],
[ 8, 8, 8, 12, 4, 0]])
I know I can do it using np.fft.fft, but it seems like an unnecessary step to get a simple thing done. Is there a simple way to do this? Thanks.
Why not just do a list comprehension with zip?
>>> np.array([np.convolve(x, y) for x, y in zip(a, b)])
array([[16, 20, 6, 16, 24, 9],
[ 8, 8, 8, 12, 4, 0]])
Or, for this two-row case, with scipy.signal.convolve2d (rows 0 and 2 of the full 2D result are exactly the row-wise convolutions):
>>> from scipy.signal import convolve2d
>>> convolve2d(a, b)[[0, 2]]
array([[16, 20, 6, 16, 24, 9],
[ 8, 8, 8, 12, 4, 0]])
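One possibility is to go manually to the Fourier spectrum and back. Zero-pad both inputs to the full output length and round the real part of the result (taking the absolute value would lose the sign of negative results):
n = a.shape[1] + b.shape[1] - 1
np.rint(np.fft.ifft(np.fft.fft(a, n=n) * np.fft.fft(b, n=n)).real).astype(int)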
# array([[16, 20, 6, 16, 24, 9],
# [ 8, 8, 8, 12, 4, 0]])
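A slightly more defensive sketch of the FFT route, assuming real-valued 2D inputs with the same number of rows. The function name fft_convolve_rows is made up here. It uses rfft/irfft, pads to the full linear-convolution length, and rounds only when both inputs are integer arrays.

```python
import numpy as np

def fft_convolve_rows(a, b):
    """Row-wise full linear convolution of two real 2D arrays via the FFT."""
    a = np.asarray(a)
    b = np.asarray(b)
    # length of a full linear convolution along axis 1
    n = a.shape[1] + b.shape[1] - 1
    spec = np.fft.rfft(a, n=n, axis=1) * np.fft.rfft(b, n=n, axis=1)
    out = np.fft.irfft(spec, n=n, axis=1)
    if np.issubdtype(a.dtype, np.integer) and np.issubdtype(b.dtype, np.integer):
        # integer inputs give integer results up to floating-point error
        return np.rint(out).astype(np.int64)
    return out

a = np.array([[4, 2, 0, 4, 3], [2, 2, 2, 3, 1]])
b = np.array([[4, 3], [4, 0]])
c = fft_convolve_rows(a, b)
```

For short rows like these the plain np.convolve loop is likely faster; the FFT pays off only when the rows are long.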
Would it be considered too ugly to loop over the orthogonal dimension? That would not add much overhead unless the main dimension is very short. Creating the output array ahead of time ensures that no memory needs to be copied about.
def convolvesecond(a, b):
    N1, L1 = a.shape
    N2, L2 = b.shape
    if N1 != N2:
        raise ValueError("Not compatible")
    c = np.zeros((N1, L1 + L2 - 1), dtype=a.dtype)
    for n in range(N1):
        c[n,:] = np.convolve(a[n,:], b[n,:], 'full')
    return c
For the generic case (convolving along the k-th axis of a pair of multidimensional arrays), I would resort to a pair of helper functions I always keep on hand to convert multidimensional problems to the basic 2d case:
def semiflatten(x, d=0):
    '''SEMIFLATTEN - Permute and reshape an array to convenient matrix form
    y, s = SEMIFLATTEN(x, d) permutes and reshapes the arbitrary array X so
    that input dimension D (default: 0) becomes the second dimension of the
    output, and all other dimensions (if any) are combined into the first
    dimension of the output. The output is always 2-D, even if the input is
    only 1-D.
    If D<0, dimensions are counted from the end.
    Return value S can be used to invert the operation using SEMIUNFLATTEN.
    This is useful to facilitate looping over arrays with unknown shape.'''
    x = np.array(x)
    shp = x.shape
    ndims = x.ndim
    if d<0:
        d = ndims + d
    perm = list(range(ndims))
    # move axis D to the end, keeping the other axes in order
    perm.pop(d)
    perm.append(d)
    y = np.transpose(x, perm)
    # Y has the original D-th axis last, preceded by the other axes, in order
    rest = np.array(shp, int)[perm[:-1]]
    y = np.reshape(y, [np.prod(rest), y.shape[-1]])
    return y, (d, rest)
def semiunflatten(y, s):
    '''SEMIUNFLATTEN - Reverse the operation of SEMIFLATTEN
    x = SEMIUNFLATTEN(y, s), where Y, S are as returned from SEMIFLATTEN,
    reverses the reshaping and permutation.'''
    d, rest = s
    x = np.reshape(y, np.append(rest, y.shape[-1]))
    perm = list(range(x.ndim))
    # take the last axis out of the list and put it back at position D
    perm.pop()
    perm.insert(d, x.ndim-1)
    x = np.transpose(x, perm)
    return x
(Note that transpose returns a view, and reshape avoids a copy whenever the memory layout allows it, so these functions are fast.)
With those, the generic form can be written as:
def convolvealong(a, b, axis=-1):
    a, S1 = semiflatten(a, axis)
    b, S2 = semiflatten(b, axis)
    c = convolvesecond(a, b)
    return semiunflatten(c, S1)
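For comparison, the same "flatten to 2D, loop over rows, restore the shape" idea can be sketched with np.moveaxis instead of hand-written permutations. This is an alternative under my own naming (convolve_along_axis), not the helper pair above, and it assumes both arrays agree on every axis except the one being convolved.

```python
import numpy as np

def convolve_along_axis(a, b, axis=-1):
    """Full convolution of a and b along one axis; all other axes must match."""
    a = np.asarray(a)
    b = np.asarray(b)
    a2 = np.moveaxis(a, axis, -1)   # bring the target axis to the end
    b2 = np.moveaxis(b, axis, -1)
    if a2.shape[:-1] != b2.shape[:-1]:
        raise ValueError("all axes except `axis` must have equal length")
    lead = a2.shape[:-1]
    rows_a = a2.reshape(-1, a2.shape[-1])
    rows_b = b2.reshape(-1, b2.shape[-1])
    n_out = rows_a.shape[1] + rows_b.shape[1] - 1
    out = np.empty((rows_a.shape[0], n_out), dtype=np.result_type(a, b))
    for i in range(rows_a.shape[0]):
        out[i] = np.convolve(rows_a[i], rows_b[i])
    # restore the leading axes and move the convolved axis back in place
    return np.moveaxis(out.reshape(lead + (n_out,)), -1, axis)

a = np.array([[4, 2, 0, 4, 3], [2, 2, 2, 3, 1]])
b = np.array([[4, 3], [4, 0]])
c = convolve_along_axis(a, b, axis=1)
```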

numpy create array of the max of consecutive pairs in another array

I have a numpy array:
A = np.array([8, 2, 33, 4, 3, 6])
What I want is to create another array B where each element is the maximum of two consecutive elements of A, so I get:
B = np.array([8, 33, 33, 4, 6])
Any ideas on how to implement?
Any ideas on how to implement this for more than 2 elements (the same thing, but for n consecutive elements)?
The answers gave me a way to solve this, but for the n-size window case, is there a more efficient way that does not require loops?
It turns out that the question is equivalent to asking how to perform 1D max-pooling of a list with a window of size n.
Does anyone know how to implement this efficiently?
One solution to the pairwise problem is using the np.maximum function and array slicing:
B = np.maximum(A[:-1], A[1:])
A solution without an explicit for loop is to apply max to the windows created by skimage.util.view_as_windows (note that map still iterates over the windows at the Python level):
list(map(max, view_as_windows(A, (2,))))
[8, 33, 33, 4, 6]
Copy/pastable example:
import numpy as np
from skimage.util import view_as_windows
A = np.array([8, 2, 33, 4, 3, 6])
list(map(max, view_as_windows(A, (2,))))
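If scikit-image is not available, a similar windowed view can be built with NumPy alone. The sketch below assumes NumPy 1.20 or newer, where numpy.lib.stride_tricks.sliding_window_view was added; the wrapper name sliding_max is just for illustration. Taking max over the last axis keeps the reduction inside NumPy instead of mapping Python's max over the windows.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sliding_max(a, n):
    """Maximum over every window of n consecutive elements of a 1D array."""
    # the view has shape (len(a) - n + 1, n) and shares memory with a
    return sliding_window_view(np.asarray(a), n).max(axis=-1)

A = np.array([8, 2, 33, 4, 3, 6])
B2 = sliding_max(A, 2)
B3 = sliding_max(A, 3)
```

This still does work proportional to the window size for every output element, so it is convenient rather than optimal for large windows.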
Here is an approach specifically tailored for larger windows. It is O(1) in window size and O(n) in data size.
I've done a pure numpy and a pythran implementation.
How do we achieve O(1) in window size? We use a "sawtooth" trick: If w is the window width we group the data into lots of w and for each group we do the cumulative maximum from left to right and from right to left. The elements of any in-between window distribute over two groups and the maxima of the intersections are among the cumulative maxima we have computed earlier. So we need a total of 3 comparisons per data point.
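To make the trick concrete, here is a deliberately simple sketch of it. It is not the optimized pp function from this answer: it pads the data to a multiple of w (an assumption made for readability) and computes both cumulative maxima explicitly. The name sliding_max_sawtooth is made up.

```python
import numpy as np

def sliding_max_sawtooth(a, w):
    """Sliding maximum with a fixed number of comparisons per element."""
    a = np.asarray(a)
    n = a.size
    if not 1 <= w <= n:
        raise ValueError("window must satisfy 1 <= w <= len(a)")
    # pad to a multiple of w with the global minimum, which can never win a max
    pad = (-n) % w
    padded = np.concatenate([a, np.full(pad, a.min(), dtype=a.dtype)])
    blocks = padded.reshape(-1, w)
    # left[i]: max from the start of i's block up to i
    left = np.maximum.accumulate(blocks, axis=1).ravel()
    # right[i]: max from i up to the end of i's block
    right = np.maximum.accumulate(blocks[:, ::-1], axis=1)[:, ::-1].ravel()
    # a window [i, i+w-1] is the tail of one block plus the head of the next
    return np.maximum(right[:n - w + 1], left[w - 1:n])

A = np.array([8, 2, 33, 4, 3, 6])
B = sliding_max_sawtooth(A, 3)
```

The three comparisons per data point are the two cumulative maxima plus the final np.maximum that combines them.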
Timings with benchit (thanks @Divakar) for w=100; my functions are pp (numpy) and winmax (pythran): [plot not shown]
For small window size w=5 the picture is more even: [plot not shown]
Interestingly, pythran still has a huge edge even for very small sizes. They must be doing something right to minimize call overhead.
python code:
cummax = np.maximum.accumulate
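def pp(a,w):
    N = a.size//w
    # right-to-left cumulative maxima within each group of w elements
    if a.size-w+1 > N*w:
        out = np.empty(a.size-w+1,a.dtype)
        out[:-1] = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-1:-1]
        out[-1] = a[w*N:].max()
    else:
        out = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-2:-1]
    # combine with the left-to-right cumulative maxima of the following group
    out[1:N*w-w+1] = np.maximum(out[1:N*w-w+1],
                                cummax(a[w:w*N].reshape(N-1,w),axis=1).ravel())
    out[N*w-w+1:] = np.maximum(out[N*w-w+1:],cummax(a[N*w:]))
    return out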
pythran version; compile with pythran -O3 <>; this creates a compiled module which you can import:
import numpy as np
# pythran export winmax(float[:],int)
# pythran export winmax(int[:],int)
def winmax(data,winsz):
    N = data.size//winsz
    if N < 1:
        raise ValueError
    out = np.empty(data.size-winsz+1,data.dtype)
    nxt = winsz
    # forward pass: running maximum restarted at every group boundary
    for j in range(winsz,data.size):
        if j == nxt:
            nxt += winsz
            out[j+1-winsz] = data[j]
        else:
            out[j+1-winsz] = out[j-winsz] if out[j-winsz]>data[j] else data[j]
    running = data[-winsz:N*winsz].max()
    nxt -= winsz << (nxt > data.size)
    # backward pass: combine with the running maximum from the right
    for j in range(data.size-winsz,0,-1):
        if j == nxt:
            nxt -= winsz
            running = data[j-1]
        else:
            running = data[j] if data[j] > running else running
            out[j] = out[j] if out[j] > running else running
    out[0] = data[0] if data[0] > running else running
    return out
In this Q&A, we are basically asking for sliding max values. This has been explored before in "Max in a sliding window in NumPy array". Since we are looking to be efficient, we can look further. One option is numba; here are the two final variants I ended up with. Both use the parallel directive, which boosts performance over a version without it:
import numpy as np
from numba import njit, prange
@njit(parallel=True)
def numba1(a, W):
    L = len(a)-W+1
    out = np.empty(L, dtype=a.dtype)
    v = np.iinfo(a.dtype).min
    for i in prange(L):
        max1 = v
        for j in range(W):
            cur = a[i + j]
            if cur>max1:
                max1 = cur
        out[i] = max1
    return out
@njit(parallel=True)
def numba2(a, W):
    L = len(a)-W+1
    out = np.empty(L, dtype=a.dtype)
    for i in prange(L):
        # np.empty leaves the buffer uninitialised, so seed it with the first window element
        out[i] = a[i]
        for j in range(1, W):
            cur = a[i + j]
            if cur>out[i]:
                out[i] = cur
    return out
From the earlier linked Q&A, the equivalent SciPy version would be -
from scipy.ndimage import maximum_filter1d
def scipy_max_filter1d(a, W):
    L = len(a)-W+1
    hW = W//2 # Half window size
    return maximum_filter1d(a,size=W)[hW:hW+L]
Other posted working approaches for a generic window argument:
from skimage.util import view_as_windows
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
# @mathfux's soln
def npmax_strided(a,n):
    return np.max(rolling(a, n), axis=1)
# @Nicolas Gervais's soln
def mapmax_strided(a, W):
    return list(map(max, view_as_windows(a,W)))
cummax = np.maximum.accumulate
def pp(a,w):
    N = a.size//w
    if a.size-w+1 > N*w:
        out = np.empty(a.size-w+1,a.dtype)
        out[:-1] = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-1:-1]
        out[-1] = a[w*N:].max()
    else:
        out = cummax(a[w*N-1::-1].reshape(N,w),axis=1).ravel()[:w-a.size-2:-1]
    out[1:N*w-w+1] = np.maximum(out[1:N*w-w+1],
                                cummax(a[w:w*N].reshape(N-1,w),axis=1).ravel())
    out[N*w-w+1:] = np.maximum(out[N*w-w+1:],cummax(a[N*w:]))
    return out
Using benchit package (few benchmarking tools packaged together; disclaimer: I am its author) to benchmark proposed solutions.
import benchit
funcs = [mapmax_strided, npmax_strided, numba1, numba2, scipy_max_filter1d, pp]
in_ = {(n,W):(np.random.randint(0,100,n),W) for n in 10**np.arange(2,6) for W in [2, 10, 20, 50, 100]}
t = benchit.timings(funcs, in_, multivar=True, input_name=['Array-length', 'Window-length'])
t.plot(logx=True, sp_ncols=1, save='timings.png')
So, the numba versions are great for window sizes below 10. Around 10 there is no clear winner, and for larger window sizes pp wins, with the SciPy version in second place.
For n consecutive items, the extended solution requires a loop over the n shifted slices. Since np.maximum takes only two input arrays, reduce over the list of slices:
np.maximum.reduce([A[i:len(A)-n+i+1] for i in range(n)])
To avoid that, you can use stride tricks and turn A into an array of n-length windows:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
For example:
>>> rolling(A, 3)
array([[ 8,  2, 33],
       [ 2, 33,  4],
       [33,  4,  3],
       [ 4,  3,  6]])
After that, you can finish with np.max(rolling(A, n), axis=1).
Despite its elegance, neither this solution nor the first one is efficient, because we repeatedly take the maximum over adjacent windows that differ by only two items.
A recursive solution that works for any n:
import numpy as np
def recursive(a: np.ndarray, n: int, b=None, level=2):
    if n <= 0 or n > len(a):
        raise ValueError(f'len(a):{len(a)} n:{n}')
    if n == 1:
        return a
    if len(a) == n:
        return np.max(a)
    b = np.maximum(a[:-1], a[1:]) if b is None else np.maximum(a[level - 1:], b)
    if n == level:
        return b
    return recursive(a, n, b[:-1], level + 1)
test_data = np.array([8, 2, 33, 4, 3, 6])
for test_n in range(1, len(test_data) + 2):
    try:
        print(recursive(test_data, n=test_n))
    except ValueError as e:
        print(e)
Output:
[ 8  2 33  4  3  6]
[ 8 33 33  4  6]
[33 33 33  6]
[33 33 33]
[33 33]
33
len(a):6 n:7
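About the recursive function:
Look at the following data and you will see how the recursion works. Each level takes the previous result without its last element and combines it with the input shifted by one more position.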
np.array([8, 2, 33, 4, 3, 6])
n=2: (8, 2), (2, 33), (33, 4), (4, 3), (3, 6) => [8, 33, 33, 4, 6] => B' = [8, 33, 33, 4]
n=3: (8, 2, 33), (2, 33, 4), (33, 4, 3), (4, 3, 6) => B' [33, 4, 3, 6] => np.maximum([8, 33, 33, 4], [33, 4, 3, 6]) => 33, 33, 33, 6
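The same observation can also be written as a plain loop, which avoids the recursion depth growing with n. This is a sketch of the idea traced above, not the author's code, and the name sliding_max_iterative is made up: at each level the previous result loses its last element and is combined with the input shifted one position further.

```python
import numpy as np

def sliding_max_iterative(a, n):
    """Sliding maximum over n consecutive elements, built up one level at a time."""
    a = np.asarray(a)
    if n <= 0 or n > len(a):
        raise ValueError(f'len(a):{len(a)} n:{n}')
    b = a.copy()
    for level in range(1, n):
        # b holds window size `level`; extend each window by one element to the right
        b = np.maximum(b[:-1], a[level:])
    return b

test_data = np.array([8, 2, 33, 4, 3, 6])
res2 = sliding_max_iterative(test_data, 2)
res3 = sliding_max_iterative(test_data, 3)
```

Like the recursive version, this does n - 1 vectorized passes over the data, so the cost still grows linearly with the window size.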
Using Pandas:
import pandas as pd
A = pd.Series([8, 2, 33, 4, 3, 6])
res = pd.concat([A,A.shift(-1)],axis=1).max(axis=1,skipna=False).dropna()
0 8.0
1 33.0
2 33.0
3 4.0
4 6.0
Or using numpy:

Optimize the python function with numpy without using the for loop

I have the following python function:
def npnearest(u: np.ndarray, X: np.ndarray, Y: np.ndarray, distance: 'callable'=npdistance):
    '''
    Finds x1 so that x1 is in X and u and x1 have a minimal distance (according to the
    provided distance function) compared to all other data points in X. Returns the label of x1

    Args:
        u (np.ndarray): The vector (ndim=1) we want to classify
        X (np.ndarray): A matrix (ndim=2) with training data points (vectors)
        Y (np.ndarray): A vector containing the label of each data point in X
        distance (callable): A function that receives two inputs and defines the distance function used

    Returns:
        int: The label of the data point which is closest to `u`
    '''
    xbest = None
    ybest = None
    dbest = float('inf')
    for x, y in zip(X, Y):
        d = distance(u, x)
        if d < dbest:
            ybest = y
            xbest = x
            dbest = d
    return ybest
Where, npdistance simply gives distance between two points i.e.
def npdistance(x1, x2):
I want to optimize npnearest by performing nearest neighbor search directly in numpy. This means that the function cannot use for/while loops.
Since you don't need to use that exact function, you can simply change the sum to work over a particular axis. This gives a new array with one value per data point, and you can call argmin on it to get the index of the minimum. Use that index to look up your label:
import numpy as np
def npdistance_idx(x1, x2):
    return np.argmin(np.sum((x1-x2)**2, axis=1))
Y = ["label 0", "label 1", "label 2", "label 3"]
u = np.array([[1, 5.5]])
X = np.array([[1,2], [1, 5], [0, 0], [7, 7]])
idx = npdistance_idx(X, u)
print(Y[idx]) # label 1
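Put together in the shape of the original function, a loop-free version could look like the sketch below. It assumes squared Euclidean distance, since the body of npdistance is not shown in the question, and the name npnearest_vectorized is made up. Broadcasting subtracts the 1D vector u from every row of X in one step.

```python
import numpy as np

def npnearest_vectorized(u, X, Y):
    """Return the label of the row of X closest to u (squared Euclidean distance assumed)."""
    u = np.asarray(u)
    X = np.asarray(X)
    Y = np.asarray(Y)
    # (n, d) - (d,) broadcasts u across all n rows
    sq_dist = np.sum((X - u) ** 2, axis=1)
    return Y[np.argmin(sq_dist)]

Y = np.array(["label 0", "label 1", "label 2", "label 3"])
u = np.array([1, 5.5])
X = np.array([[1, 2], [1, 5], [0, 0], [7, 7]])
label = npnearest_vectorized(u, X, Y)
```

The square root is skipped because it does not change which point is nearest.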
NumPy supports vectorized operations (broadcasting).
This means you can pass in arrays and operations will be applied to entire arrays in an optimized way (SIMD - single instruction, multiple data).
You can then get the index of the array minimum using .argmin().
Hope this helps
In [9]: numbers = np.arange(10); numbers
Out[9]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [10]: numbers -= 5; numbers
Out[10]: array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4])
In [11]: numbers = np.power(numbers, 2); numbers
Out[11]: array([25, 16, 9, 4, 1, 0, 1, 4, 9, 16])
In [12]: numbers.argmin()
Out[12]: 5

parallelize zonal computation on numpy array

I am trying to compute the mode over all cells of the same zone (same value) in a numpy array. An example is given below. In this example the sequential approach works fine, but the multiprocessing approach does nothing. I cannot find my mistake.
Does anyone see my error?
I would like to parallelize the computation because my real array is a 10k * 10k array with 1M zones.
import numpy as np
import scipy.stats as ss
import multiprocessing as mp
def zone_mode(i, a, b, output):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output[to_extract] = ss.mode(val)[0][0]
    return output
def zone_mode0(i, a, b):
    to_extract = np.where(a == i)
    val = b[to_extract]
    output = ss.mode(val)[0][0]
    return output
zone = np.array([[1, 1, 1, 2, 3],
[1, 1, 2, 2, 3],
[4, 2, 2, 3, 3],
[4, 4, 5, 5, 3],
[4, 6, 6, 5, 5],
[6, 6, 6, 5, 5]])
values = np.random.randint(8, size=zone.shape)
output = np.zeros_like(zone).astype(float)
for i in np.unique(zone):
    output = zone_mode(i, zone, values, output)
# for multiprocessing
zone0 = zone - 1
pool = mp.Pool(mp.cpu_count() - 1)
results = [pool.apply(zone_mode0, args=(u, zone0, values)) for u in np.unique(zone0)]
output = results[zone0]
For non-negative integers in the arrays zone and values, we can use np.bincount. The basic idea is to treat zone and values as row and column indices on a 2D grid and map each pair to its linear index. These indices serve as bins for counting with np.bincount. The argmax of each row of counts is the mode of that zone, and indexing with zone maps the modes back onto the zone grid.
Hence, the solution would be -
m = zone.max()+1
n = values.max()+1
ids = zone*n + values
c = np.bincount(ids.ravel(),minlength=m*n).reshape(-1,n).argmax(1)
out = c[zone]
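As a worked check, the sketch below wraps those steps in a function (the name zonal_mode is made up), applies it to the zone array from the question with seeded random values, and leaves the per-zone brute-force comparison to the reader. It assumes both arrays hold non-negative integers.

```python
import numpy as np

def zonal_mode(zone, values):
    """Replace every cell by the most frequent value within its zone."""
    n = values.max() + 1          # number of distinct value bins
    m = zone.max() + 1            # number of zone bins
    ids = zone * n + values       # linear index into an (m, n) count table
    counts = np.bincount(ids.ravel(), minlength=m * n).reshape(m, n)
    # argmax per row gives the mode of each zone; ties go to the smallest value
    return counts.argmax(axis=1)[zone]

zone = np.array([[1, 1, 1, 2, 3],
                 [1, 1, 2, 2, 3],
                 [4, 2, 2, 3, 3],
                 [4, 4, 5, 5, 3],
                 [4, 6, 6, 5, 5],
                 [6, 6, 6, 5, 5]])
values = np.random.default_rng(0).integers(0, 8, size=zone.shape)
out = zonal_mode(zone, values)
```

The count table has m * n entries, so this dense version is only practical when the number of zones times the number of distinct values fits in memory.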
For sparse data (widely spread integers in the input arrays), we can use a sparse matrix to get the argmax IDs c. With SciPy's sparse matrices:
from scipy.sparse import coo_matrix
data = np.ones(zone.size,dtype=int)
r,c = zone.ravel(),values.ravel()
c = coo_matrix((data,(r,c))).argmax(1).A1
For slight perf. boost, specify the shape -
c = coo_matrix((data,(r,c)),shape=(m,n)).argmax(1).A1
Solving for generic values
We will make use of pandas.factorize, like so -
import pandas as pd
ids,unq = pd.factorize(values.flat)
v = ids.reshape(values.shape)
# .. same steps as earlier with bincount, using v in place of values
out = unq[c[zone]]
Note that in case of ties, this picks the tied value that appears first in values. If you want the smallest tied value instead, use pd.factorize(values.flat, sort=True).
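The same re-coding step can be sketched without pandas by swapping in np.unique(..., return_inverse=True) for pd.factorize. This is a substitution, not the approach above: np.unique always sorts, so ties go to the smallest value. The function name zonal_mode_generic is made up.

```python
import numpy as np

def zonal_mode_generic(zone, values):
    """Zonal mode for arbitrary (sortable) values; zone must hold non-negative integers."""
    # re-code the values as integer ids 0..n-1 (sorted order)
    uniq, inv = np.unique(values.ravel(), return_inverse=True)
    codes = inv.reshape(values.shape)
    n = uniq.size
    m = zone.max() + 1
    counts = np.bincount((zone * n + codes).ravel(), minlength=m * n).reshape(m, n)
    # map the winning ids back to the original values
    return uniq[counts.argmax(axis=1)[zone]]

zone = np.array([[0, 0, 1],
                 [1, 1, 0]])
values = np.array([[2.5, 2.5, -1.0],
                   [-1.0, 7.0, 0.5]])
out = zonal_mode_generic(zone, values)
```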

Why is the map_blocks function run twice?

Why is the function passed to map_blocks run twice? When I run the example below:
import dask.array as da
import numpy as np
def derivative(x):
    return x - np.roll(x, 1)
x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
d = da.from_array(x, chunks = 5)
y = d.map_blocks(derivative)
res = y.compute()
I obtain this output:
Since my chunks are ((5, 4),), I assume that derivative function has to be somehow run once before is really executed on these chunks, am I right?
I have python v2.7 and dask on v0.13.0.
If you do not supply a dtype to the map_blocks call, dask will try running your function on a tiny sample dataset to infer it (hence the singleton shape). You can avoid this by passing a dtype explicitly if you know it.
y = d.map_blocks(derivative, dtype=d.dtype)
