Why is map_block function run twice? - python

I have a question why is map_block function run twice? When I run an example below:
import dask.array as da
import numpy as np
def derivative(x):
print(x.shape)
return x - np.roll(x, 1)
x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
d = da.from_array(x, chunks = 5)
y = d.map_blocks(derivative)
res = y.compute()
I obtain this output:
(1L,)
(5L,)
(4L,)
Since my chunks are ((5, 4),), I assume that derivative function has to be somehow run once before is really executed on these chunks, am I right?
I have python v2.7 and dask on v0.13.0.

If you do not supply a dtype to the map-blocks call then it will try running your function on a tiny sample dataset (hence the singleton shape). You can avoid this by passing a dtype explicitly if you know it.
y = d.map_blocks(derivative, dtype=d.dtype)

Related

Is there a numpy equivalent to pandas.apply?

I have a call which adds some random values to a pandas Series:
series = series.apply(lambda x: int(math.ceil(x + x * rand_value(range))))
For performance reasons I can't use the pandas.Series anymore and have to use numpy arrays instead.
Imagine my 1D-array data is stored in a, how would I transform the call from above to numpy? I read about np.vectorize but I don't understand how I would use this with my lambda and self-made function to call.
My Idea:
func = np.vectorize(lambda x: int(math.ceil(x + x * rand_value(range))))
a = func(a)
At first glance it looks like that both calls result in the same output, but I am not sure about that. Could you confirm this?
And is there a better way, than using np.vectorize()?
Edit: rand_value(range) is defined like that:
def rand_value(range):
# create value between [-1; 1)
rand = np.random.rand()*2.0 - 1.0;
rand = (rand * float(range)) / 100.0
return rand
So I can't use np.ceil, because then my function will only get called once (?) and have always the same rand values, what I need, is that for every value in my array the function gets called.
You can get more than one random value by passing a shape to np.random.rand(). Once you have exactly as many random values as your input array, you can use plain numpy functions
import numpy as np
def rand_value(range, shape=None):
if shape is None:
shape = tuple()
rand = np.random.rand(*shape) * 2.0 - 1.0
rand = rand * range / 100.0
return rand
data = np.arange(16)
# array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
rand_value(100.0, shape=data.shape)
# array([-0.0083601 , 0.90346962, -0.70813122, -0.73467017, 0.87514163,
# -0.29496392, 0.63828971, -0.10086984, -0.60248423, 0.26550601,
# -0.17577315, -0.95178997, 0.64123385, -0.54732105, 0.28590572,
# 0.19727859])
np.ceil(data + data * rand_value(100.0, shape=data.shape)).astype(int)
# array([ 0, 1, 4, 6, 8, 4, 9, 3, 4, 17, 10, 18, 16, 12, 16, 30])
You can create a randomized array using np.random.rand(). After that, you can apply to vectorize as pandas apply.
import numpy as np
np.random.rand(start_point, end_point) # generated array
np.vectorize(*your_function*)(*your_function_parameters*)
If you need an example please follow the function created by you
import numpy as np
def rand_value(range):
# create value between [-1; 1)
rand = np.random.rand()*2.0 - 1.0;
rand = (rand * float(range)) / 100.0
return rand
Your function name is rand_value and your parameter is range. Now, we try to use vectorize with constant value as 5 or randint(low, high, size) random integer value, etc...
result_5 = np.vectorize(rand_value)(5)
result_rand = np.vectorize(rand_value)(np.random.randint(5,100,1))

parallelize zonal computation on numpy array

I try to compute mode on all cells of the same zone (same value) on a numpy array. I give you an example of code below. In this example sequential approach works fine but multiprocessed approach does nothing. I do not find my mistake.
Does someone see my error ?
I would like to parallelize the computation because my real array is a 10k * 10k array with 1M zones.
import numpy as np
import scipy.stats as ss
import multiprocessing as mp
def zone_mode(i, a, b, output):
to_extract = np.where(a == i)
val = b[to_extract]
output[to_extract] = ss.mode(val)[0][0]
return output
def zone_mode0(i, a, b):
to_extract = np.where(a == i)
val = b[to_extract]
output = ss.mode(val)[0][0]
return output
np.random.seed(1)
zone = np.array([[1, 1, 1, 2, 3],
[1, 1, 2, 2, 3],
[4, 2, 2, 3, 3],
[4, 4, 5, 5, 3],
[4, 6, 6, 5, 5],
[6, 6, 6, 5, 5]])
values = np.random.randint(8, size=zone.shape)
output = np.zeros_like(zone).astype(np.float)
for i in np.unique(zone):
output = zone_mode(i, zone, values, output)
# for multiprocessing
zone0 = zone - 1
pool = mp.Pool(mp.cpu_count() - 1)
results = [pool.apply(zone_mode0, args=(u, zone0, values)) for u in np.unique(zone0)]
pool.close()
output = results[zone0]
For positve integers in the arrays - zone and values, we can use np.bincount. The basic idea is that we will consider zone and values as row and cols on a 2D grid. So, can map those to their linear index equivalent numbers. Those would be used as bins for binned summation with np.bincount. Their argmax IDs would be the mode numbers. They are mapped back to zone-grid with indexing into zone.
Hence, the solution would be -
m = zone.max()+1
n = values.max()+1
ids = zone*n + values
c = np.bincount(ids.ravel(),minlength=m*n).reshape(-1,n).argmax(1)
out = c[zone]
For sparsey data (well spread integers in the input arrays), we can look into sparse-matrix to get the argmax IDs c. Hence, with SciPy's sparse-matrix -
from scipy.sparse import coo_matrix
data = np.ones(zone.size,dtype=int)
r,c = zone.ravel(),values.ravel()
c = coo_matrix((data,(r,c))).argmax(1).A1
For slight perf. boost, specify the shape -
c = coo_matrix((data,(r,c)),shape=(m,n)).argmax(1).A1
Solving for generic values
We will make use of pandas.factorize, like so -
import pandas as pd
ids,unq = pd.factorize(values.flat)
v = ids.reshape(values.shape)
# .. same steps as earlier with bincount, using v in place of values
out = unq[c[zone]]
Note that for tie-cases, it would pick random element off values. If you want to pick the first one, use pd.factorize(values.flat, sort=True).

array/list/tuple of nparrays with variable length?

In Python, I have the following problem, made into a toy example:
import random
import numpy as np
x_arr = np.array([], dtype = object)
for x in range(5):
y_arr = np.array([], dtype=object)
for y in range(5):
r = random.random()
if r < 0.5:
y_arr = np.append(y_arr,y)
if random.random() < 0.9:
x_arr = np.append(x_arr, y_arr)
#This results in
>>> x_arr
array([4, 0, 1, 2, 4, 0, 3, 4], dtype=object)
I would like to have
array([array([4]), array([0, 1, 2, 4]), array([0, 3, 4]), dtype=object)
So apparently, in this run 3 out of 5 (variable) times the array $y_arr$ is written into $x_arr$, having lengths 1,4, and 3 (variable).
append() puts the results in one long 1D-structure, where I would like to keep it 2D. Also, considering the example, it might be that no numbers get written at all (if you are 'unlucky' with the random numbers). So i have an a priori unknown array of arrays with, each of those, a priori unknown number of elements. How would I approach this in Python, other than finding an upperbound on both and store a lot of zeros?
You might do it in a two step process? First add an element, then set the element. This circumvents the automatic flatten which happens in np.append() when axis=None (default behavior), as documented here.
import random
import numpy as np
x_arr = np.array([], dtype = object).reshape((1,0))
for x in range(5):
y_arr = np.array([], dtype=np.int32)
for y in range(5):
r = random.random()
if r < 0.5:
y_arr = np.append(y_arr,y)
if random.random() < 0.9:
x_arr = np.append(x_arr, 0)
x_arr[-1] = y_arr
print type(x_arr)
print x_arr
This gives:
<type 'numpy.ndarray'>
[array([0, 1, 2]) array([0, 1, 2, 3]) array([0, 1, 4]) array([0, 1, 3, 4])
array([2, 3])]
Also, why not use a python list for x_arr (or y_arr?). Nested numpy arrays are not really useful when they are not ndarrays.

Python NumPy: How to fill a matrix using an equation

I wish to initialise a matrix A, using the equation A_i,j = f(i,j) for some f (It's not important what this is).
How can I do so concisely avoiding a situation where I have two for loops?
numpy.fromfunction fits the bill here.
Example from doc:
>>> import numpy as np
>>> np.fromfunction(lambda i, j: i + j, (3, 3), dtype=int)
array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
One could also get the indexes of your array with numpy.indices and then apply the function f in a vectorized fashion,
import numpy as np
shape = 1000, 1000
Xi, Yj = np.indices(shape)
A = (2*Xi + 3*Yj).astype(np.int) # or any other function f(Xi, Yj)

generalized cumulative functions in NumPy/SciPy?

Is there a function in numpy or scipy (or some other library) that generalizes the idea of cumsum and cumprod to arbitrary function. For example, consider the (theoretical) function
cumf( func, array)
func is a function that accepts two floats, and returns a float. Particular cases
lambda x,y: x+y
and
lambda x,y: x*y
are cumsum and cumprod respectively. For example, if
func = lambda x,prev_x: x^2*prev_x
and I apply it to:
cumf(func, np.array( 1, 2, 3) )
I would like
np.array( 1, 4, 9*4 )
The ValueError above is still a bug using Numpy 1.20.1 (with Python 3.9.1).
Luckily a workaround was discovered that uses casting:
https://groups.google.com/forum/#!topic/numpy/JgUltPe2hqw
import numpy as np
uadd = np.frompyfunc(lambda x, y: x + y, 2, 1)
uadd.accumulate([1,2,3], dtype=object).astype(int)
# array([1, 3, 6])
Note that since the custom operation works on an object type, it won't benefit from the efficient memory management of numpy. So the operation may be slower than one that didn't need casting to object for extremely large arrays.
NumPy's ufuncs have accumulate():
In [22]: np.multiply.accumulate([[1, 2, 3], [4, 5, 6]], axis=1)
Out[22]:
array([[ 1, 2, 6],
[ 4, 20, 120]])
Unfortunately, calling accumulate() on a frompyfunc()'ed Python function fails with a strange error:
In [32]: uadd = np.frompyfunc(lambda x, y: x + y, 2, 1)
In [33]: uadd.accumulate([1, 2, 3])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
ValueError: could not find a matching type for <lambda> (vectorized).accumulate,
requested type has type code 'l'
This is using NumPy 1.6.1 with Python 2.7.3.

Categories