Can this code with NumPy calculations be more efficient? - python

I want to take the natural log (log_e) of array a element-wise. If an element is non-positive, the result should be 0:
import numpy as np
a = np.array([-1,0,1,2])
b = np.zeros(len(a))
for i in range(len(a)):
    if a[i] <= 0:
        b[i] = 0
    else:
        b[i] = np.log(a[i])
To improve performance, I think the following is better, but then the warning RuntimeWarning: divide by zero encountered in log pops up. How can I still carry out the calculation I expect?
import numpy as np
a = np.array([0,0,1,2])
b = np.log(a)

Use np.where on a to mask non-positive numbers with 1, then apply np.log (since log(1) is 0, the masked entries come out as 0 and no warning is raised):
b = np.log(np.where(a>0, a, 1))
Output:
array([0. , 0. , 0. , 0.69314718])
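If you would rather keep a single np.log call on a itself, another option (not from the answer above) is to silence the warning with numpy's errstate context manager and let np.where clean up afterwards; a minimal sketch:

import numpy as np

a = np.array([-1, 0, 1, 2])
# Suppress the divide-by-zero / invalid-value warnings while taking the log,
# then replace the resulting -inf / nan entries with 0.
with np.errstate(divide='ignore', invalid='ignore'):
    b = np.where(a > 0, np.log(a), 0.0)
print(b)  # [0.         0.         0.         0.69314718]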

As a "ufunc", numpy.log accepts the parameters where and out. So an efficient method for your computation is as follows.
In [6]: a = np.array([-1, 0, 1, 2])
Create the output array.
In [7]: b = np.zeros(len(a))
Tell numpy.log to only compute the result where a > 0, and put the output in b. This returns the array given as out, and modifies out (i.e. b) in-place.
In [8]: np.log(a, where=a > 0, out=b)
Out[8]: array([0. , 0. , 0. , 0.69314718])
In [9]: b
Out[9]: array([0. , 0. , 0. , 0.69314718])
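Wrapped up as a small helper (safe_log is just a name chosen here for illustration), the where/out pattern from this answer might look like:

import numpy as np

def safe_log(a):
    # log(a) where a > 0; entries where the condition is False keep the
    # initial value of out, i.e. 0.
    out = np.zeros_like(a, dtype=float)
    np.log(a, where=a > 0, out=out)
    return out

print(safe_log(np.array([-1, 0, 1, 2])))  # [0.         0.         0.         0.69314718]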

Related

Min-max scaling along rows in numpy array

I have a numpy array and I want to rescale values along each row to values between 0 and 1 using the following procedure:
If the maximum value along a given row is X_max and the minimum value along that row is X_min, then the rescaled value (X_rescaled) of a given entry (X) in that row should become:
X_rescaled = (X - X_min)/(X_max - X_min)
As an example, let's consider the following array (arr):
arr = np.array([[1.0,2.0,3.0],[0.1, 5.1, 100.1],[0.01, 20.1, 1000.1]])
print arr
array([[  1.00000000e+00,   2.00000000e+00,   3.00000000e+00],
       [  1.00000000e-01,   5.10000000e+00,   1.00100000e+02],
       [  1.00000000e-02,   2.01000000e+01,   1.00010000e+03]])
Presently, I am trying to use MinMaxScaler from scikit-learn in the following way:
from sklearn.preprocessing import MinMaxScaler
result = MinMaxScaler(arr)
But I keep getting my initial array back, i.e. result turns out to be the same as arr. What am I doing wrong?
How can I scale the array arr in the manner that I require (min-max scaling along each row)? Thanks in advance.
MinMaxScaler is a bit clunky to use; sklearn.preprocessing.minmax_scale is more convenient. This operates along columns, so use the transpose:
>>> import numpy as np
>>> from sklearn import preprocessing
>>>
>>> a = np.random.random((3,5))
>>> a
array([[0.80161048, 0.99572497, 0.45944366, 0.17338664, 0.07627295],
       [0.54467986, 0.8059851 , 0.72999058, 0.08819178, 0.31421126],
       [0.51774372, 0.6958269 , 0.62931078, 0.58075685, 0.57161181]])
>>> preprocessing.minmax_scale(a.T).T
array([[0.78888024, 1.        , 0.41673812, 0.10562126, 0.        ],
       [0.63596033, 1.        , 0.89412757, 0.        , 0.314881  ],
       [0.        , 1.        , 0.62648851, 0.35384099, 0.30248836]])
>>>
>>> b = np.array([(4, 1, 5, 3), (0, 1.5, 1, 3)])
>>> preprocessing.minmax_scale(b.T).T
array([[0.75      , 0.        , 1.        , 0.5       ],
       [0.        , 0.5       , 0.33333333, 1.        ]])

How to speed up updating the value of each element in an array based on other elements in Python?

I am now working on the calculation shown below. I want to update the value of each element based on its adjacent elements. I am currently using two for loops, but it is very slow since there are several outer iterations. Is there any way to speed up this calculation?
for i in range(1, nx+1):
    for j in range(1, ny+1):
        p[i,j] = a*p[i-1,j] + b*p[i+1,j] + c*p[i,j-1] + d*p[i,j+1]
a, b, c, d are constants; p is a numpy array.
Sample input:
import numpy as np
p = np.ones((5,5))
for i in range(1,4):
    for j in range(1,4):
        p[i,j] = p[i-1,j] + p[i+1,j] + 2*p[i,j+1] + 2*p[i,j-1]
print(p)
The final output should be:
[[   1.    1.    1.    1.    1.]
 [   1.    6.   16.   36.    1.]
 [   1.   11.   41.  121.    1.]
 [   1.   16.   76.  276.    1.]
 [   1.    1.    1.    1.    1.]]
I don't have enough rep to comment and this doesn't fully answer the question, but if you are using NumPy you should definitely look at array broadcasting. It's hard to tell exactly what your code is doing, but broadcasting should make it much easier to update the full matrix at once instead of value by value.
We can at least get rid of one nested loop using np.cumsum. In favorable conditions (a large number of columns) this can give a 30-fold speedup. Sample run:
results equal True
original 31.644793 ms
optimized 0.861980 ms
Code:
import numpy as np
n, m = 50, 600
a, b, c, d = np.random.random((4,))
P = np.random.random((n, m))
def f_OP(P):
    p = P.copy()
    for i in range(1, n-1):
        for j in range(1, m-1):
            p[i,j] = a*p[i-1,j] + b*p[i+1,j] + c*p[i,j-1] + d*p[i,j+1]
    return p

def f_pp(P):
    p = P.copy()
    pp = d*p[1:-1, 2:] + b*p[2:, 1:-1]
    pp[0] += a*p[0, 1:-1]
    pp[:, 0] += c*p[1:-1, 0]
    x = np.full((m-2,), c)
    x[0] = 1
    x = np.cumprod(x)[::-1]
    pp = np.cumsum(pp * x, axis=1)
    for i in range(1, n-2):
        pp[i] += a * np.cumsum(pp[i-1])
    p[1:-1, 1:-1] = pp / x
    return p
print('results equal', np.allclose(f_OP(P), f_pp(P)))
from timeit import timeit
kwds = dict(globals=globals(), number=10)
print('original {:10.6f} ms'.format(timeit('f_OP(P)', **kwds)*100))
print('optimized {:10.6f} ms'.format(timeit('f_pp(P)', **kwds)*100))
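Not part of the answer above, but if adding a dependency is an option, another common route is to keep the original double loop and JIT-compile it with numba; a sketch under that assumption:

import numpy as np
from numba import njit  # assumes numba is installed

@njit
def f_numba(P, a, b, c, d):
    # Same sequential update as the original loops, compiled to machine code.
    p = P.copy()
    n, m = p.shape
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            p[i, j] = a*p[i-1, j] + b*p[i+1, j] + c*p[i, j-1] + d*p[i, j+1]
    return p

The first call pays a one-time compilation cost; subsequent calls run at roughly C speed.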

How to efficiently apply functions to values in an array based on condition?

I have an array arorg like this:
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
and another array values that looks as follows:
values = np.array([1., 0., 2.])
values has the same number of entries as arorg has columns.
Now I want to apply functions to the entries of arorg depending on whether they are positive or negative:
def neg_fun(val1, val2):
    return val1 / (val1 + abs(val2))

def pos_fun(val1, val2):
    return 1. / ((val1 / val2) + 1.)
Here, val2 is the (absolute) value in arorg and val1 (this is the tricky part) comes from values: if I apply pos_fun or neg_fun to column i of arorg, val1 should be values[i].
I currently implement that as follows:
ar = arorg.copy()
for (x, y) in zip(*np.where(ar > 0)):
    ar.itemset((x, y), pos_fun(values[y], ar.item(x, y)))
for (x, y) in zip(*np.where(ar < 0)):
    ar.itemset((x, y), neg_fun(values[y], ar.item(x, y)))
which gives me the desired output:
array([[ 0.5       ,  1.        ,  0.33333333],
       [ 0.33333333,  0.        ,  0.6       ]])
As I have to do these calculations very often, I am wondering whether there is a more efficient way of doing this. Something like
np.where(arorg > 0, pos_fun(xxxx), arorg)
would be great but I don't know how to pass the arguments correctly (the xxx). Any suggestions?
As hinted in the question, here's one using np.where.
First off, we are using a direct translation of the function implementation to generate values/arrays for both positive and negative cases. Then, with a mask of positive values, we will choose between those two arrays using np.where.
Thus, the implementation would look something along these lines -
# Get positive and negative values for all elements
val1 = values
val2 = arorg
neg_vals = val1 / (val1 + np.abs(val2))
pos_vals = 1. / ((val1 / val2) + 1.)
# Get a positive mask and choose between positive and negative values
pos_mask = arorg > 0
out = np.where(pos_mask, pos_vals, neg_vals)
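One caveat worth noting: np.where evaluates both branch arrays in full, so the branch that is ultimately discarded can still emit RuntimeWarnings (here, pos_vals hits a division by zero wherever values / arorg equals -1). A sketch that silences those spurious warnings with np.errstate:

import numpy as np

arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
values = np.array([1., 0., 2.])

# Both branches are computed for every element; errstate hides warnings
# coming from entries that only belong to the other branch.
with np.errstate(divide='ignore', invalid='ignore'):
    neg_vals = values / (values + np.abs(arorg))
    pos_vals = 1. / ((values / arorg) + 1.)
    out = np.where(arorg > 0, pos_vals, neg_vals)
print(out)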
You don't need to apply a function to zipped elements of arrays; you can accomplish the same thing through simple array operations and slicing.
First, compute the positive and negative results as full arrays. Then create a return array of zeros (just as a default value) and populate it using boolean masks into pos and neg:
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
values = np.array([1., 0., 2.])
pos = 1. / ((values / arorg) + 1)
neg = values / (values + np.abs(arorg))
ret = np.zeros_like(arorg)
ret[arorg>0] = pos[arorg>0]
ret[arorg<=0] = neg[arorg<=0]
ret
# returns:
array([[ 0.5       ,  1.        ,  0.33333333],
       [ 0.33333333,  0.        ,  0.6       ]])
import numpy as np
arorg = np.array([[-1., 2., -4.], [0.5, -1.5, 3]])
values = np.array([1., 0., 2.])
p = 1.0/(values/arorg+1)
n = values/(values+abs(arorg))
#using np.place to extract negative values and put them to p
np.place(p,arorg<0,n[arorg<0])
print(p)
[[ 0.5         1.          0.33333333]
 [ 0.33333333  0.          0.6       ]]

Conditional numpy cumulative sum

I'm looking for a way to calculate the cumulative sum with numpy, but don't want to roll forward the value (or set it to zero) in case the cumulative sum is very close to zero and negative.
For instance
a = np.asarray([0, 4999, -5000, 1000])
np.cumsum(a)
returns [0, 4999, -1, 999]
but I'd like to set the [2]-value (-1) to zero during the calculation. The problem is that this decision can only be made during the calculation, as the intermediate result isn't known a priori.
The expected array is: [0, 4999, 0, 1000]
The reason for this is that I'm getting very small values (floating point, not integers as in the example) which are due to floating point calculations which should in reality be zero. Calculating the cumulative sum compounds those values which leads to errors.
The Kahan summation algorithm could solve the problem. Unfortunately, it is not implemented in numpy. This means a custom implementation is required:
def kahan_cumsum(x):
    x = np.asarray(x)
    cumulator = np.zeros_like(x)
    compensation = 0.0
    cumulator[0] = x[0]
    for i in range(1, len(x)):
        y = x[i] - compensation
        t = cumulator[i - 1] + y
        compensation = (t - cumulator[i - 1]) - y
        cumulator[i] = t
    return cumulator
I have to admit, this is not exactly what was asked for in the question. (A value of -1 at the 3rd output of the cumsum is correct in the example). However, I hope this solves the actual problem behind the question, which is related to floating point precision.
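For the literal behaviour described in the question (snap the running sum to zero whenever it dips just below zero), a plain Python loop is a simple, if unvectorised, sketch; clamped_cumsum and its eps threshold are names introduced here for illustration, not from any answer:

import numpy as np

def clamped_cumsum(x, eps=1e-9):
    # Cumulative sum that treats a running total in (-eps, 0) as
    # floating-point noise and resets it to exactly zero.
    out = np.empty(len(x), dtype=float)
    total = 0.0
    for i, v in enumerate(x):
        total += v
        if -eps < total < 0:
            total = 0.0
        out[i] = total
    return out

print(clamped_cumsum([0, 4999, -5000, 1000], eps=2))  # [   0. 4999.    0. 1000.]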
I wonder if rounding will do what you are asking for:
np.cumsum(np.around(a,-1))
# the -1 means it rounds to the nearest 10
gives
array([ 0, 5000, 0, 1000])
It is not exactly as you put in your expected array from your answer, but using around, perhaps with the decimals parameter set to 0, might work when you apply it to the problem with floats.
Probably the best way to go is to write this bit in Cython (name the file cumsum_eps.pyx):
cimport numpy as cnp
import numpy as np
cdef inline _cumsum_eps_f4(float *A, int ndim, int dims[], float *out, float eps):
    cdef float sum
    cdef size_t ofs
    N = 1
    for i in xrange(0, ndim - 1):
        N *= dims[i]
    ofs = 0
    for i in xrange(0, N):
        sum = 0
        for k in xrange(0, dims[ndim-1]):
            sum += A[ofs]
            if abs(sum) < eps:
                sum = 0
            out[ofs] = sum
            ofs += 1

def cumsum_eps_f4(cnp.ndarray[cnp.float32_t, mode='c'] A, shape, float eps):
    cdef cnp.ndarray[cnp.float32_t] _out
    cdef cnp.ndarray[cnp.int_t] _shape
    N = np.prod(shape)
    out = np.zeros(N, dtype=np.float32)
    _out = <cnp.ndarray[cnp.float32_t]> out
    _shape = <cnp.ndarray[cnp.int_t]> np.array(shape, dtype=np.int)
    _cumsum_eps_f4(&A[0], len(shape), <int*> &_shape[0], &_out[0], eps)
    return out.reshape(shape)

def cumsum_eps(A, axis=None, eps=np.finfo('float').eps):
    A = np.array(A)
    if axis is None:
        A = np.ravel(A)
    else:
        axes = list(xrange(len(A.shape)))
        axes[axis], axes[-1] = axes[-1], axes[axis]
        A = np.transpose(A, axes)
    if A.dtype == np.float32:
        out = cumsum_eps_f4(np.ravel(np.ascontiguousarray(A)), A.shape, eps)
    else:
        raise ValueError('Unsupported dtype')
    if axis is not None: out = np.transpose(out, axes)
    return out
then you can compile it like this (Windows, Visual C++ 2008 Command Line):
\Python27\Scripts\cython.exe cumsum_eps.pyx
cl /c cumsum_eps.c /IC:\Python27\include /IC:\Python27\Lib\site-packages\numpy\core\include
link /dll cumsum_eps.obj C:\Python27\libs\python27.lib /OUT:cumsum_eps.pyd
or like this (Linux use .so extension/Cygwin use .dll extension, gcc):
cython cumsum_eps.pyx
gcc -c cumsum_eps.c -o cumsum_eps.o -I/usr/include/python2.7 -I/usr/lib/python2.7/site-packages/numpy/core/include
gcc -shared cumsum_eps.o -o cumsum_eps.so -lpython2.7
and use like this:
from cumsum_eps import *
import numpy as np
x = np.array([[1,2,3,4], [5,6,7,8]], dtype=np.float32)
>>> print cumsum_eps(x)
[ 1. 3. 6. 10. 15. 21. 28. 36.]
>>> print cumsum_eps(x, axis=0)
[[ 1. 2. 3. 4.]
[ 6. 8. 10. 12.]]
>>> print cumsum_eps(x, axis=1)
[[ 1. 3. 6. 10.]
[ 5. 11. 18. 26.]]
>>> print cumsum_eps(x, axis=0, eps=1)
[[ 1. 2. 3. 4.]
[ 6. 8. 10. 12.]]
>>> print cumsum_eps(x, axis=0, eps=2)
[[ 0. 2. 3. 4.]
[ 5. 8. 10. 12.]]
>>> print cumsum_eps(x, axis=0, eps=3)
[[ 0. 0. 3. 4.]
[ 5. 6. 10. 12.]]
>>> print cumsum_eps(x, axis=0, eps=4)
[[ 0. 0. 0. 4.]
[ 5. 6. 7. 12.]]
>>> print cumsum_eps(x, axis=0, eps=8)
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 8.]]
>>> print cumsum_eps(x, axis=1, eps=3)
[[ 0. 0. 3. 7.]
[ 5. 11. 18. 26.]]
and so on, of course normally eps would be some small value, here integers are used just for the sake of demonstration / easiness of typing.
If you need this for double as well the _f8 variants are trivial to write and another case has to be handled in cumsum_eps().
When you're happy with the implementation you should make it a proper part of your setup.py; see the Cython documentation on building with setup.py.
Update #1: If you have good compiler support in your runtime environment, you could try Theano to implement either the compensation algorithm or your original idea:
import numpy as np
import theano
import theano.tensor as T
from theano.ifelse import ifelse
A=T.vector('A')
sum=T.as_tensor_variable(np.asarray(0, dtype=np.float64))
res, upd=theano.scan(fn=lambda cur_sum, val: ifelse(T.lt(cur_sum+val, 1.0), np.asarray(0, dtype=np.float64), cur_sum+val), outputs_info=sum, sequences=A)
f=theano.function(inputs=[A], outputs=res)
f([0.9, 2, 3, 4])
will give [0 2 3 4] as output. With either Cython or Theano you get roughly the performance of native code.

numpy interpolation to increase a vector size

Hi, I have to increase the number of points in a vector to enlarge it to a fixed size. For example:
for this simple vector
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> len(a)
# 6
Now I want to get a vector of size 11, taking the vector a as the base; the result would be
# array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
EDIT 1
What I need is a function that takes the base vector and the number of values the resulting vector must have, and returns a new vector of that size, something like
def enlargeVector(vector, size):
    .....
    return newVector
to use like:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> b = enlargeVector(a, 200)
>>> len(b)
# 200
and b contains the result of linear, cubic, or whatever other interpolation method.
There are many methods to do this within scipy.interpolate. My favourite is UnivariateSpline, which produces an order-k spline that is (k-1)-times continuously differentiable.
To use it:
from scipy.interpolate import UnivariateSpline
old_indices = np.arange(0,len(a))
new_length = 11
new_indices = np.linspace(0,len(a)-1,new_length)
spl = UnivariateSpline(old_indices,a,k=3,s=0)
new_array = spl(new_indices)
The s is a smoothing factor that you should set to 0 in this case (since the data are exact).
Note that for the problem you have specified (since a just increases monotonically by 1), this is overkill, since the second np.linspace gives already the desired output.
EDIT: clarified that the length is arbitrary
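Wrapped into the function shape the question asks for (enlargeVector is the question's own name), the same approach might look like this sketch:

import numpy as np
from scipy.interpolate import UnivariateSpline

def enlargeVector(vector, size, k=3):
    # Resample 'vector' to 'size' points; s=0 makes the spline pass
    # exactly through the original data.
    old_indices = np.arange(len(vector))
    new_indices = np.linspace(0, len(vector) - 1, size)
    spl = UnivariateSpline(old_indices, vector, k=k, s=0)
    return spl(new_indices)

a = np.array([0, 1, 2, 3, 4, 5])
print(enlargeVector(a, 11))  # [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5. ]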
As AGML pointed out there are tools to do this, but how about a pure numpy solution:
In [20]: a = np.arange(6)
In [21]: temp = np.dstack((a[:-1], a[:-1] + np.diff(a) / 2.0)).ravel()
In [22]: temp
Out[22]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
In [23]: np.hstack((temp, [a[-1]]))
Out[23]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
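For plain linear interpolation to an arbitrary length, np.interp also gets there in one call (a sketch, not from either answer above):

import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])
new_length = 11
# Evaluate a piecewise-linear interpolant of a at new_length evenly spaced positions.
b = np.interp(np.linspace(0, len(a) - 1, new_length), np.arange(len(a)), a)
print(b)  # [0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5. ]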
