Vectorized assignment of a 2-dimensional array - python

I work with Python 2.7, numpy and pandas.
I have:
a function y=f(x) where both x and y are scalars.
a one-dimensional array of scalars of length n: [x0, x1, ..., x(n-1)]
I need to construct a 2-dimensional array D[i,j]=f(xi)*f(xj) where i,j are indices in [0,...,n-1].
I could use loops and/or a list comprehension, but that would be slow. I would like to use a vectorized approach instead.
I thought that numpy.indices would help me (see Create a numpy matrix with elements a function of indices), but I admit I am at a loss on how to use that function for my purpose.
Thanks in advance!

Ignore the comments that dismiss vectorization; it's a good habit to have, and it does deliver performance with the right accelerators. What you actually want here is the outer product:
import numpy

x_ = numpy.array(x)  # make sure x is an ndarray rather than a plain list
y = f(x_)            # assumes f works elementwise on arrays
numpy.outer(y, y)    # D[i, j] = y[i] * y[j] = f(x[i]) * f(x[j])
If you're working with numbers you should be working with numpy data structures anyway. Then you get fast, readable code like this.
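For a self-contained illustration, here is a minimal sketch with a stand-in f (the squaring function is just an invented example), checked against the explicit double loop:

import numpy as np

def f(x):
    return x ** 2   # stand-in for the real scalar function

x = np.array([1.0, 2.0, 3.0])
y = f(x)            # elementwise, since f uses only array-friendly operations
D = np.outer(y, y)  # D[i, j] = f(x[i]) * f(x[j])

# sanity check against the loop version
assert np.allclose(D, [[f(a) * f(b) for b in x] for a in x])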

I would like to use a vectorized approach instead.
You sound like you might be a Matlab user -- you should be aware that numpy's vectorize function provides no performance benefit:
The vectorize function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
Unless there already happens to be an operation in numpy that does exactly what you want, you're going to be stuck with numpy.vectorize, which gains you nothing over a for loop. That said, you should be able to do it like so:
import numpy

def makeArray():
    a = [1, 2, 3, 4]
    def addTo(idx):
        # idx runs over the flattened 4x4 grid; // and % recover row/column
        return f(a[idx // 4]) * f(a[idx % 4])  # // keeps the index an integer
    vecAdd = numpy.vectorize(addTo)
    return vecAdd(numpy.arange(4 * 4).reshape(4, 4))
EDIT:
If f is actually a one-dimensional array, you can do this:
f_matrix = numpy.matrix(f)
D = f_matrix.T * f_matrix
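Since np.matrix is discouraged in current NumPy, a sketch of the same result with plain arrays is a broadcasted product:

import numpy as np

f_arr = np.asarray(f, dtype=float)
D = f_arr[:, None] * f_arr[None, :]  # (n, 1) * (1, n) broadcasts to (n, n)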

You can use frompyfunc to vectorize the function, then use a dot product to multiply:
import numpy as np

f2 = np.frompyfunc(f, 1, 1)  # wrap the scalar function (1 input, 1 output)
res1 = f2(x).astype(float)   # frompyfunc returns an object array, so cast
res1 = res1[np.newaxis]      # result has to be 2D for the next step
res2 = np.dot(res1.T, res1)  # res2[i, j] = f(x[i]) * f(x[j])
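Note that np.frompyfunc still calls f once per element in Python, so it's a convenience rather than a speed-up. If f happens to accept arrays elementwise, the whole thing collapses to one line (same assumption as in the outer-product answer above):

import numpy as np

res2 = np.outer(f(x), f(x))  # res2[i, j] = f(x[i]) * f(x[j])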

Related

How do I vectorize a function in python or numpy?

For instance, in Julia language, a function can easily be vectorized as shown
function circumference_of_circle(r)
    return 2*π * r
end

a = collect([i for i=1:200])
circumference_of_circle.(a) # easy vectorization using just the dot (.)
Although I like Julia very much, it has not matured like Python.
Is there a similar vectorization technique for Python functions?
In [1]: def foo(r):
...: return 2*np.pi * r
...:
In [2]: arr = np.arange(5)
In [3]: foo(arr)
Out[3]: array([ 0. , 6.28318531, 12.56637061, 18.84955592, 25.13274123])
All operations in your function work with numpy arrays. There's no need to do anything special.
If your function only works with scalar arguments, "vectorizing" becomes trickier, especially if you are seeking compiled performance.
Have you spent much time reading the numpy basics? https://numpy.org/doc/stable/user/absolute_beginners.html
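To make the scalar-only case concrete, here is a minimal sketch using np.vectorize on an invented function whose if branch would fail on an array argument; the wrapper restores elementwise calling, but it is still a Python loop inside:

import numpy as np

def clipped_circumference(r):
    if r < 0:        # works only for scalars: an array here is ambiguous
        return 0.0
    return 2 * np.pi * r

vec = np.vectorize(clipped_circumference)  # convenience, not speed
vec(np.array([-1.0, 0.5, 2.0]))            # array([ 0.        ,  3.14159265, 12.56637061])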
===
I don't know Julia, but this code
function _collect(::Type{T}, itr, isz::SizeUnknown) where T
    a = Vector{T}()
    for x in itr
        push!(a, x)
    end
    return a
end
looks a lot like
def foo(func, arr):
    alist = []
    for i in arr:
        alist.append(func(i))
    return alist  # or np.array(alist)
or, equivalently, the list comprehension proposed in the other answer, or:
list(map(func, arr))
I'm not familiar with Julia or vectorization of functions, but if I'm understanding correctly, there are a few ways to do this in Python. The plain-jane Python way is a list comprehension.
An example using your circumference function would be:
def circumference_of_circle(r):
    return 2 * 3.14159 * r

circles = [[x, circumference_of_circle(x)] for x in range(1, 201)]
print(circles)
The circles list will contain inner lists holding both the radius (generated by range()) and its circumference. Like Julia's dot vectorization, a Python list comprehension is just shorthand for a loop, but it consumes an iterable and returns a list, so it is very handy.
Your function contains only simple math, and Python's numpy and pandas modules are designed so that such operations can be applied directly to their array structures.
import numpy as np

a = np.array([1, 2, 3, 4])

def circumference_of_circle(r):
    return 2 * np.pi * r

print(circumference_of_circle(a))  # [ 6.28318531 12.56637061 18.84955592 25.13274123]
More complicated functions cannot be applied directly to an array. You may be able to rewrite the function in a vectorized way, for example using np.where for conditions that would be represented by an if block within a normal function.
If this isn't an option, or speed is not a major concern, then you can iterate over the array with a list comprehension [func(v) for v in arr], numpy's vectorize, or pandas's apply. You can sometimes optimize these approaches by pre-compiling parts of the code.
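As a sketch of the np.where rewrite mentioned above (the step function is an invented example):

import numpy as np

def step_area(r):
    # scalar-only version: the if prevents direct use on an array
    if r < 1:
        return 0.0
    return np.pi * r ** 2

r = np.linspace(0, 2, 5)
# vectorized rewrite: the condition is evaluated elementwise
areas = np.where(r < 1, 0.0, np.pi * r ** 2)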

Piecewise Operation on List of Numpy Arrays

My question is, can I make a function or variable that can perform an operation or numpy method on each np.array element within a list in a more succinct way than what I have below (preferably by just calling one function or variable)?
Generating the list of arrays:
import numpy as np
array_list = [np.random.rand(3,3) for x in range(5)]
array_list
Current Technique of operating on each element:
My current method (as seen below) involves unpacking it and doing something to it:
[arr.std() for arr in array_list]
[arr + 2 for arr in array_list]
Goal:
My hope is to get something that could perform the operations above by simply typing:
x.std()
or
x + 2
Yes - use an actual NumPy array and perform your operations over the desired axes, instead of having them stuffed in a list.
actual_array = np.array(array_list)
actual_array.std(axis=(1, 2))
# array([0.15792346, 0.25781021, 0.27554279, 0.2693581 , 0.28742179])
If you generally wanted all axes except the first, this could be something like tuple(range(1, actual_array.ndim)) instead of explicitly specifying the tuple.
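For completeness, a quick sketch showing that both desired spellings then work directly on the stacked array:

import numpy as np

array_list = [np.random.rand(3, 3) for _ in range(5)]
x = np.array(array_list)             # shape (5, 3, 3)

x.std(axis=tuple(range(1, x.ndim)))  # one std per original array, any ndim
x + 2                                # elementwise over all arrays at once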

iterate over two numpy arrays return 1d array

I often have a function that returns a single value such as a maximum or integral. I then would like to iterate over another parameter. Here is a trivial example using a parabola. I don't think it's broadcasting since I only want the 1D array; in this case it's the maximums. A real-world example is the maximum power point of a solar cell as a function of light intensity, but the principle is the same as this example.
import numpy as np
x = np.linspace(-1,1) # sometimes this is read from file
parameters = np.array([1,12,3,5,6])
maximums = np.zeros_like(parameters)
for idx, parameter in enumerate(parameters):
    y = -x**2 + parameter
    maximums[idx] = np.max(y)  # after I have the maximum I don't need the rest of the data
print(maximums)
What is the best way to do this in Python/Numpy? I know one simplification is to make the function a def and then use np.vectorize, but my understanding is that it doesn't make the code any faster.
Extend one of those arrays to 2D and then let broadcasting do those outer additions in a vectorized way -
maximums = (-x**2 + parameters[:,None]).max(1).astype(parameters.dtype)
Alternatively, with the explicit use of the outer addition method -
np.add.outer(parameters, -x**2).max(1).astype(parameters.dtype)
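Putting it together with the data from the question, a minimal sketch of the broadcasted version (it produces the same array as the original loop):

import numpy as np

x = np.linspace(-1, 1)
parameters = np.array([1, 12, 3, 5, 6])

# parameters[:, None] has shape (5, 1) and -x**2 has shape (50,),
# so the sum broadcasts to a (5, 50) grid; .max(1) reduces each row
maximums = (-x**2 + parameters[:, None]).max(1).astype(parameters.dtype)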

Defining a matrix with unknown size in python

I want to use a matrix in my Python code but I don't know the exact size of my matrix to define it.
For other matrices, I have used np.zeros(a), where a is known.
What should I do to define a matrix with unknown size?
In this case, one approach is to use a Python list and append to it until it reaches the desired size, then cast it to a np.array.
pseudocode:
matrix = []
while matrix not full:
    matrix.append(elt)
matrix = np.array(matrix)
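A runnable version of that pseudocode, with invented placeholder rows just for illustration:

import numpy as np

rows = []
for i in range(4):                # stands in for "while matrix not full"
    rows.append([i, i**2, i**3])  # stands in for elt
matrix = np.array(rows)           # shape (4, 3)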
You could write a function that tries to modify the np.array, and expand if it encounters an IndexError:
import numpy as np

def set_expand(x, r, c, val):
    # set x[r, c], zero-padding the array first if the index is out of bounds
    try:
        x[r, c] = val
    except IndexError:
        r0, c0 = x.shape
        r_ = r + 1 - r0  # how many rows are missing
        c_ = c + 1 - c0  # how many columns are missing
        if r_ > 0:
            x = np.concatenate([x, np.zeros((r_, x.shape[1]))], axis=0)
        if c_ > 0:
            x = np.concatenate([x, np.zeros((x.shape[0], c_))], axis=1)
        x[r, c] = val
    return x

x = np.random.normal(size=(2, 2))
x = set_expand(x, 5, 10, 1.0)
There are problems with this implementation, though. First, it copies the array and returns a concatenation of it, which can become a bottleneck if you call it many times. Second, the code only handles a single element; you could extend it to slices with more effort, or go the whole nine yards and create a new class inheriting from np.ndarray that overrides __getitem__ and __setitem__.
Or you could just use a huge matrix, or better yet, see if you can avoid having to work with matrices of unknown size.
If you have a python generator you can use np.fromiter:
def gen():
    yield 1
    yield 2
    yield 3
In [11]: np.fromiter(gen(), dtype='int64')
Out[11]: array([1, 2, 3])
Beware: if you pass an infinite iterator you will most likely crash Python, so it's often a good idea to cap the length (with the count argument):
In [21]: from itertools import count # an infinite iterator
In [22]: np.fromiter(count(), dtype='int64', count=3)
Out[22]: array([0, 1, 2])
Best practice is usually to either pre-allocate (if you know the size) or build the array as a list first (using list.append). But lists don't build in 2d very well, which I assume you want since you specified a "matrix."
In that case, I'd suggest pre-allocating an oversize scipy.sparse matrix. These can be defined to have a size much larger than your memory, and lil_matrix or dok_matrix can be built sequentially. Then you can pare it down once you enter all of your data.
import numpy as np
from scipy.sparse import dok_matrix

dummy = dok_matrix((1000000, 1000000))  # as big as you think you might need
for i, j, data in generator():
    dummy[i, j] = data

s = np.array(list(dummy.keys())).max() + 1  # largest index actually used
M = dummy.tocsr()[:s, :s]  # csr supports slicing; or tobsr, toarray, ...
This way you build your array as a Dictionary of Keys (dictionaries support dynamic assignment much better than ndarray does), but you still have a matrix-like output that can be (somewhat) efficiently used for math, even in a partially built state.

Vectorization in Numpy - Broadcasting

I have code in Python with the following elements:
I have an intensities vector which is something like this:
array([ 1142., 1192., 1048., ..., 29., 18., 35.])
I have also an x vector which looks like this:
array([ 0, 1, 1, ..., 1060, 1060, 1061])
Then, I have the for loop where I fill another vector, radialDistribution like this:
for i in range(1000):
    radialDistribution[i] = sum(intensities[np.where(x == i)]) / len(np.where(x == i)[0])
The problem is that it takes 20 seconds to complete... therefore I want to vectorize it. But I am quite new to broadcasting in Numpy and haven't found much out there... so I need your help.
I tried this, but didn't work:
i = np.ogrid[:1000]
intensities[i] = sum(sortedIntensities1D[np.where(sortedDists1D == i)]) / len(np.where(sortedDists1D == i)[0])
Could you help me by telling me where I should look to learn vectorization procedures with Numpy?
Thanks in advance for your valuable help!
If your x vector has consecutive integers starting at 0, then you can simply do:
radialDistribution = np.bincount(x, weights=intensities) / np.bincount(x)
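A tiny sketch of why this works, on made-up data: np.bincount with weights gives per-label sums, and without weights gives per-label counts, so the ratio is the per-label mean.

import numpy as np

x = np.array([0, 1, 1, 2, 2, 2])
intensities = np.array([10., 20., 30., 40., 50., 60.])

np.bincount(x, weights=intensities) / np.bincount(x)
# array([10., 25., 50.])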
Here is my implementation of group_by functionality in numpy. It is conceptually similar to the pandas solution, except that it does not require pandas, and ought to become part of the numpy core, in my opinion.
Using this functionality, your code would look like this:
radialDistribution = group_by(x).mean(intensities)
and would complete in no time.
Look also at the test_radial function defined at the end, which may come even closer to your end goal.
Here's a method that uses broadcasting:
# arrays need to be at least 2D for broadcasting
x = np.atleast_2d(x)
# create vector of indices
i = np.atleast_2d(np.arange(x.size))
# do the vectorized calculation
bool_eq = (x == i.T)
totals = np.sum(np.where(bool_eq, intensities, 0), axis=1)
rD = totals / np.sum(bool_eq, axis=1)
This uses broadcasting two times: in the operation x == i.T and in the call to np.where. Unfortunately the code above is very slow, even slower than the original. The main bottleneck here is np.where, which we can speed up in this case by taking the product of the Boolean array and the intensities (also by broadcasting):
totals = np.sum(bool_eq*intensities, axis=1)
And this is essentially the same as a matrix-vector product, so we can write:
totals = np.dot(intensities, bool_eq.T)
The end result is faster code than the original (at least until the memory use of the intermediate array becomes the limiting factor), but you're probably better off with an iterative approach, as suggested by one of the other answers.
Edit: making use of np.einsum was faster still (in my trial):
totals = np.einsum('ij,j', bool_eq, intensities)
Building on my itertools.groupby solution in https://stackoverflow.com/a/22265803/901925 here's a solution that works on 2 small arrays.
import numpy as np
import itertools
intensities = np.arange(12, dtype=float)
x = np.array([1,0,1,2,2,1,0,0,1,2,1,0])  # general, not sorted or consecutive
First, a bincount solution, adjusted for non-consecutive values:
# using bincount
# if 'x' are not consecutive
J = np.bincount(x) > 0
print(np.bincount(x, weights=intensities)[J] / np.bincount(x)[J])
Now a groupby solution
# using groupby;
# sort first if needed
I = np.argsort(x)
x = x[I]
intensities = intensities[I]

# make a record array for use by groupby
xi = np.zeros(shape=x.shape, dtype=[('intensities', float), ('x', int)])
xi['intensities'] = intensities
xi['x'] = x

g = itertools.groupby(xi, lambda z: z['x'])
xx = np.array([np.array([z[0] for z in y[1]]).mean() for y in g])
print(xx)
Here's a compact numpy solution, using the return_index option of np.unique, and np.split. x should be sorted. I'm not optimistic about the speed for large arrays, since there will be iteration in unique and split in addition to the comprehension.
[values, index] = np.unique(x, return_index=True)
[y.mean() for y in np.split(intensities, index[1:])]
