Vectorized iterative fixpoint search - Python

I have a function that will always converge to a fixpoint, e.g. f(x) = (x-a)/2 + a. I have a function that will find this fixpoint by repeatedly invoking the function:
def find_fix_point(f, x):
    while f(x) > 0.1:
        x = f(x)
    return x
This works fine. Now I want to do this for a vectorized version:
def find_fix_point(f, x):
    while (f(x) > 0.1).any():
        x = f(x)
    return x
However, this is quite inefficient if most of the instances only need about 10 iterations and one needs 1000. What is a fast method to remove the `x` values that have already been found?
The code can use numpy or scipy.

One way to solve this would be to use recursion:
def find_fix_point_recursive(f, x):
    ind = x > 0.1
    if ind.any():
        x[ind] = find_fix_point_recursive(f, f(x[ind]))
    return x
With this implementation, we only call f on the points which need to be updated.
Note that by using recursion we avoid repeatedly doing the x > 0.1 check on the full array; each recursive call works on a smaller and smaller array.
%timeit x = np.zeros(10000); x[0] = 10000; find_fix_point(f, x)
1000 loops, best of 3: 1.04 ms per loop
%timeit x = np.zeros(10000); x[0] = 10000; find_fix_point_recursive(f, x)
10000 loops, best of 3: 141 µs per loop

First, for generality, I change the criterion to fit the fixed-point definition: we stop when |x - f(x)| <= epsilon.
You can mix boolean indexing and integer indexing to keep track of the active points at each step. Here is one way to do that:
def find_fix_point(f, x, epsilon):
    ind = np.mgrid[:len(x)]  # initial indices
    while ind.size > 0:
        xind = x[ind]  # integer indexing
        yind = f(xind)
        x[ind] = yind
        ind = ind[abs(yind - xind) > epsilon]  # boolean indexing
An example with a lot of fixed points:
from matplotlib.pyplot import plot,show
x0=np.linspace(0,1,1000)
x = x0.copy()
def f(x): return x*np.sin(1/x)
find_fix_point(f,x,1e-5)
plot(x0,x,'.');show()

The general method is to use boolean indexing to compute only the entries that have not yet reached equilibrium.
I adapted the algorithm given by Jonas Adler to avoid hitting the maximum recursion depth:
def find_fix_point_vector(f, x):
    x = x.copy()
    x_fix = np.empty(x.shape)
    unfixed = np.full(x.shape, True, dtype=bool)
    while unfixed.any():
        x_new = f(x)                  # apply one iteration to the still-unfixed values
        x_fix[unfixed] = x_new        # copy the current values into the result
        cond = np.abs(x_new - x) > 1  # find out which ones are still not fixed
        unfixed[unfixed] = cond
        x = x_new[cond]               # keep only the x values that still need to be computed
    return x_fix
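A minimal usage sketch, assuming the example function from the question (f(x) = (x - a)/2 + a) and the tolerance of 1 hard-coded above:
a = 5.0
f = lambda x: (x - a) / 2 + a           # converges to the fixpoint x = a
x0 = np.linspace(0, 100000, 10000)      # same kind of input as in the timings below
result = find_fix_point_vector(f, x0)   # every entry ends up within 1 of the fixpoint a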
Edit:
Here I review the three proposed solutions. I will call the different functions according to their proposer: find_fix_Jonas, find_fix_Jurg, find_fix_BM. I changed the fixpoint condition in all functions according to B.M. (see the updated function of Jonas at the end).
Speed:
%timeit find_fix_BM(f, np.linspace(0,100000,10000),1)
100 loops, best of 3: 2.31 ms per loop
%timeit find_fix_Jonas(f, np.linspace(0,100000,10000))
1000 loops, best of 3: 1.52 ms per loop
%timeit find_fix_Jurg(f, np.linspace(0,100000,10000))
1000 loops, best of 3: 1.28 ms per loop
In terms of readability, I think Jonas's version is the easiest one to understand, so it should be chosen when speed does not matter very much.
Jonas's version, however, might raise a RuntimeError when the number of iterations until the fixpoint is reached is large (>1000), because the recursion exceeds the maximum recursion depth. The other two solutions do not have this drawback.
The version of B.M., however, might be easier to understand than the version proposed by me.
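If the recursive version is still preferred for slowly converging inputs, one possible workaround (not from the original answers, just a sketch) is to raise Python's recursion limit, at the cost of more stack usage:
import sys
sys.setrecursionlimit(10000)  # the default is usually 1000; raise it with care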
Version of Jonas used:
def find_fix_Jonas(f, x):
    fx = f(x)
    ind = np.abs(fx - x) > 1
    if ind.any():
        fx[ind] = find_fix_Jonas(f, fx[ind])
    return fx

...remove `x` that already have been found?
Create a new array using boolean indexing with your condition.
>>> a = np.array([3,1,6,3,9])
>>> a != 3
array([False, True, True, False, True], dtype=bool)
>>> b = a[a != 3]
>>> b
array([1, 6, 9])

Related

Which operator (+ vs +=) should be used for performance? (in-place vs. not-in-place)

Let's say I have these two snippets of code in Python:
1 --------------------------
import numpy as np
x = np.array([1,2,3,4])
y = x
x = x + np.array([1,1,1,1])
print y
2 --------------------------
import numpy as np
x = np.array([1,2,3,4])
y = x
x += np.array([1,1,1,1])
print y
I thought the result for y would be the same in both examples, since y points to x and x becomes (2,3,4,5), BUT it wasn't.
The results were (1,2,3,4) for 1 and (2,3,4,5) for 2.
After some research I found out that in the first example:
#-First example---------------------------------------
x = np.array([1,2,3,4])  # create x --> [1,2,3,4]
y = x                    # make y point to x
# until now we have x --> [1,2,3,4]
#                             |
#                             y
x = x + np.array([1,1,1,1])
# however, this operation **creates a new array** [2,3,4,5]
# and makes x point to it instead of the first one,
# so we have y --> [1,2,3,4] and x --> [2,3,4,5]
#-Second example--------------------------------------
x = np.array([1,2,3,4])  # create x --> [1,2,3,4]
y = x                    # make y point to x
# until now, the same: x --> [1,2,3,4]
#                               |
#                               y
x += np.array([1,1,1,1])
# this operation **modifies the existing array**,
# so the result is    x --> [2,3,4,5]
#                              |
#                              y
You can find out more about this behavior (not only for this example) at this link: In-place algorithm.
My question is: being aware of this behavior, why should I use in-place operations in terms of performance? (Faster execution time? Less memory allocation? ...)
EDIT: Clarification
The (+ vs +=) example was just a simple way to explain in-place algorithms to those who don't know them, but the question is about the use of in-place algorithms in general, not only in this case.
As another, more complex example: loading a CSV file (say 10 million rows) into a variable and then sorting the result. Is the idea of an in-place algorithm to produce the output in the same memory space that contains the input, by successively transforming the data until the output is produced? This avoids the need for twice the storage: one area for the input and an equal-sized area for the output (using the minimum amount of RAM, hard disk, ...).
x = x + 1 vs x += 1
Performance
It seems that you understand the semantic difference between x += 1 and x = x + 1.
For benchmarking, you can use timeit in IPython.
After defining those functions:
import numpy as np

def in_place(n):
    x = np.arange(n)
    x += 1

def not_in_place(n):
    x = np.arange(n)
    x = x + 1

def in_place_no_broadcast(n):
    x = np.arange(n)
    x += np.ones(n, dtype=int)
You can simply use the %timeit syntax to compare performances:
%timeit in_place(10**7)
20.3 ms ± 81.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit not_in_place(10**7)
30.4 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit in_place_no_broadcast(10**7)
35.4 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
not_in_place is 50% slower than in_place.
Note that broadcasting also makes a huge difference: numpy understands x += 1 as adding 1 to every single element of x, without having to create yet another array.
Warning
in_place should be the preferred function: it's faster and uses less memory. You might run into bugs if you use and mutate this object at different places in your code, though. The typical example would be:
x = np.arange(5)
y = [x, x]
y[0][0] = 10
y
# [array([10, 1, 2, 3, 4]), array([10, 1, 2, 3, 4])]
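A simple way to avoid that aliasing surprise, if the two entries are meant to be independent, is to copy explicitly (just a sketch of the idea):
x = np.arange(5)
y = [x.copy(), x.copy()]  # independent arrays: mutating one leaves the other intact
y[0][0] = 10
y
# [array([10,  1,  2,  3,  4]), array([0, 1, 2, 3, 4])]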
Sorting
Your understanding of the advantages of in-place sorting is correct. It can make a huge difference in memory requirements when sorting large data sets.
There are other desirable features for a sorting algorithm (stable, acceptable worst-case complexity, ...) and it looks like the standard Python algorithm (Timsort) has many of them.
Timsort is a hybrid algorithm. Some parts of it are in-place and some require extra memory. It will never use more than n/2 extra space, though.
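For illustration, the same in-place vs copy distinction shows up when sorting NumPy arrays (a sketch added here, not part of the original answer):
import numpy as np

big = np.random.rand(10**7)
sorted_copy = np.sort(big)  # allocates a second full-size array
big.sort()                  # sorts in place, no full-size extra array needed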

Optimize Python: Large arrays, memory problems

I'm having a speed problem running Python/NumPy code. I don't know how to make it faster; maybe someone else does?
Assume there is a surface with two triangulations, one fine (..._fine) with M points, and one coarse with N points. Also, there's data on the coarse mesh at every point (N floats). I'm trying to do the following:
For every point on the fine mesh, find the k closest points on the coarse mesh and take the mean of their values. In short: interpolate data from coarse to fine.
My code right now looks like this. With large data (in my case M = 2e6, N = 1e4) the code runs for about 25 minutes, I guess due to the explicit for loop not going into numpy. Any ideas how to solve this with smart indexing? M x N arrays blow the RAM...
import numpy as np

# p_fine.shape == (m, 3), p.shape == (n, 3), data_coarse.shape == (n,)
data_fine = np.empty((m,))
for i, ps in enumerate(p_fine):
    data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps - p, axis=1))[:k]])
Cheers!
First of all, thanks for the detailed help.
Divakar, your solutions gave a substantial speed-up. With my data, the code ran for just below 2 minutes, depending a bit on the chunk size.
I also tried my way around sklearn and ended up with
def sklearnSearch_v3(p, p_fine, k):
    neigh = NearestNeighbors(n_neighbors=k)
    neigh.fit(p)
    return data_coarse[neigh.kneighbors(p_fine)[1]].mean(axis=1)
which ended up being quite fast. For my data sizes, I get the following:
import numpy as np
from sklearn.neighbors import NearestNeighbors
m,n = 2000000,20000
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 3
yields
%timeit sklearnSearch_v3(p, p_fine, k)
1 loop, best of 3: 7.46 s per loop
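For reference, scipy.spatial.cKDTree offers a similar k-nearest-neighbour query; this is just a sketch using the same variable names, not something benchmarked on the data above:
from scipy.spatial import cKDTree

def kdtree_search(p, p_fine, k):
    tree = cKDTree(p)                      # build the tree on the coarse points
    _, idx = tree.query(p_fine, k=k)       # indices of the k nearest coarse points
    return data_coarse[idx].mean(axis=1)   # average their data values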
Approach #1
We are working with large datasets and memory is an issue, so I will try to optimize the computations within the loop. Now, we can use np.einsum to replace the np.linalg.norm part and np.argpartition in place of the full sort done by np.argsort, like so:
out = np.empty((m,))
for i, ps in enumerate(p_fine):
    subs = ps - p
    sq_dists = np.einsum('ij,ij->i', subs, subs)
    out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
out = out / k
Approach #2
Now, as another approach, we can also use SciPy's cdist for a fully vectorized solution, like so:
from scipy.spatial.distance import cdist
out = data_coarse[np.argpartition(cdist(p_fine,p),k,axis=1)[:,:k]].mean(1)
But since we are memory bound here, we can perform these operations in chunks. Basically, we take chunks of rows from the tall array p_fine (which has millions of rows), use cdist on each chunk, and thus at each iteration obtain a chunk of output elements instead of just one scalar. With this, we cut the loop count by the length of that chunk.
So, finally we would have an implementation like so -
out = np.empty((m,))
L = 10  # length of chunk (to be used as a param)
num_iter = m // L
for j in range(num_iter):
    p_fine_slice = p_fine[L*j:L*j+L]
    out[L*j:L*j+L] = data_coarse[np.argpartition(cdist(p_fine_slice, p), k, axis=1)[:, :k]].mean(1)
Runtime test
Setup -
# Setup inputs
m,n = 20000,100
p_fine = np.random.rand(m,3)
p = np.random.rand(n,3)
data_coarse = np.random.rand(n)
k = 5
def original_approach(p, p_fine, m, n, k):
    data_fine = np.empty((m,))
    for i, ps in enumerate(p_fine):
        data_fine[i] = np.mean(data_coarse[np.argsort(np.linalg.norm(ps - p, axis=1))[:k]])
    return data_fine

def proposed_approach(p, p_fine, m, n, k):
    out = np.empty((m,))
    for i, ps in enumerate(p_fine):
        subs = ps - p
        sq_dists = np.einsum('ij,ij->i', subs, subs)
        out[i] = data_coarse[np.argpartition(sq_dists, k)[:k]].sum()
    return out / k

def proposed_approach_v2(p, p_fine, m, n, k, len_per_iter):
    L = len_per_iter
    out = np.empty((m,))
    num_iter = m // L
    for j in range(num_iter):
        p_fine_slice = p_fine[L*j:L*j+L]
        out[L*j:L*j+L] = data_coarse[np.argpartition(cdist(p_fine_slice, p), k, axis=1)[:, :k]].sum(1)
    return out / k
Timings -
In [134]: %timeit original_approach(p,p_fine,m,n,k)
1 loops, best of 3: 1.1 s per loop
In [135]: %timeit proposed_approach(p,p_fine,m,n,k)
1 loops, best of 3: 539 ms per loop
In [136]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=100)
10 loops, best of 3: 63.2 ms per loop
In [137]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=1000)
10 loops, best of 3: 53.1 ms per loop
In [138]: %timeit proposed_approach_v2(p,p_fine,m,n,k,len_per_iter=2000)
10 loops, best of 3: 63.8 ms per loop
So, there's about a 2x improvement with the first proposed approach, and a 20x improvement over the original approach with the second one at its sweet spot, with the len_per_iter param set at 1000. Hopefully this will bring your 25-minute runtime down to a little over a minute. Not bad, I guess!

Vectorize repetitive math function in Python

I have a mathematical function of the form $f(x)=\sum_{j=0}^{N} x^j \sin(jx)$ that I would like to compute efficiently in Python. N is of order ~100. This function f is evaluated thousands of times for all entries x of a huge matrix, and therefore I would like to improve its performance (the profiler indicates that the calculation of f takes up most of the time). In order to avoid the loop in the definition of f I wrote:
def f(x):
    J = np.arange(0, N + 1)
    return sum(x**J * np.sin(J * x))
The issue is that if I want to evaluate this function for all entries of a matrix, I would need to use numpy.vectorize first, but as far as I know that is not necessarily faster than a for loop.
Is there an efficient way to perform a calculation of this type?
Welcome to Stack Overflow! ^^
Well, calculating something ** 100 is a serious thing. But notice how, when you declare your array J, you are forcing your function to calculate x, x^2, x^3, x^4, ... (and so on) independently.
Let us take for example this function (which is what you are using):
def powervector(x, n):
    return x ** np.arange(0, n)
And now this other function, which does not even use NumPy:
def power(x, n):
    result = [1., x]
    aux = x
    for i in range(2, n):
        aux *= x
        result.append(aux)
    return result
Now, let us verify that they both calculate the same thing:
In []: sum(powervector(1.1, 10))
Out[]: 15.937424601000005
In []: sum(power(1.1, 10))
Out[]: 15.937424601000009
Cool, now let us compare the performance of both (in IPython):
In [36]: %timeit sum(powervector(1.1, 10))
The slowest run took 20.42 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 3.52 µs per loop
In [37]: %timeit sum(power(1.1, 10))
The slowest run took 5.28 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.13 µs per loop
It is faster, as you are not computing all the powers of x independently: you know that x^N == x^(N-1) * x and you take advantage of it.
You could use this to see if your performance improves. Of course you can change power() to use NumPy vectors as output. You can also have a look at Numba, which is easy to try and may improve performance a bit as well.
As you see, this is only a hint on how to improve some part of your problem. I bet there are a couple of other ways to further improve your code! :-)
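If you want to stay within NumPy, the same idea (building x^N from x^(N-1) with one multiplication) can be expressed in vectorized form with np.cumprod; this is just a sketch of the trick, not code from the question:
import numpy as np

def powervector_cumprod(x, n):
    # [1, x, x**2, ..., x**(n-1)], built with one multiplication per term
    return np.cumprod(np.r_[1.0, np.full(n - 1, x)])

sum(powervector_cumprod(1.1, 10))  # ~15.9374, matching the two functions above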
Edit
It seems that Numba might not be a bad idea... Simply adding the @numba.jit decorator:
import numba

@numba.jit
def powernumba(x, n):
    result = [1., x]
    aux = x
    for i in range(2, n):
        aux *= x
        result.append(aux)
    return result
Then:
In [52]: %timeit sum(power(1.1, 100))
100000 loops, best of 3: 7.67 µs per loop
In [51]: %timeit sum(powernumba(1.1, 100))
The slowest run took 5.64 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 2.64 µs per loop
It seems Numba can do some magic there. ;-)
For a scalar x:
>>> import numpy as np
>>> x = 0.5
>>> jj = np.arange(10)
>>> x**jj
array([ 1. , 0.5 , 0.25 , 0.125 , 0.0625 ,
0.03125 , 0.015625 , 0.0078125 , 0.00390625, 0.00195312])
>>> np.sin(jj*x)
array([ 0. , 0.47942554, 0.84147098, 0.99749499, 0.90929743,
0.59847214, 0.14112001, -0.35078323, -0.7568025 , -0.97753012])
>>> (x**jj * np.sin(jj*x)).sum()
0.64489974041068521
Notice the use of the sum method of numpy arrays (equivalently, use np.sum not built-in sum).
If your x is itself an array (here, one with three entries, hence the shapes below), use broadcasting:
>>> a = x[:, None]**jj
>>> a.shape
(3, 10)
>>> x[0]**jj == a[0]
array([ True, True, True, True, True, True, True, True, True, True], dtype=bool)
Then sum over a second axis:
>>> res = a * np.sin(jj * x[:, None])
>>> res.shape
(3, 10)
>>> res.sum(axis=1)
array([ 0.01230993, 0.0613201 , 0.17154859])
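Since the original question was about evaluating f for all entries of a matrix: the same broadcasting works with an extra trailing axis. A sketch with hypothetical shapes; note the intermediate array has shape matrix_shape + (N+1,), which costs memory:
import numpy as np

N = 100
jj = np.arange(N + 1)
X = np.random.rand(50, 60)                            # hypothetical matrix of x values
terms = X[..., None]**jj * np.sin(jj * X[..., None])  # shape (50, 60, N+1)
F = terms.sum(axis=-1)                                # F[i, k] == f(X[i, k]); shape (50, 60)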

Fastest way to populate a matrix with a function on pairs of elements in two numpy vectors?

I have two 1-dimensional numpy vectors, va and vb, which are being used to populate a matrix by passing all pair combinations to a function.
na = len(va)
nb = len(vb)
D = np.zeros((na, nb))
for i in range(na):
    for j in range(nb):
        D[i, j] = foo(va[i], vb[j])
As it stands, this piece of code takes a very long time to run, because va and vb are relatively large (4626 and 737 elements). However, I am hoping this can be improved, given that a similar procedure is performed by the cdist method from scipy with very good performance.
D = cdist(va, vb, metric)
I am obviously aware that scipy has the benefit of running this piece of code in C rather than in Python, but I'm hoping there is some numpy function I'm unaware of that can execute this quickly.
One of the least known numpy functions, among what the docs call functional programming routines, is np.frompyfunc. This creates a numpy ufunc from a Python function: not some other object that closely simulates a numpy ufunc, but a proper ufunc with all its bells and whistles. While its behavior is in many respects very similar to np.vectorize, it has some distinct advantages, which the following code hopefully highlights:
In [2]: def f(a, b):
...: return a + b
...:
In [3]: f_vec = np.vectorize(f)
In [4]: f_ufunc = np.frompyfunc(f, 2, 1) # 2 inputs, 1 output
In [5]: a = np.random.rand(1000)
In [6]: b = np.random.rand(2000)
In [7]: %timeit np.add.outer(a, b) # a baseline for comparison
100 loops, best of 3: 9.89 ms per loop
In [8]: %timeit f_vec(a[:, None], b) # 50x slower than np.add
1 loops, best of 3: 488 ms per loop
In [9]: %timeit f_ufunc(a[:, None], b) # ~20% faster than np.vectorize...
1 loops, best of 3: 425 ms per loop
In [10]: %timeit f_ufunc.outer(a, b) # ...and you get to use ufunc methods
1 loops, best of 3: 427 ms per loop
So while it is still clearly inferior to a properly vectorized implementation, it is a little faster (the looping is in C, but you still have the Python function call overhead).
cdist is fast because it is written in highly-optimized C code (as you already pointed out), and it only supports a small predefined set of metrics.
Since you want to apply the operation generically, to any given foo function, you have no choice but to call that function na * nb times. That part is not likely to be further optimizable.
What's left to optimize are the loops and the indexing. Some suggestions to try out:
Use xrange instead of range (if on Python 2.x; in Python 3, range is already lazy)
Use enumerate instead of range plus explicit indexing
Use a Python speed "magic", such as Cython or Numba, to speed up the looping process (see the sketch after this list)
If you can make further assumptions about foo, it might be possible to speed it up further.
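As one concrete instance of the Cython/Numba suggestion above, numba.vectorize can compile a Python foo into a real ufunc, so broadcasting replaces the double loop. A sketch; foo here is just a placeholder for the real pairwise function:
import numba
import numpy as np

@numba.vectorize(["float64(float64, float64)"])
def foo(a, b):
    return a - b              # placeholder for the real pairwise operation

va = np.random.rand(4626)
vb = np.random.rand(737)
D = foo(va[:, None], vb)      # (4626, 737) matrix; the loops run in compiled code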
As @shx2 said, it all depends on what foo is. If you can express it in terms of numpy ufuncs, then use the outer method:
In [11]: N = 400
In [12]: B = np.empty((N, N))
In [13]: x = np.random.random(N)
In [14]: y = np.random.random(N)
In [15]: %%timeit
for i in range(N):
    for j in range(N):
        B[i, j] = x[i] - y[j]
....:
10 loops, best of 3: 87.2 ms per loop
In [16]: %timeit A = np.subtract.outer(x, y) # <--- np.subtract is a ufunc
1000 loops, best of 3: 294 µs per loop
Otherwise you can push the looping down to cython level. Continuing a trivial example above:
In [45]: %%cython
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def foo(double[::1] x, double[::1] y, double[:, ::1] out):
    cdef int i, j
    for i in xrange(x.shape[0]):
        for j in xrange(y.shape[0]):
            out[i, j] = x[i] - y[j]
....:
In [46]: foo(x, y, B)
In [47]: np.allclose(B, np.subtract.outer(x, y))
Out[47]: True
In [48]: %timeit foo(x, y, B)
10000 loops, best of 3: 149 µs per loop
The cython example is deliberately made overly simplistic: in reality you might want to add some shape/stride checks, allocate the memory within your function etc.
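For instance, a thin Python wrapper can take care of allocating the output, so callers don't have to preallocate B themselves (a sketch, assuming the foo compiled in the Cython cell above):
import numpy as np

def foo_outer(x, y):
    out = np.empty((x.shape[0], y.shape[0]))  # allocate the result once
    foo(x, y, out)                            # fill it with the Cython loop
    return out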

Replace rarely occurring values in a pandas dataframe

I have a moderately large (~60,000 rows by 15 columns) csv file that I'm working on with pandas. Each row represents an individual and contains personal data. I want to render the data anonymous. One way I want to do so is by replacing values in a particular column where they are rare. I initially tried to do so as follows:
def clean_data(entry):
    if df[df.column_name == entry].index.size < 10:
        return 'RARE_VALUE'
    else:
        return entry

df.new_column_name = df.column_name.apply(clean_data)
But running it froze my system every time. This unfortunately means I have no useful debugging data. Does anyone know the correct way to do this? The column contains both strings and null values.
I think you want to group by the column name:
g = df.groupby('column_name')
You can use filter, for example, to return only those rows whose value in column_name appears at least 10 times:
g.filter(lambda x: len(x) >= 10)
To overwrite the column with 'RARE_VALUE' you can use transform (which calculates the result once for each group, and spreads it around appropriately):
df.loc[g[col].transform(lambda x: len(x) < 10).astype(bool), col] = 'RARE_VALUE'
As DSM points out, the following trick is much faster:
df.loc[df[col].value_counts()[df[col]].values < 10, col] = "RARE_VALUE"
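To see why the one-liner works, here is a small walk-through on a toy column (names are made up):
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a"])
counts = s.value_counts()   # a: 3, b: 1, c: 1
per_row = counts[s].values  # array([3, 1, 3, 1, 3]): each row's value mapped to its frequency
mask = per_row < 2          # rows whose value occurs fewer than 2 times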
Here's some timeit information (to show how impressive DSM's solution is!):
In [21]: g = pd.DataFrame(np.random.randint(1, 100, (1000, 2))).groupby(0)
In [22]: %timeit g.filter(lambda x: len(x) >= 10)
10 loops, best of 3: 67.2 ms per loop
In [23]: %timeit df.loc[g[1].transform(lambda x: len(x) < 10).values.astype(bool), 1]
10 loops, best of 3: 44.6 ms per loop
In [24]: %timeit df.loc[df[1].value_counts()[df[1]].values < 10, 1]
1000 loops, best of 3: 1.57 ms per loop
@Andy Hayden solves the issue in various ways. I would recommend using pipelines for this kind of task, though. The following may seem more unwieldy, but it comes in handy if you want to save the whole pipeline as an object, or if you have to generalize predictions to a test set:
class RemoveScarceValuesFeatureEngineer:

    def __init__(self, min_occurences):
        self._min_occurences = min_occurences
        self._column_value_counts = {}

    def fit(self, X, y):
        for column in X.columns:
            self._column_value_counts[column] = X[column].value_counts()
        return self

    def transform(self, X):
        for column in X.columns:
            X.loc[self._column_value_counts[column][X[column]].values
                  < self._min_occurences, column] = "RARE_VALUE"
        return X

    def fit_transform(self, X, y):
        self.fit(X, y)
        return self.transform(X)
You may find more information here: Pandas replace rare values
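A minimal usage sketch of the class above (the data and threshold are made up):
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "NY", "LA", "SF"]})
fe = RemoveScarceValuesFeatureEngineer(min_occurences=2)
print(fe.fit_transform(df, y=None))
# rows 0-2 keep "NY"; "LA" and "SF" occur only once and become "RARE_VALUE"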
