My goal is to create an array whose elements are normal(size=x) for each element x of it.
I am trying to optimize:
it = 2 ** arange(6, 25)
M = zeros(len(it))
for x in range(len(it)):
    M[x] = (normal(size=it[x]))
So far I have tried these, which do not work:
N = zeros(len(it))
it = 2 ** arange(6, 25)
N = (normal(size=it))
Furthermore, I tried:
N = (normal(size=it[:]))
Given my data, I believe that such manual work, or a for loop, is really inefficient, so I am trying to come up with a vectorized operation.
I receive:
File "mtrand.pyx", line 1335, in numpy.random.mtrand.RandomState.normal
File "common.pyx", line 557, in numpy.random.common.cont
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
You've not been very precise about where these functions come from, but I'm guessing that by normal(size=it[:]) you mean:
import numpy as np
it = 2 ** np.arange(6, 25)
np.random.normal(size=it)
which would be telling numpy to create a 19-dimensional array (i.e. len(it)) that contains ~6 × 10^85 elements (i.e. np.prod(it.astype(float)), computed as a float because the number overflows an int64). numpy is saying that it can't do that, which seems like a reasonable thing to do.
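You can verify those numbers directly (an illustrative check, using the same imports as above):

import numpy as np

it = 2 ** np.arange(6, 25)
print(len(it))                    # 19 -> the number of dimensions requested
print(np.prod(it.astype(float)))  # ~6.2e85 -> the total element count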
Numpy doesn't like the "ragged arrays" you're trying to create, and neither do most matrix/numeric libraries, hence support is limited!
I'm unsure why you consider the "loop really inefficient". You're creating ~33 million floats over 19 iterations of a simple Python loop. The vast majority of the time will be spent in highly optimised Numpy library code, and some tiny (basically unmeasurable) amount of time will be spent evaluating your Python bytecode.
If you really want a one-liner then you can do:
X = [np.random.normal(size=2**i) for i in range(6, 25)]
which makes the split between Numpy and Python worlds more obvious.
Note that on my laptop, the Python code executes in ~5µs while the Numpy code runs for ~800ms. So you're trying to optimise the 0.0006% part!
Note that it's not always a win to use Numpy's vectorisation; it only helps with larger arrays. For example, the above loop is "faster" than:
X = [np.random.normal(i) for i in 2**np.arange(6, 25)]
4.8 vs 5.1 µs for the Python code, because of the time spent marshalling objects into/out of the Numpy world. Again, none of this matters; just use whichever solution makes your code easier to understand. A few microseconds is nothing compared to seconds.
Here I have two numpy arrays, and a function that takes those arrays as input, does some numpy calculation, and returns the result. It works as is, but it's slow, and I think we can use multiprocessing to make it a bit faster.
Anyway, here's my code:
A = ...  # 4-dimensional numpy array
B = ...  # 1-dimensional numpy array

def function(A, B):
    P = np.einsum("ijkl,ij->kl", A, B)
    return P.astype(np.uint8)

result = function(A, B)
I'm still quite new to this multiprocessing stuff, but I think we should be able to put arrays A and B into shared memory (maybe using multiprocessing.Array()?) and then spawn multiple processes to compute function(A, B). But I still can't quite understand how to put all of that into code.
EDIT:
Alright, so it seems the approach above doesn't work, but let's try another case. Now say the length of array A is 120, and I want Process No. 1 to use 3/4 of array A (index 0 to 89) together with all of array B.
Then I also want Process No. 2 to use another 3/4 of array A (index 30 to 119) and all of array B. Will that help? Of course I can make array A even larger so that its parts are computed by even more processes, but the question is: will this concept work?
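For what it's worth, here is a minimal sketch of the shared-memory idea (my own illustration, not from the post; it assumes Python 3.8+ for multiprocessing.shared_memory and uses made-up array sizes). Note that it splits along the k axis of the einsum, where each output chunk P[k0:k1, :] depends only on A[:, :, k0:k1, :], so the chunks are independent and no overlapping splits are needed:

import numpy as np
from multiprocessing import Pool, shared_memory

def worker(args):
    # Re-attach to the shared block and compute one slice of the output.
    shm_name, shape, dtype, k0, k1, B = args
    shm = shared_memory.SharedMemory(name=shm_name)
    A = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    out = np.einsum("ijkl,ij->kl", A[:, :, k0:k1, :], B)
    shm.close()
    return k0, out

if __name__ == '__main__':
    A = np.random.rand(8, 8, 120, 64)   # hypothetical sizes
    B = np.random.rand(8, 8)
    shm = shared_memory.SharedMemory(create=True, size=A.nbytes)
    A_shared = np.ndarray(A.shape, dtype=A.dtype, buffer=shm.buf)
    A_shared[:] = A                     # copy once into shared memory
    jobs = [(shm.name, A.shape, A.dtype, k, k + 30, B)
            for k in range(0, 120, 30)]
    P = np.empty((120, 64))
    with Pool(4) as pool:
        for k0, chunk in pool.map(worker, jobs):
            P[k0:k0 + len(chunk)] = chunk
    shm.close()
    shm.unlink()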
I have some code that was originally written in C (by someone else) using C-style malloc arrays. I later converted a lot of it to C++ style, using vector<vector<vector<complex>>> arrays for consistency with the rest of my project. I never timed it, but both methods seemed to be of similar speed.
I recently started a new project in Python, and I wanted to use some of this old code. Not wanting to move data back and forth between projects, I decided to port the old code into Python so that it's all in one project. I naively typed up all of the code in Python syntax, replacing any arrays in the old code with numpy arrays (initialising them like array = np.zeros(list((1024, 1024)), dtype=complex)). The code works fine, but it is excruciatingly slow. If I had to guess, I would say it's on the order of 1000 times slower.
Now having looked into it, I see that a lot of people say numpy is very slow for element-wise operations. While I have used some of the numpy functions for common mathematical operations, such as FFTs and matrix multiplication, most of my code involves nested for loops. A lot of it is pretty complicated and doesn't seem to me to be amenable to reducing to simple array operations that are faster in numpy.
So, I'm wondering if there is an alternative to numpy that is faster for these kinds of calculations. The ideal scenario would be a module I can import that has a lot of the same functionality, so I don't have to rewrite much of my code (i.e., something that can do FFTs and initialise arrays in the same way, etc.), but failing that, I would be happy with something that I could at least use for the more computationally demanding parts of the code, casting back and forth between the numpy arrays as needed.
CPython arrays sounded promising, but a lot of the benchmarks I've seen don't show enough of a difference in speed for my purposes. To give an idea of the kind of thing I'm talking about, this is one of the methods that is slowing down my code. It is called millions of times, and the vz_at() method contains a lookup table and does some interpolation to give the final return value:
def tra(self, tr, x, y, z_number, i, scalex, idx, rmax2, rminsq):
    M = 1024
    ixo = int(x[i] / scalex)
    iyo = int(y[i] / scalex)
    nx1 = ixo - idx
    nx2 = ixo + idx
    ny1 = iyo - idx
    ny2 = iyo + idx
    for ix in range(nx1, nx2 + 1):
        rx2 = x[i] - float(ix) * scalex
        rx2 = rx2 * rx2
        # wrap ix periodically into [0, M)
        ixw = ix
        while ixw < 0:
            ixw = ixw + M
        ixw = ixw % M
        for iy in range(ny1, ny2 + 1):
            rsq = y[i] - float(iy) * scalex
            rsq = rx2 + rsq * rsq
            if rsq <= rmax2:
                # wrap iy periodically into [0, M)
                iyw = iy
                while iyw < 0:
                    iyw = iyw + M
                iyw = iyw % M
                if rsq < rminsq:
                    rsq = rminsq
                vz = P.vz_at(z_number[i], rsq)
                tr[ixw, iyw] += vz
All up, there are a couple of thousand lines of code; this is just a small snippet to give an example. To be clear, a lot of my arrays are 1024x1024x1024 or 1024x1024 and complex-valued. Others are one-dimensional arrays on the order of a million elements. What's the best way to speed these element-wise operations up?
For information, some of your code can be made more concise and thus a bit more readable. For instance:
array = np.zeros(list((1024, 1024)), dtype=complex)
can be written
array = np.zeros((1024, 1024), dtype=complex)
As you are trying out Python, this is at least a nice benefit :-)
Now, for your problem there are several solutions in the current Python scientific landscape:
Numba is a just-in-time compiler for Python that is dedicated to array processing, achieving good performance when NumPy hits its limits.
Pros: little to no modification of your code, as you just write plain Python; good performance in many situations. Numba should recognize some NumPy operations, avoiding a Numba->Python->NumPy round trip.
Cons: can be tedious to install, and hence to distribute Numba-based code.
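To give a flavour of the approach, here is a toy sketch of mine (not the poster's kernel; it assumes the current numba.njit API rather than the older autojit seen further below):

import numpy as np
from numba import njit

@njit
def wrap_add(tr, M):
    # Nested element-wise loops, exactly the pattern that is slow in
    # plain Python; Numba compiles them to machine code.
    for ix in range(tr.shape[0]):
        for iy in range(tr.shape[1]):
            tr[ix, iy] += (ix * iy) % M

tr = np.zeros((1024, 1024))
wrap_add(tr, 7)   # the first call triggers compilation; later calls are fast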
Cython is a mix of Python and C to generate compiled functions. You can start from a pure Python file and accelerate the code via type annotations and the use of some "C"-isms.
Pros: stable, widely used, relatively easy to distribute Cython-based code.
Cons: you need to rewrite the performance-critical code, even if only in part.
As an additional hint, Nicolas Rougier (a French scientist) wrote an online book on many situations where you can make use of NumPy to speed up Python code: http://www.labri.fr/perso/nrougier/from-python-to-numpy/
I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
from numpy import array, inf

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
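To make that dependency structure concrete: all cells on one antidiagonal (i + j = s) depend only on earlier antidiagonals, so in principle they can be updated together. A sketch (illustrative only, not from the original post; it skips the border cells that the try/except above papers over):

import numpy as np

def optimalcost_diag(cost, P1=10):
    # Sweep antidiagonals: every (i, j) with i + j = s is independent of
    # the others on the same diagonal, so each sweep is one vector update.
    M = np.array(cost, dtype=float)
    n, m = M.shape
    for s in range(2, n + m - 1):
        i = np.arange(max(1, s - (m - 1)), min(n - 1, s - 1) + 1)
        j = s - i
        M[i, j] += np.minimum(M[i-1, j-1],
                              np.minimum(M[i-1, j] + P1, M[i, j-1] + P1))
    return M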
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10
#define min(x, y) ((x) < (y) ? (x) : (y))

void optimalcost(double **costin, double **costout){
    /*
    We assume that costout is initially
    filled with costin's values.
    (Rough sketch: boundary handling at i == 0 / j == 0 is omitted.)
    */
    int i, j;
    double a, b, c, prevcost = 0.0;
    for(i = 0; i < 400; i++){
        for(j = 0; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += min(a, min(b, c));
            prevcost = costout[i][j];
        }
    }
}
Update:
I'm on Mac, and I didn't want to install a whole new Python toolchain, so I used Homebrew:
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height, width = left.shape
    cost = np.zeros((height, width, width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row, x, y] = abs(left[row, x] - right[row, y])
    return cost

@autojit
def optimalcosts(initcost):
    costs = np.zeros_like(initcost)
    for row in range(len(initcost)):
        costs[row, :, :] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            M[i, j] += min(M[i-1, j-1], prevcost + P1, M[i, j-1] + P1)
            prevcost = M[i, j]
    return M

prob_size = 400
left = np.random.rand(prob_size, prob_size)
right = np.random.rand(prob_size, prob_size)

print '---------- Numba Time ----------'
t = time.time()
c = cost(left, right)
optimalcost(c[100])
print time.time() - t

print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left, right)
optimalcost.py_func(c[100])
print time.time() - t
It's interesting writing code in Python that is so un-Pythonic. Note for anyone interested in writing Numba code: you need to express the loops explicitly. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest path (or lowest-energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (and also try to avoid making copies):
Suppose we are finding a shortest path in the "row" direction (i.e. horizontally). We can first create the algorithm input:
# The problem: 300 400x400 matrices
# Create an infinitely high boundary so that we don't need to handle index "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:, :, ::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and time it:
%%timeit
for i in np.arange(a.shape[1]-1):
A[i].min(2, T)
B[i] += T
The timing result on my (super old) laptop is 1.78s, which is already way faster than 3 minutes. I believe you can improve even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis); a sketch follows.
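A minimal sketch of the multiprocessing.Pool route (my own illustration; it reuses the plain-Python dynamic program from the question, and Pool merely divides the batch of matrices among worker processes):

import numpy as np
from multiprocessing import Pool

def optimalcost(cost, P1=10):
    # The sequential dynamic program, one 400x400 matrix at a time.
    M = np.array(cost, dtype=float)
    for i in range(1, M.shape[0]):
        for j in range(1, M.shape[1]):
            M[i, j] += min(M[i-1, j-1], M[i-1, j] + P1, M[i, j-1] + P1)
    return M

if __name__ == '__main__':
    batch = np.random.rand(300, 400, 400).astype('f')
    with Pool() as pool:
        # Each worker handles whole matrices; results come back in order.
        results = pool.map(optimalcost, batch)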
I have a performance problem when initializing a dictionary of 4-D numpy tensors.
I have a list of coefficient names:
cnames = ['CN', 'CM', 'CA', 'CY', 'CLN' ...];
which is not of fixed size (it depends on upstream code).
For each coefficient I have to generate a 4-D tensor [nalpha x nmach x nbeta x nalt] of zeros (for preallocation purposes), so I do:
#Number of coefficients
numofc = len(cnames);
final_data = {};
#I have to generate <numofc> 4D matrixes
for i in range(numofc):
    final_data[cnames[i]]=n.zeros((nalpha,nmach,nbeta,nalt));
Each index is an integer between 10 and 30 (edit: between 100 and 200).
This takes about 4 minutes. How can I speed this up? Or am I doing something wrong?
The code you posted should not take 4 minutes to run (unless cnames is extremely large, or you have very little RAM and are forced to use swap space).
import numpy as np
cnames = ['CN', 'CM', 'CA', 'CY', 'CLN']*1000
nalpha,nmach,nbeta,nalt = 10,20,30,40
#Number of coefficients
numofc = len(cnames)
final_data = {}
#I have to generate <numofc> 4D matrixes
for i in range(numofc):
    final_data[cnames[i]] = np.zeros((nalpha,nmach,nbeta,nalt))
Even if cnames has 5000 elements, it should still take only on the order of a couple of seconds:
% time test.py
real 0m4.559s
user 0m0.856s
sys 0m3.328s
The semicolons at the ends of statements suggest you have experience in some other language. Be careful about translating commands from that language line by line into NumPy/Python. Coding in NumPy as one would in C is a recipe for slowness.
In particular, try to avoid updating elements in an array element by element. This works fine in C, but it is very slow in Python. NumPy achieves speed by delegating to functions coded in Fortran, Cython, C, or C++. When you update arrays element by element, you are using Python loops, which are not as fast.
Instead, try to rephrase your computation in terms of operations on whole arrays (or at least, slices of arrays).
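As a toy illustration of the difference (my example, not the poster's code):

import numpy as np

a = np.random.rand(1000, 1000)

# Slow: updating elements one at a time from Python
b = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        b[i, j] = 2.0 * a[i, j] + 1.0

# Fast: one whole-array expression, evaluated in compiled NumPy code
b = 2.0 * a + 1.0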
I have probably speculated too much on the cause of the problem. You need to profile your code, and then, if you want more specific help, post the result of the profile plus the problematic code (most helpfully in the form of an SSCCE).
I am trying to get to the bottom of why one of my Python scripts is slower than gfortran by a factor of about 4, and I have got it down to this:
import numpy as np
nvar_x=40
nvar_y=10
def fn_tst(x):
    for i in range(int(1e7)):
        y=np.repeat(x,1+nvar_y)
    return y
x = np.arange(40)
y = fn_tst(x)
print y.min(),y.max()
This is about 13 times slower than the following Fortran code:
module test
  integer,parameter::nvar_x=40,nvar_y=10
contains
  subroutine fn_tst(x,y)
    real,dimension(nvar_x)::x
    real,dimension(nvar_x*(1+nvar_y))::y
    do i = 1,10000000
      do k = 1,nvar_x
        y(k)=x(k)
        ibeg=nvar_x+(k-1)*nvar_y+1
        iend=ibeg+nvar_y-1
        y(ibeg:iend)=x(k)
      enddo
    enddo
  end subroutine fn_tst
end module test

program tst_cp
  use test
  real,dimension(nvar_x)::x
  real,dimension(nvar_x*(1+nvar_y))::y
  do k = 1,nvar_x
    x(k)=k-1
  enddo
  call fn_tst(x,y)
  print *,minval(y),maxval(y)
  stop
end
Can you please suggest ways to speed up the Python script? Other pointers to good performance with numpy would also be appreciated. I'd rather stick with Python than build Python wrappers for Fortran routines.
Thanks
@isedev, so is this it: 1.2s gfortran vs. 6.3s for Python? This is the first time I've worried about performance, but as I said, I could only get to about a fourth of gfortran's speed with Python in the code I was trying to speed up.
And right, sorry, the codes were not doing the same thing. Indeed, what you indicate in the loop is more like what I have in the original code.
Unless I'm missing something, I do not agree with the last statement: I have to create y in fn_tst, and np.repeat is just one of the terms on the RHS, so I cannot place the output directly in an existing array. If I comment out the np.repeat term, things are fast...
rhs_slow = rhs[:J]
rhs_fast = rhs[J:]
rhs_fast[:] = c* ( b*in2[3:-1] * ( in2[1:-3] - in2[4:] ) - fast) + hc_ovr_b * np.repeat(slow,K) #slow
For a start, the Python code doesn't generate the same output as the Fortran code. In the Fortran program, y is the sequence 0 to 39, followed by ten 0's, ten 1's, ..., all the way to ten 39's. The Python code outputs eleven 0's, eleven 1's, all the way to eleven 39's.
This code produces the same output and performs a similar number of memory allocations as your original code:
import numpy as np

nvar_x = 40
nvar_y = 10

def fn_tst(x):
    for i in range(10000000):
        y = np.empty(nvar_x*(1+nvar_y))
        y[0:nvar_x] = x[0:nvar_x]
        y[nvar_x:] = np.repeat(x, nvar_y)
    return y

x = np.arange(40)
y = fn_tst(x)
print y.min(), y.max()
On my system (with 1,000,000 loops only), the Fortran code runs in 1.2s and the above Python in 8.6s.
However, this is not a fair comparison: with the Fortran code, y is allocated once (outside the fn_tst routine), while with the Python code, y is allocated within the fn_tst function.
So, rewriting the Python code as follows provides a better comparison:
import numpy as np

nvar_x = 40
nvar_y = 10

def fn_tst(x, y):
    for i in range(10000000):
        y[0:nvar_x] = x[0:nvar_x]
        y[nvar_x:] = np.repeat(x, nvar_y)
    return y

x = np.arange(40)
y = np.empty(nvar_x*(1+nvar_y))
fn_tst(x, y)
print y.min(), y.max()
On my system, the above runs in 6.3s (again, 1,000,000 iterations). So already approx. 25% faster.
The main performance hit in this case, though, is that numpy.repeat() generates a new array which then needs to be copied back into y. Things would be much faster if numpy.repeat() could be instructed to place its output directly into an existing array (i.e. y in this case)... but that doesn't appear to be possible.
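One possible workaround (my suggestion, not part of the answer above): broadcast x into a reshaped view of y, which writes the repeated values in place without the intermediate array that np.repeat() allocates:

import numpy as np

nvar_x, nvar_y = 40, 10
x = np.arange(nvar_x, dtype=float)
y = np.empty(nvar_x * (1 + nvar_y))

y[:nvar_x] = x
# The tail of y, viewed as (nvar_x, nvar_y); assigning the broadcast
# column vector fills it with the same layout as np.repeat(x, nvar_y).
y[nvar_x:].reshape(nvar_x, nvar_y)[:] = x[:, None]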