I am looking for a more memory-efficient version of the following script. It works well for smaller dimensions, but the matrix is 5000x5000 in my actual calculation, so it takes a very long time to finish. Can anyone help me with that?
def check(v1, v2):
    if len(v1) != len(v2):
        raise ValueError("the length of both arrays must be the same")
    pass

def d0(v1, v2):
    check(v1, v2)
    return dot(v1, v2)

import numpy as np
from pylab import *

vector = [[0.1, .32, .2, 0.4, 0.8], [.23, .18, .56, .61, .12], [.9, .3, .6, .5, .3], [.34, .75, .91, .19, .21]]
rav = np.mean(vector, axis=0)
#print rav
#print vector
m = vector - rav
corr_matrix = []
for i in range(0, len(vector)):
    tmp = []
    x = sqrt(d0(m[i], m[i]))
    for j in range(0, len(vector)):
        y = sqrt(d0(m[j], m[j]))
        z = d0(m[i], m[j])
        w = z / (x * y)
        tmp.append(w)
    corr_matrix.append(tmp)
print corr_matrix
Make your matrix (and your vector) into numpy arrays instead of Python lists. That will make it take much less memory (and also run faster).
To understand why:
A Python list is a list of Python object instances. Each one of these has type information, pointers, and all kinds of other stuff to keep around beyond just the 8-byte number. Let's say each one ends up being 64 bytes instead of 8. So, that's 64 bytes per element, times 25M elements, equals 1600M bytes!
By contrast, a numpy array is a list of just the raw values, together with a single copy of all that extra information (in the dtype). So, instead of 64 * 25M bytes, you've got 8 * 25M + 64 bytes, which is only 1/8th the size.
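As a rough illustration (exact numbers vary with the Python build and platform), you can compare the two representations yourself:

import sys
import numpy as np

n = 5000
a = np.zeros((n, n))            # one contiguous block of float64 values
print(a.nbytes)                 # 200000000 bytes, i.e. about 200 MB in total

row = [0.0] * n
print(sys.getsizeof(row) * n)   # roughly 200 MB just for the list-of-lists pointer storage;
                                # the Python float objects the pointers refer to come on top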
As for the speed increase: If you iterate over a 5000x5000 matrix, you're calling some code in the inner loop 25M times. If you're doing a numpy expression like m + m, the code inside the loop is a few lines of C code that get compiled down to a few dozen machine-code operations, which is blazingly fast. If you're doing the loop explicitly in Python, the inside of that loop has to drive the Python interpreter every time through the loop, which is much, much slower. (On top of that, the C compiler will optimize the code, and numpy may have some explicit optimizations too.) Depending on how trivial the work inside the loop is, the speedup can be anywhere from 2x to 10000x. So, even if you have to make things a bit convoluted, try to find a way to express each step as an array broadcast rather than a loop, and it will be much faster.
So, how do you do that? Simple. Instead of this:
corr_matrix = []
for i in range(len(vector)):
    tmp = []
    # …
    for j in range(len(vector)):
        # …
        tmp.append(w)
    corr_matrix.append(tmp)
Do this:
corr_matrix = np.zeros((len(vector), len(vector)))
for i in range(len(vector)):
    # …
    for j in range(len(vector)):
        # …
        corr_matrix[i, j] = w
That immediately eliminates all the memory problems caused by the overhead of keeping around 25M Python float objects, and will give you a significant speed boost too. You can't reduce the memory any further except by not keeping the whole array in memory at once, but you should be fine already. (You can boost the speed even more by using broadcast operations in place of loops, but if the memory is your problem, and the performance is fine, it may not be necessary.)
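For completeness, here is a sketch of the fully broadcast version of this particular computation; it reproduces the same centering and normalization as the original loops, and the 5000x5000 float64 result is only about 200 MB:

import numpy as np

vector = np.asarray(vector, dtype=float)   # shape (n_rows, n_cols)
m = vector - vector.mean(axis=0)           # subtract the column means, as before
norms = np.sqrt((m * m).sum(axis=1))       # Euclidean norm of each centered row
# corr_matrix[i, j] = dot(m[i], m[j]) / (norm(m[i]) * norm(m[j])), all pairs at once
corr_matrix = m.dot(m.T) / np.outer(norms, norms)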
Related
I have an image stored as a bytestring b'' and am performing per-pixel operations. Right now the fastest way I've found is to use the struct module to pack and unpack the bytes during modification, then save the pixels to a bytearray:
# retrieve image data. Stored as bytestring
pixels = buff.get(rect, 1.0, "CIE LCH(ab) alpha double",
                  Gegl.AbyssPolicy.CLAMP)
# iterator split into 32-byte chunks for each pixel's 8-byte LCHA channels
pixels_iter = (pixels[x:x + 32] for x in range(0, len(pixels), 32))
new_pixels = bytearray()
# when using `pool.map`, the loop was placed in its own function.
for pixel in pixels_iter:
    l, c, h, a = struct.unpack('dddd', pixel)
    # simple operation for now: lower chroma if bright and saturated
    c = c - (l * c) / 100
    new_pixels += struct.pack('dddd', l, c, h, a)
# save new data. everything hereout handled by GEGL instead of myself.
shadow.set(rect, "CIE LCH(ab) alpha double", bytes(new_pixels))
Problem is this takes about 3 1/2 seconds for a 7MP image on my workstation. Fair but not ideal if updates are frequently requested. From what I've gathered, it seems the constant array modification and possibly struct [un]packing are the main culprits. I've refactored this probably a dozen times and I think I'm out of ideas for optimizing this.
I've tried:
struct.unpacking the whole bytestring once instead of each pixel as-needed. Lost about 20% efficiency.
collections.deque. Admittedly I'm not familiar with its technicalities. Lost 10-30% depending on implementation.
Similar results with other iterator helpers like map/join.
numpy.array. Also, I admittedly know basically nothing about general numpy. Similar results to deque.
multiprocessing seemed to be bottlenecked when I appended the pool.map results to new_pixels. Actually lost about 10% which seems wild, as usually I can just lazily throw threads at problems. The pixels_iter was grouped again into equally sized sublists for each thread, so new_pixels concatenated 8 large lists instead of a few million small lists, which I thought would be faster. Tempted to retry this one as I might've botched it somehow with my 4 am implementation.
In theory it could also work by saving multiple small sections of the image buffer to avoid concatenating to new_pixels entirely, but that would vastly increase code complexity elsewhere.
Converting pixels itself into a bytearray and modifying it in-place using slice ranges. Lost ~30% but also halved memory usage.
Completely separate interpreters like PyPy are off the table, as I'm not the one bundling the Python version.
NumPy should produce much faster results than a manual loop, if you use it properly. Using it properly means using NumPy operations over whole arrays, not just looping manually over a NumPy array.
For example,
import numpy

new_pixels = bytearray(pixels)                        # writable copy of the raw bytes
as_numpy = numpy.frombuffer(new_pixels, dtype=float)  # float64 view of that buffer, no copy
as_numpy[1::4] *= 1 - as_numpy[::4] / 100             # c -= l*c/100 for every pixel at once
Now new_pixels contains the adjusted values.
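If the strided slices are hard to read, the same buffer can also be viewed as one row per pixel; this is just a reshaped variant of the snippet above, not a different method:

import numpy

new_pixels = bytearray(pixels)
# one row per pixel, columns are L, C, H, A; reshape returns a view into new_pixels
as_numpy = numpy.frombuffer(new_pixels, dtype=float).reshape(-1, 4)
as_numpy[:, 1] *= 1 - as_numpy[:, 0] / 100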
I am quite new to Python but I've started using it for some data analysis and now I love it. Before, I used C, which I find just terrible for file I/O.
I am working on a script which computes the radial distribution function between a set of N=10000 (ten thousand) points in a 3D box with periodic boundary conditions (PBC). Basically, I have a file of 10000 lines made like this:
0.037827 0.127853 -0.481895
12.056849 -12.100216 1.607448
10.594823 1.937731 -9.527205
-5.333775 -2.345856 -9.217283
-5.772468 -10.625633 13.097802
-5.290887 12.135528 -0.143371
0.250986 7.155687 2.813220
which represents the coordinates of the N points. What I have to do is compute the distance between every pair of points (so I have to consider all 49995000 combinations of 2 elements) and then do some operation on it.
Of course the most taxing part of the program is the loop over the 49995000 combinations.
My current function looks like this:
g = [0 for i in range(Nbins)]
for i in range(Nparticles):
    for j in range(i + 1, Nparticles):
        # compute distance and apply PBC
        dx = coors[i][0] - coors[j][0]
        if dx > halfSide:
            dx -= boxSide
        elif dx < -halfSide:
            dx += boxSide
        dy = coors[i][1] - coors[j][1]
        if dy > halfSide:
            dy -= boxSide
        elif dy < -halfSide:
            dy += boxSide
        dz = coors[i][2] - coors[j][2]
        if dz > halfSide:
            dz -= boxSide
        elif dz < -halfSide:
            dz += boxSide
        d2 = dx**2 + dy**2 + dz**2
        if d2 < (cutoff * boxSide)**2:
            g[int(sqrt(d2) / dr)] += 2
Notice: coors is a nested array, created using loadtxt() on the data file.
I basically recycled a function that I used in another program, written in C.
I am not using itertools.combinations() because I have noticed that the program runs slightly slower if I use it for some reason (one iteration takes around 111 s, while it runs in around 106 s with this implementation).
This function takes around 106 s to run, which is terrible considering that I have to analyze some 500 configuration files.
Now, my question is: is there a general way to make this kind of loop run faster in Python? I guess that the corresponding C code would be faster, but I would like to stick to Python because it is so much easier to use.
I would like to stress that even though I'm certainly looking for a particular solution to my problem, I would like to know how to iterate more efficiently when using Python in general.
PS Please try to explain the code in your answers as much as possible (if you write any) because I'm a newbie in Python.
Update
First, I want to say that I know that, since there is a cutoff, I could write a more efficient algorithm if I divide the box in smaller boxes and only compute the distances in neighboring boxes, but for now I would only like to speed up this particular algorithm.
I also want to say that using Cython (I followed this tutorial) I managed to speed up everything a little bit, as expected (77 s, where before it took 106).
If memory is not an issue (and it probably isn't given that the actual amount of data won't be different from what you are doing now), you can use numpy to do the math for you and put everything into an NxN array (around 800MB at 8 bytes/float).
Given the operations your code is trying to do, I do not think you need any loops outside numpy:
g = numpy.zeros((Nbins,))
coors = numpy.array(coors)

# compute distance and apply PBC
dx = numpy.subtract.outer(coors[:, 0], coors[:, 0])
dx[dx < -halfSide] += boxSide
dx[dx > halfSide] -= boxSide
dy = numpy.subtract.outer(coors[:, 1], coors[:, 1])
dy[dy < -halfSide] += boxSide
dy[dy > halfSide] -= boxSide
dz = numpy.subtract.outer(coors[:, 2], coors[:, 2])
dz[dz < -halfSide] += boxSide
dz[dz > halfSide] -= boxSide
d2 = dx**2 + dy**2 + dz**2

# Avoid counting diagonal elements: inf would do as well as nan
numpy.fill_diagonal(d2, numpy.nan)

# The full NxN matrix holds every pair twice, (i, j) and (j, i), so adding 1 per
# selected entry matches the "+= 2" per unordered pair in the original loop
cond = numpy.sqrt(d2[d2 < (cutoff * boxSide)**2]) / dr
# See http://stackoverflow.com/a/24100418/2988730
numpy.add.at(g, cond.astype(numpy.int_), 1)
The point is to always stay away from looping in Python when dealing with large amounts of data. Python loops are inefficient. They perform lots of book-keeping operations that slow down the math code.
Libraries such as numpy and dynd provide operations that run the loops you want "under the hood", usually bypassing the bookkeeping with a C implementation. The advantage is that you can do the Python-level book-keeping once for each enormous array and then get on with processing the raw numbers instead of dealing with Python wrapper objects (numbers are full-blown objects in Python; there are no primitive types).
In this particular example, I have recast your code to use a sequence of array-level operations that do the same thing that your original code did. This requires restructuring how you think about the steps a little, but should save a lot on runtime.
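If the indexed add feels opaque, the same binning can also be expressed with numpy.histogram. This is only a sketch, reusing d2, cutoff, boxSide, dr and Nbins from the block above and assuming cutoff*boxSide <= Nbins*dr, which the original indexing already implies:

# distances below the cutoff, as in the masked expression above
dists = numpy.sqrt(d2[d2 < (cutoff * boxSide) ** 2])
# Nbins bins of width dr; each unordered pair appears twice in the full NxN
# matrix, so the counts already match the "+= 2" of the original loop
counts, edges = numpy.histogram(dists, bins=Nbins, range=(0.0, Nbins * dr))
g = counts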
You've got an interesting problem there. There are some common guidelines for high-performance Python; these were written with Python 2 in mind but should carry over for the most part to Python 3.
Profile your code. Using %timeit and %%timeit on jupyter/ipython is quick and convenient for interactive sessions, but cProfile and line_profiler are valuable for finding bottlenecks.
This is a link to a short essay that covers the basics from the Python documentation that I found helpful: https://www.python.org/doc/essays/list2str
Numpy is a great package for vectorized operations. Note that numpy arrays are generally slower to work with element-by-element than small to medium-sized lists, but the memory savings are huge. The speed gain over lists is substantial when working with multi-dimensional arrays. Also, if you start observing cache misses and page faults with pure Python, the numpy benefit will be even greater.
I have been using Cython recently with a fair amount of success, where numpy/scipy don't quite cut it with the builtin functions.
Check out scipy, which has a huge library of functions for scientific computing. There's a lot to explore; e.g., the scipy.spatial.distance.pdist function computes all nC2 pairwise distances quickly. In the test run below, pairwise distances for 10k points completed in 375 ms. 100k points will probably break my machine, though, without refactoring.
import numpy as np
from scipy.spatial.distance import pdist

xyz_list = np.random.rand(10000, 3)

In [2]: xyz_list
Out[2]:
array([[ 0.95763306,  0.47458207,  0.24051024],
       [ 0.48429121,  0.12201472,  0.80701931],
       [ 0.26035835,  0.76394588,  0.7832222 ],
       ...,
       [ 0.07835084,  0.8775841 ,  0.20906537],
       [ 0.73248369,  0.60908474,  0.57163023],
       [ 0.68945879,  0.19393467,  0.23771904]])

In [10]: %timeit xyz_pairwise = pdist(xyz_list)
1 loop, best of 3: 375 ms per loop

In [12]: xyz_pairwise = pdist(xyz_list)

In [13]: len(xyz_pairwise)
Out[13]: 49995000
Happy exploring!
I am trying to compute the dot product of two numpy arrays sized respectively (162225, 10000) and (10000, 100). However, if I call numpy.dot(A, B) a MemoryError happens.
I then tried to write my own implementation:
def slower_dot(A, B):
    """Low-memory implementation of dot product"""
    # Assuming A and B are of the right type and size
    R = np.empty([A.shape[0], B.shape[1]])
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            R[i, j] = np.dot(A[i, :], B[:, j])
    return R
and it works just fine, but is of course very slow. Any idea of 1) what is the reason behind this behaviour and 2) how I could circumvent / solve the problem?
I am using Python 3.4.2 (64bit) and Numpy 1.9.1 on a 64bit equipped computer with 16GB of ram running Ubuntu 14.10.
The reason you're getting a memory error is probably because numpy is trying to copy one or both arrays inside the call to dot. For small to medium arrays this is often the most efficient option, but for large arrays you'll need to micro-manage numpy in order to avoid the memory error. Your slower_dot function is slow largely because of the python function call overhead, which you suffer 162225 x 100 times. Here is one common way of dealing with this kind of situation when you want to balance memory and performance limitations.
import numpy as np
def chunking_dot(big_matrix, small_matrix, chunk_size=100):
    # Make a copy if the array is not already contiguous
    small_matrix = np.ascontiguousarray(small_matrix)
    R = np.empty((big_matrix.shape[0], small_matrix.shape[1]))
    for i in range(0, R.shape[0], chunk_size):
        end = i + chunk_size
        R[i:end] = np.dot(big_matrix[i:end], small_matrix)
    return R
You'll want to pick the chunk_size that works best for your specific array sizes. Typically larger chunk sizes will be faster as long as everything fits in memory.
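A small usage sketch (the shapes are scaled down from the question so the check against the one-shot product fits in memory, and chunk_size=1000 is just a starting guess):

import numpy as np

A = np.random.rand(1622, 10000)   # scaled-down stand-in for the (162225, 10000) array
B = np.random.rand(10000, 100)

R = chunking_dot(A, B, chunk_size=1000)
assert np.allclose(R, np.dot(A, B))   # same result as the direct product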
I think the problem starts from the matrix A itself, as a 162225 x 10000 matrix already occupies about 12GB of memory if each element is a double-precision floating point number. That, together with how numpy creates temporary copies to do the dot operation, will cause the error. The extra copies are needed because numpy uses the underlying BLAS operations for dot, which need the matrices to be stored in contiguous C order.
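A quick way to sanity-check both points, the footprint estimate and whether an operand is already laid out in C order, is a sketch like this:

import numpy as np

# rough footprint of the big operand: 162225 * 10000 float64 values
print(162225 * 10000 * 8 / 1024.0 ** 3)   # about 12.1 GiB

# .flags shows whether an array is already stored in contiguous C order
A = np.zeros((1000, 100))                  # small stand-in array
print(A.flags['C_CONTIGUOUS'])             # True for a freshly allocated array
print(A.T.flags['C_CONTIGUOUS'])           # False: the transpose is just a view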
Check out these links if you want more discussion about improving dot performance:
http://wiki.scipy.org/PerformanceTips
Speeding up numpy.dot
https://github.com/numpy/numpy/pull/2730
I'd like to generate a very large 2D array (or, in other terms, a matrix) using a list of lists. Each element should be a float.
So, just to give an example, let's assume to have the following code:
import numpy as np
N = 32000
def largeMat():
    m = []
    for i in range(N):
        l = list(np.ones(N))
        m.append(l)
        if i % 1000 == 0:
            print i
    return m

m = largeMat()
I have 12GB of RAM, but as the code reaches the 10000-th line of the matrix, my RAM is already full. Now, if I'm not wrong, each float is 64 bits (8 bytes) wide, so the total occupied RAM should be:
32000 * 32000 * 8 bytes = 8192 MB
Why does python fill my whole RAM and even start to allocate into swap?
Python does not necessarily store list items in the most compact form, as lists require pointers to the next item, etc. This is a side effect of having a data type which allows deletes, inserts, etc. For a simple two-way linked list the usage would be two pointers plus the value, in a 64-bit machine that would be 24 octets per float item in the list. In practice the implementation is not that stupid, but there is still some overhead.
If you want to have a concise format, I'd suggest using a numpy.array as it will take exactly as many bytes you think it'd take (plus a small overhead).
Edit: Oops. Not necessarily. Explanation wrong, suggestion valid. numpy is the right tool, as numpy.array exists for this reason. However, the problem is most probably something else. My computer will run the procedure even though it takes a lot of time (approx. 2 minutes). Also, quitting python after this takes a long time (actually, it hung). Memory use of the python process (as reported by top) peaks at 10 000 MB and then falls down to slightly below 9 000 MB. Probably the allocated numpy arrays are not garbage collected very fast.
But about the raw data size in my machine:
>>> import sys
>>> l = [0.0] * 1000000
>>> sys.getsizeof(l)
8000072
So there seems to be a fixed overhead of 72 octets per list.
>>> listoflists = [ [1.0*i] * 1000000 for i in range(1000)]
>>> sys.getsizeof(listoflists)
9032
>>> sum([sys.getsizeof(l) for l in listoflists])
8000072000
So, this is as expected.
On the other hand, reserving and filling the long list of lists takes a while (about 10 s). Also, quitting python takes a while. The same for numpy:
>>> a = numpy.empty((1000,1000000))
>>> a[:] = 1.0
>>> a.nbytes
8000000000
(The byte count is not entirely reliable, as the object itself takes some space for its metadata, etc. There has to be the pointer to the start of the memory block, data type, array shape, etc.)
This takes much less time. The creation of the array is almost instantaneous, inserting the numbers takes maybe a second or two. Allocating and freeing a lot of small memory chunks is time consuming and while it does not cause fragmentation problems in a 64-bit machine, it is still much easier to allocate a big chunk of data.
If you have a lot of data which can be put into an array, you need a good reason for not using numpy.
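For reference, here is a sketch of the same matrix built directly as a numpy array (the function name is made up for illustration, and running it really does allocate the full 8192 MB in one block, so shrink N if you only want to try it out):

import numpy as np

N = 32000

def largeMatNumpy():
    # one contiguous float64 block: 32000 * 32000 * 8 bytes = 8192 MB
    return np.ones((N, N))

m = largeMatNumpy()
print(m.nbytes)   # 8192000000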
I am trying to reduce the computation time of my script, which is run with PyPy.
It has to calculate, for a large number of lists/vectors/arrays, the pairwise sums of absolute differences.
The length of the input vectors is quite small, between 10 and 500.
I tested three different approaches so far:
1) Naive approach, input as lists:
def std_sum(v1, v2):
    distance = 0.0
    for (a, b) in izip(v1, v2):
        distance += math.fabs(a - b)
    return distance
2) With lambdas and reduce, input as lists:
lzi = lambda v1, v2: reduce(lambda s, (a, b): s + math.fabs(a - b), izip(v1, v2), 0)

def lmd_sum(v1, v2):
    return lzi(v1, v2)
3) Using numpy, input as numpy.arrays:
def np_sum(v1, v2):
    return np.sum(np.abs(v1 - v2))
On my machine, using PyPy and pairs from itertools.combinations_with_replacement of 500 such lists, the first two approaches are very similar (roughly 5 seconds), while the numpy approach is significantly slower, taking around 12 seconds.
Is there a faster way to do the calculations? The lists are read and parsed from text
files and an increased preprocessing time would be no problem (such as creating numpy arrays).
The lists contain floating point numbers and are of equal size which is known beforehand.
The script I use for "benchmarking" can be found here and some example data here.
PyPy is very good at optimizing list accesses, so you should probably stick to using lists.
One thing that will help PyPy optimize things is to make sure your lists always contain only one type of object. That is, if you read strings from a file, don't put the strings in a list and then parse them into floats in place; rather, create the list with floats, for example by parsing each string as soon as it is read. Likewise, never try to preallocate a list, especially with [None,]*N, or PyPy will not be able to guess that all the elements have the same type.
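For example, a minimal sketch of that kind of parsing, assuming whitespace-separated numbers with one vector per line (the helper and file name are made up for illustration):

def read_vectors(path):
    vectors = []
    with open(path) as handle:
        for line in handle:
            # convert immediately, so the inner lists only ever hold floats
            vectors.append([float(token) for token in line.split()])
    return vectors

vectors = read_vectors("vectors.txt")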
Second, iterate the list as few times as possible. Your np_sum function walks both arrays three times (subtract, abs, sum) unless PyPy notices and can optimize it. Both 1. and 2. walk the list once, so they are faster.