I am writing a Python script to compare two matrices (A and B) and perform a specific range query. Each matrix has three columns and over a million rows. The range query involves finding, for each row of B, the indices of the rows of A whose pair separation distance lies in a given range (e.g. 15 units < pair separation < 20 units). (It doesn't matter which array comes first.) I then have to repeat this for other slices of pair separation, viz. 20 units < pair separation < 25 units, and so on, until a maximum pair separation is reached.
I have read that cKDTree of scipy.spatial can do this job very quickly. I tried using cKDTree followed by query_ball_tree. query_ball_tree returns its output as a list of lists, and it works perfectly fine for matrices having ~100,000 rows (small size). But once the number of rows approaches a million, I start getting memory errors and my code no longer works; obviously, the computer's memory cannot hold such huge arrays. I tried query_ball_point, and that also runs into memory errors.
This looks like a typical big-data problem. Time of execution is extremely important for the nature of my work, so I am searching for ways to make the code faster, but I keep running into memory errors and my code comes to a grinding halt.
I was wondering if someone can tell me the best way to approach such a scenario.
1. Is the use of scipy.spatial's cKDTree with query_ball_point (or query_ball_tree) the best approach for my problem?
2. Can I hope to keep using cKDTree and be successful, with some modification?
3. I tried to see if I could make some headway by modifying the source of scipy.spatial's cKDTree, specifically query_ball_tree or query_ball_point, to prevent it from storing huge arrays. But I am not very experienced with that, so I haven't been successful yet.
4. It's very possible that there are solutions that haven't struck my mind. I will very much appreciate any useful suggestions from the people here.
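For reference, here is a minimal sketch of what I am doing for one separation slice (the small random arrays are placeholders for my real data, which has over a million rows); it is these lists of lists that blow up the memory:

import numpy as np
from scipy.spatial import cKDTree

A = np.random.rand(1000, 3) * 100   # placeholder for my real (N, 3) matrix
B = np.random.rand(1000, 3) * 100
tree_A = cKDTree(A)
tree_B = cKDTree(B)
outer = tree_B.query_ball_tree(tree_A, r=20.0)   # A-indices within 20 units of each B row
inner = tree_B.query_ball_tree(tree_A, r=15.0)   # A-indices within 15 units
# one separation slice: 15 units < pair separation < 20 units
shell = [sorted(set(o) - set(i)) for o, i in zip(outer, inner)]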
Related
I'm trying to find unique combinations of ~70,000 IDs.
I'm currently doing an itertools.combinations([list name], 2) to get unique 2 ID combinations but it's been running for more than 800 minutes.
Is there a faster way to do this?
I tried converting the IDs into a matrix where the IDs are both the index and the columns and populating the matrix using itertools.product.
I tried doing it the manual way with loops too.
But after more than a full day of letting them run, none of my methods have actually finished running.
For additional information, I'm storing these in a data frame, to later run a function that compares each unique pair of IDs.
(70_000 * 69_999) / 2 ≈ 2.45 billion - that is not such a large number as to be uncomputable in a few hours. (Update: I ran a dry run on itertools.combinations(range(70_000), 2) and it took less than 70 seconds, on a 2017-era i7 @ 3 GHz, naively using a single core.) But if you are trying to keep all of this data in memory at once, it won't fit - and if your system is configured to swap memory to disk before erroring with a MemoryError, that may slow the program down by 2 or more orders of magnitude, and that is where your problem comes from.
itertools.combinations does the right thing in this respect, and there is no need to replace it with something else: it will yield one combination at a time. What you do with the result, however, does change things: if you are streaming the combinations to a file and not keeping them in memory, it should be fine, and then it is just computational time, which you can't speed up anyway.
If, on the other hand, you are collecting the combinations to a list or other data structure: there is your problem - don't do it.
Now, going a step further than your question: since these combinations are checkable and predictable, maybe trying to generate them all is not the right approach at all. You don't give details on how they are to be used, but if they are consumed in a reactive or lazy form, you might have an instantaneous workflow instead.
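For what it's worth, a minimal sketch of the streaming variant mentioned above (the file name and ID range are illustrative):

import itertools

ids = range(70_000)                  # stand-in for the real list of IDs
with open("pairs.txt", "w") as f:    # illustrative output file
    for a, b in itertools.combinations(ids, 2):
        f.write(f"{a},{b}\n")        # one pair at a time; nothing accumulates in memory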
Your RAM will run full. You can counter this with gc.collect() or by emptying the results, but the found results have to be saved to disk in between.
You could try something similar to the code below. I would create individual file names or save the results into a database, since the result file will be several GB in size. Additionally, the range of the second loop can probably be halved (start it at i + 1) so that each pair is generated only once.
import gc

new_set = set(range(70000))  # the 70,000 IDs

combined_set = set()
for i in range(len(new_set)):
    print(i)
    if i % 300 == 0:
        # flush the pairs collected so far to disk and free the memory
        with open("results", "a") as f:
            f.write(str(combined_set))
        combined_set = set()
        gc.collect()
    for b in range(len(new_set)):  # range(i + 1, len(new_set)) would avoid duplicate pairs
        combined_set.add((i, b))

# write any remaining pairs after the loop finishes
with open("results", "a") as f:
    f.write(str(combined_set))
I am trying to extract data from a binary table in a FITS file using Python and astropy.io. The table contains an events array with over 2 million events. What I want to do is store the TIME values of certain events in an array, so I can then do analysis on that array. The problem I have is that, whereas in Fortran (using FITSIO) the same operation takes maybe a couple of seconds on a much slower processor, the exact same operation in Python using astropy.io takes several minutes. I would like to know where exactly the bottleneck is, and whether there is a more efficient way to access the individual elements in order to determine whether or not to store each time value in the new array. Here is the code I have so far:
from astropy.io import fits
minenergy=0.3
maxenergy=0.4
xcen=20000
ycen=20000
radius=50
datafile=fits.open('datafile.fits')
events=datafile['EVENTS'].data
datafile.close()
times = []
for i in range(len(events)):
    energy = events['PI'][i]
    if energy < maxenergy*1000:
        if energy > minenergy*1000:
            x = events['X'][i]
            y = events['Y'][i]
            radius2 = (x-xcen)*(x-xcen) + (y-ycen)*(y-ycen)
            if radius2 <= radius*radius:
                times.append(events['TIME'][i])
print(times)
Any help would be appreciated. I am an OK programmer in other languages, but I have not really had to worry about efficiency in Python before. The reason I have chosen to do this in Python now is that I was using Fortran with both FITSIO and PGPLOT, as well as some routines from Numerical Recipes, but the newish Fortran compiler I have on this machine cannot be persuaded to produce a properly working program (there are some issues of 32- vs. 64-bit, etc.). Python seems to have all the functionality I need (FITS I/O, plotting, etc.), but if it takes forever to access individual elements in a list, I will have to find another solution.
Thanks very much.
You need to do this using numpy vector operations. Without special tools like numba, large explicit loops like the one you've written will always be slow in Python because it is an interpreted language. Your program should look more like this:
energy = events['PI'] / 1000.
e_ok = (energy > minenergy) & (energy < maxenergy)
rad2 = (events['X'][e_ok] - xcen)**2 + (events['Y'][e_ok] - ycen)**2
r_ok = rad2 <= radius**2
times = events['TIME'][e_ok][r_ok]
This should have performance comparable to Fortran. You can also filter the entire event table, for instance:
events_filt = events[e_ok][r_ok]
times = events_filt['TIME']
I'm writing a program that creates vario-function plots for a fixed region of a digital elevation model that has been converted to an array. I calculate the variance (difference in elevation) and lag (distance) between point pairs within the window constraints. Every array position is compared with every other array position. For each pair, the lag and variance values are appended to separate lists. Once all pairs have been compared, these lists are then used for data binning, averaging and eventually plotting.
The program runs fine for smaller window sizes (say 60x60 px). For windows up to about 120x120 px or so, which would give two lists of 207,360,000 entries each, I can still get the program to run, albeit slowly. Anything greater than this and I run into "MemoryError" reports - e.g. for a 240x240 px region I would have 3,317,760,000 entries.
At the beginning of the program, I create an empty list:
variance = []
lag = []
Then within a for loop where I calculate my lags and variances, I append the values to the different lists:
variance.append(var_val)
lag.append(lag_val)
I've had a look over the Stack Overflow pages and have seen a similar issue discussed here. That solution would potentially improve the program's runtime, but it only goes up to 100 million entries and therefore doesn't help me with the larger regions (as with the 240x240 px example). I've also considered using numpy arrays to store the values, but I don't think this will stave off the memory issues.
Any suggestions for ways to store lists of the sizes I have described for the larger window sizes would be much appreciated.
I'm new to python so please forgive any ignorance.
The main bulk of the code can be seen here
Use the array module of Python. It offers some list-like types that are more memory efficient (but cannot be used to store random objects, unlike regular lists). For example, you can have arrays containing regular floats ("doubles" in C terms), or even single-precision floats (four bytes each instead of eight, at the cost of a reduced precision). An array of 3 billion such single-floats would fit into 12 GB of memory.
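A minimal sketch of that change (the loop values shown are placeholders for the var_val and lag_val computed in your pair loop):

from array import array

variance = array("f")   # typecode "f": single-precision float, 4 bytes per entry
lag = array("f")
# same append API as a plain list; inside the pair loop this would be:
for var_val, lag_val in [(1.5, 2.0), (0.3, 4.1)]:   # placeholder values
    variance.append(var_val)
    lag.append(lag_val)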
You could look into PyTables, a library wrapping the HDF5 C library that can be used with numpy and pandas.
Essentially PyTables will store your data on disk and transparently load it into memory as needed.
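A minimal sketch of that idea (file, node, and chunk names here are illustrative, not a drop-in for the linked code):

import numpy as np
import tables

h5 = tables.open_file("variogram.h5", mode="w")
lag = h5.create_earray(h5.root, "lag", atom=tables.Float32Atom(), shape=(0,))
variance = h5.create_earray(h5.root, "variance", atom=tables.Float32Atom(), shape=(0,))

# append per-pair values in chunks; the arrays live on disk, not in RAM
lag_chunk = np.zeros(1000, dtype=np.float32)   # placeholder chunk of lag values
var_chunk = np.zeros(1000, dtype=np.float32)   # placeholder chunk of variances
lag.append(lag_chunk)
variance.append(var_chunk)
h5.close()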
Alternatively if you want to stick to pure python, you could use a sqlite3 database to store and manipulate your data - the docs say the size limit for a sqlite database is 140TB, which should be enough for your data.
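A minimal sqlite3 sketch along the same lines (table and file names are illustrative):

import sqlite3

conn = sqlite3.connect("variogram.db")   # data lives on disk, not in RAM
conn.execute("CREATE TABLE IF NOT EXISTS pairs (lag REAL, variance REAL)")
# insert values in batches from the main loop
rows = [(2.0, 1.5), (4.1, 0.3)]          # placeholder (lag, variance) pairs
conn.executemany("INSERT INTO pairs VALUES (?, ?)", rows)
conn.commit()
conn.close()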
Try using the heapq module (import heapq). It uses the heap for storage rather than the stack, allowing you to access the computer's full memory.
I have a numpy script that is currently running quite slowly.
It spends the vast majority of its time performing the following operation inside a loop:
terms=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex,Ey,Ez_av)
res=[np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms]
res=array(res)
Ex[1:Nx-1]=res[1:Nx-1,0]
Ey[1:Nx-1]=res[1:Nx-1,1]
It's the list comprehension that is really slowing this code down.
In this case, Coeff_3 and Coeff_2 are length-1000 lists whose elements are 3x3 numpy matrices, and Ex, Ey, Ez, Curl_x, etc. are all length-1000 numpy arrays.
I realize it might be faster if I did things like setting up a single 3x1000 E vector, but I have to perform a significant amount of averaging of different E vectors between steps, which would make things very unwieldy.
Curiously, however, I perform this operation twice per loop (once for Ex, Ey and once for Ez), and performing the same operation for the Ez's takes almost twice as long:
terms2=zip(Coeff_3,Coeff_2,Curl_x,Curl_y,Curl_z,Ex_av,Ey_av,Ez)
res2=array([np.dot(C2,array([C_x,C_y,C_z]))+np.dot(C3,array([ex,ey,ez])) for (C3,C2,C_x,C_y,C_z,ex,ey,ez) in terms2])
Does anyone have any idea what's happening? Forgive me if it's anything obvious; I'm very new to Python.
As pointed out in previous comments, use array operations. np.hstack(), np.vstack(), np.outer() and np.inner() are useful here. Your code could become something like this (I'm not sure about your dimensions):
Cxyz = np.vstack((Curl_x,Curl_y,Curl_z))
C2xyz = np.dot(C2, Cxyz)
...
Check the shapes of the results to make sure you translated your problem correctly. Sometimes numexpr can also speed up such tasks significantly with little extra effort.
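Going further, here is a sketch of the whole update with the per-element loop removed. The shapes are assumptions based on your description (the coefficient lists stacked into (N, 3, 3) arrays, the field components into (N, 3)); the zero/identity arrays are placeholders for your actual data:

import numpy as np

N = 1000
C3 = np.stack([np.eye(3)] * N)   # placeholder for Coeff_3 stacked to (N, 3, 3)
C2 = np.stack([np.eye(3)] * N)   # placeholder for Coeff_2 stacked to (N, 3, 3)
curl = np.zeros((N, 3))          # columns: Curl_x, Curl_y, Curl_z
E = np.zeros((N, 3))             # columns: Ex, Ey, Ez_av

# batched 3x3 matrix-vector products, one per row, all executed in C:
res = np.einsum("nij,nj->ni", C2, curl) + np.einsum("nij,nj->ni", C3, E)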
I was implementing a weighting system called TF-IDF on a set of 42,000 images, each consisting of 784 pixels. This is basically a 42,000 by 784 matrix.
The first method I attempted made use of explicit loops and took more than 2 hours.
def tfidf(color,img_pix,img_total):
    if img_pix==0:
        return 0
    else:
        return color * np.log(img_total/img_pix)

...

result = np.array([])
for img_vec in data_matrix:
    double_vec = zip(img_vec,img_pix_vec)
    result_row = np.array([tfidf(x[0],x[1],img_total) for x in double_vec])
    try:
        result = np.vstack((result,result_row))
    except ValueError:
        # the first row throws a ValueError since vstack needs rows of the same length
        result = result_row
The second method I attempted used numpy matrices and took less than 5 minutes. Note that data_matrix, img_pix_mat are both 42000 by 784 matrices while img_total is a scalar.
result = data_matrix * np.log(np.divide(img_total,img_pix_mat))
I was hoping someone could explain the immense difference in speed.
The authors of the following paper, entitled "The NumPy array: a structure for efficient numerical computation" (http://arxiv.org/pdf/1102.1523.pdf), state at the top of page 4 that they observe a 500-times speed increase due to vectorized computation. I'm presuming much of the speed increase I'm seeing is due to this. However, I would like to go a step further and ask why numpy vectorized computations are that much faster than standard Python loops?
Also, perhaps you might know of other reasons why the first method is slow. Do try/except structures have high overhead? Or perhaps forming a new np.array in each iteration takes a long time?
Thanks.
This is due to the internal workings of numpy. As far as I know, numpy's core is written in C, so everything you push down into numpy is actually much faster because it runs in a different language.
edit:
Taking out the zip and replacing it with a vstack should be faster too (zip tends to be slow if the arguments are very large, whereas vstack is faster), partly because vstack pushes the work down into numpy (thus into C), while zip stays in Python.
And yes, if I understand correctly - not sure about that - you are executing a try/except block 42k times, which should definitely be bad for the speed.
A quick test:

import numpy

T = numpy.ndarray((5, 10))
for t in T:
    print(t.shape)

results in (10,) each time - iterating over a 2-D array yields its rows.
This means that yes, if your matrices are 42k x 784, you are running the try/except block 42k times, and doing a zip each time as well. I am assuming that this affects the computation times, though I am not certain it is the main cause.
(So each of your 42k iterations takes about 0.17 s. I am quite certain that a try/except block by itself doesn't take 0.17 seconds, but maybe the overhead it causes contributes to it.)
try changing the following:
double_vec = zip(img_vec, img_pix_vec)
result_row = np.array([tfidf(x[0], x[1], img_total) for x in double_vec])
to
result_row = np.array([tfidf(img_vec[i], img_pix_vec[i], img_total) for i in range(len(img_vec))])
That at least gets rid of the zip statement. I'm not sure whether the zip takes your time down by one minute or by nearly two hours, though (I know zip is slow compared to numpy's vstack, but I have no clue whether that alone would account for a two-hour gain).
The difference you're seeing isn't due to anything fancy like SSE vectorization. There are two primary reasons. The first is that NumPy is written in C, and the C implementation doesn't have to go through the tons of runtime method dispatch and exception checking and so on that a Python implementation goes through.
The second reason is that even for Python code, your loop-based implementation is inefficient. You're using vstack in a loop, and every time you call vstack, it has to completely copy all arrays you've passed to it. That adds an extra factor of len(data_matrix) to your asymptotic complexity.
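Even keeping the Python loop, collecting the rows in a plain list and stacking once at the end removes that repeated copying. A minimal sketch, reusing the names from the question's code:

rows = []
for img_vec in data_matrix:
    row = [tfidf(c, p, img_total) for c, p in zip(img_vec, img_pix_vec)]
    rows.append(np.array(row))
result = np.vstack(rows)   # one copy at the end instead of one per iteration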