I am trying to reduce the computation time of my script, which is run with PyPy.
For a large number of lists/vectors/arrays it has to calculate the pairwise sums of absolute differences.
The length of the input vectors is quite small, between 10 and 500.
I tested three different approaches so far:
1) Naive approach, input as lists:
import math
from itertools import izip  # Python 2; on Python 3 the built-in zip is already lazy

def std_sum(v1, v2):
    distance = 0.0
    for (a, b) in izip(v1, v2):
        distance += math.fabs(a - b)
    return distance
2) With lambdas and reduce, input as lists:
lzi = lambda v1, v2: reduce(lambda s, (a, b): s + math.fabs(a - b), izip(v1, v2), 0)

def lmd_sum(v1, v2):
    return lzi(v1, v2)
3) Using numpy, input as numpy.arrays:
import numpy as np

def np_sum(v1, v2):
    return np.sum(np.abs(v1 - v2))
On my machine, using PyPy and pairs from itertools.combinations_with_replacement
of 500 such lists, the first two approaches are very similar (roughly 5 seconds),
while the numpy approach is significantly slower, taking around 12 seconds.
Is there a faster way to do the calculations? The lists are read and parsed from text
files and an increased preprocessing time would be no problem (such as creating numpy arrays).
The lists contain floating point numbers and are of equal size which is known beforehand.
The script I use for "benchmarking" can be found here and some example data here.
PyPy is very good at optimizing list accesses, so you should probably stick to using lists.
One thing that will help PyPy optimize things is to make sure your lists always contain only one type of object. That is, if you read strings from a file, don't put the strings in a list and then parse them into floats in place; rather, build the list with floats from the start, for example by parsing each string as soon as it is read (see the sketch below). Likewise, never try to preallocate a list, especially with [None,]*N, or PyPy will not be able to guess that all the elements have the same type.
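For example, a minimal sketch of that idea, assuming whitespace-separated values (the function name and file layout are assumptions):

def read_vectors(path):
    vectors = []
    with open(path) as fh:
        for line in fh:
            # convert to float immediately, so every list only ever
            # holds floats and PyPy can specialize the list storage
            vectors.append([float(tok) for tok in line.split()])
    return vectors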
Second, iterate the list as few times as possible. Your np_sum function walks both arrays three times (subtract, abs, sum) unless PyPy notices and can optimize it. Both 1. and 2. walk the list once, so they are faster.
Related
I have a list with around 1500 numbers (1 to 1500) and I want to get all the possible permutations of it to do some calculations and choose the smallest result out of all of them.
The problem is that the number of possibilities, as you can figure, is way, way too big, and my computer just freezes while running the code, so I have to force a restart. Also my RAM is 8GB, so it should be big enough (?), or so I hope.
To limit it I can specify a start point, but that won't reduce it much.
It's also a super important thing, but I feel so lost. What do you think I should do to make it run?
Using generators and itertools saves a lot of memory. Generators are just like lists, with the exception that new elements are generated one by one rather than stored in memory. Even if your problem has a better solution, mastering generators will help you save memory in the future.
Note that in Python 3 map is already lazy (it returns an iterator), while in Python 2 you should use itertools.imap instead of map.
import itertools

array = list(range(1, 1501))  # the numbers 1 to 1500, or other numbers
f = sum                       # or whatever function you have

def fmax_on_perms(array, f):
    # generator of all the permutations rather than a static list
    perms = itertools.permutations(array)
    # generator of the values of your function on the different permutations
    fvalues = map(f, perms)
    return max(fvalues)
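As a usage sketch on a deliberately tiny input (the scoring function below is a made-up placeholder; for 1500 elements the number of permutations is still astronomically large, so generators only address the memory problem, not the running time):

# hypothetical order-dependent score
def score(perm):
    return sum(i * x for i, x in enumerate(perm))

small = [3, 1, 4, 1, 5]             # 5! = 120 permutations
print(fmax_on_perms(small, score))  # works the same way for any f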
I have a number of objects (roughly 530,000). These objects are randomly assigned to a set of lists (not actually random but let's assume it is). These lists are indexed consecutively and assigned to a dictionary, called groups, according to their index. I know the total number of objects but I do not know the length of each list ahead of time (which in this particular case happens to vary between 1 and 36000).
Next I have to process each object contained in the lists. In order to speed up this operation I am using MPI to send them to different processes. The naive way to do this is to simply assign each process len(groups)/size (where size contains the number of processes used) lists, assign any possible remainder, have it process the contained objects, return the results and wait. This obviously means, however, that if one process gets, say, a lot of very short lists and another all the very long lists the first process will sit idle most of the time and the performance gain will not be very large.
What would be the most efficient way to assign the lists? One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible. But I am not sure how to best implement this. Does anybody have any suggestions?
One approach I could think of is to try and assign the lists in such a way that the sum of the lengths of the lists assigned to each process is as similar as possible.
Assuming that processing time scales exactly with the sum of list lengths, and your processor capacity is homogeneous, this is in fact what you want. This is called the multiprocessor scheduling problem, which is very close to the bin packing problem, but with a constant number of bins minimizing the maximum capacity.
Generally this is a NP-hard problem, so you will not get a perfect solution. The simplest reasonable approach is to greedily pick the largest chunk of work for the processor that has the smallest work assigned to it yet.
It is trivial to implement this in Python (the example uses a list of lists):
import numpy as np

greedy = [[] for _ in range(nprocs)]
for group in sorted(groups, key=len, reverse=True):
    smallest_index = np.argmin([sum(map(len, assignment)) for assignment in greedy])
    greedy[smallest_index].append(group)
If you have a large number of processors, you may want to optimize the smallest_index computation by using a priority queue (a heapq-based sketch follows after the gist link). This will produce significantly better results than the naive sorted split recommended by Attersson:
(https://gist.github.com/Zulan/cef67fa436acd8edc5e5636482a239f8)
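A minimal sketch of that priority-queue variant using heapq (the function name is made up; groups and nprocs are as in the snippet above):

import heapq

def greedy_heap(groups, nprocs):
    # heap of (total length assigned so far, process index)
    heap = [(0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(nprocs)]
    for group in sorted(groups, key=len, reverse=True):
        total, p = heapq.heappop(heap)        # process with the least work so far
        assignment[p].append(group)
        heapq.heappush(heap, (total + len(group), p))
    return assignment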
On the assumption that a longer list has a larger memory size, your_list has a memory size retrievable by the following code:
import sys
sys.getsizeof(your_list)
(Note: it depends on the Python implementation. Please read How many bytes per element are there in a Python list (tuple)?)
There are several ways you can proceed then. If your original "pipeline" of lists can be sorted by key=sys.getsizeof, you can then slice and assign to process N every Nth element (Pythonic way to return list of every nth item in a larger list).
Example:
sorted_pipeline = [list1,list2,list3,.......]
sorted_pipeline[0::10] # every 10th item, assign to the first sub-process of 10
This will balance loads in a fair manner, while keeping complexity O(NlogN) due to the original sort and then constant (or linear if the lists are copied) to assign the lists.
Illustration (as requested) of splitting 10 elements into 3 groups:
>>> my_list = [0,1,2,3,4,5,6,7,8,9]
>>> my_list[0::3]
[0, 3, 6, 9]
>>> my_list[1::3]
[1, 4, 7]
>>> my_list[2::3]
[2, 5, 8]
And the final solution:
assigned_groups = {}
for i in xrange(size):
    assigned_groups[i] = sorted_pipeline[i::size]
If this is not possible, you can always keep a counter of the total queue size per sub-process pipeline and tweak the probability or selection logic to take that into account (see the sketch below).
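For instance, a rough sketch of that counter idea, assuming the lists arrive one at a time through some iterable called pipeline (a made-up name), with size the number of sub-processes as above:

totals = [0] * size                    # running sum of assigned list lengths
assigned = [[] for _ in range(size)]
for group in pipeline:                 # 'pipeline' stands in for your incoming lists
    target = min(range(size), key=lambda p: totals[p])  # least-loaded sub-process
    assigned[target].append(group)
    totals[target] += len(group)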
I have a user group list UserGroupA=[CustomerA_id1, CustomerA_id2, ...] containing 1000 users and a user group list UserGroupB=[CustomerB_id1, CustomerB_id2, ...] containing 10000 users, and I have a similarity function defined for any two users from UserGroupA and UserGroupB:
Similarity(CustomerA_id(k), CustomerB_id(l)), where k and l are indices for users in Group A and Group B.
My objective is to find the 1000 users from Group B that are most similar to the users in Group A, and I want to use CrossSimilarity (below) to determine that. Is there a more efficient way to do it, especially as the size of Group B increases?
CrossSimilarity = [0] * 10000
for i in range(10000):
    for j in range(1000):
        CrossSimilarity[i] = CrossSimilarity[i] + Similarity(CustomerA_id[j], CustomerB_id[i])
CrossSimilarity.sort()
It really depends on the Similarity function and how much time it takes. I expect it will heavily dominate your runtime, but without a runtime profile, it's hard to say. I have some general advice only:
Have a look at how you calculate Similarity and whether you can improve the process by computing it for all of group A (or B) in one go rather than starting from scratch for each pair.
There are some micro-optimisations you can do: for example, += will be a tiny bit faster, as will caching CustomerB_id[i] in the outer loop. You can likely squeeze some time out of your Similarity function the same way. But I wouldn't expect this time to matter.
If your code is pure Python and CPU-heavy, you could try compiling it with Cython, or running it in PyPy instead of standard Python.
Since what you are doing is basically a pairwise computation between the two lists (UserGroupA and UserGroupB), a more efficient and faster way to perform it in memory could be to use the scikit-learn module, which provides the function:
sklearn.metrics.pairwise.pairwise_distances(X, Y, metric='euclidean')
where obviously X=UserGroupA and Y=UserGroupB, and in the metric field you can use one of sklearn's built-in measures or pass your own.
It will return a distance matrix D such that D_{i, k} is the distance between the ith array from X and the kth array from Y.
Then, to find the top 1000 most similar users, you can simply transform the matrix into a list and sort it.
Maybe it is a little more involved than your solution, but it should be faster :)
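A rough sketch of that idea (the random feature vectors and the Euclidean metric are only stand-ins for the real user data and Similarity function):

import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# stand-in feature vectors; in practice these come from your user data
X = np.random.rand(1000, 20)    # UserGroupA
Y = np.random.rand(10000, 20)   # UserGroupB

D = pairwise_distances(X, Y, metric='euclidean')   # shape (1000, 10000)
# total distance of each GroupB user to all of GroupA, then take the
# 1000 users with the smallest totals (i.e. the most similar overall)
totals = D.sum(axis=0)
top_1000 = np.argsort(totals)[:1000]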
I am quite new to Python but I've started using it for some data analysis and now I love it. Before, I used C, which I find just terrible for file I/O.
I am working on a script which computes the radial distribution function between a set of N=10000 (ten thousand) points in a 3D box with periodic boundary conditions (PBC). Basically, I have a file of 10000 lines made like this:
0.037827 0.127853 -0.481895
12.056849 -12.100216 1.607448
10.594823 1.937731 -9.527205
-5.333775 -2.345856 -9.217283
-5.772468 -10.625633 13.097802
-5.290887 12.135528 -0.143371
0.250986 7.155687 2.813220
which represents the coordinates of the N points. What I have to do is compute the distance between every pair of points (I hence have to consider all 49995000 combinations of 2 elements) and then do some operation on it.
Of course the most taxing part of the program is the loop over the 49995000 combinations.
My current function looks like this:
g = [0 for i in range(Nbins)]
for i in range(Nparticles):
    for j in range(i+1, Nparticles):
        # compute distance and apply PBC
        dx = coors[i][0] - coors[j][0]
        if dx > halfSide:
            dx -= boxSide
        elif dx < -halfSide:
            dx += boxSide
        dy = coors[i][1] - coors[j][1]
        if dy > halfSide:
            dy -= boxSide
        elif dy < -halfSide:
            dy += boxSide
        dz = coors[i][2] - coors[j][2]
        if dz > halfSide:
            dz -= boxSide
        elif dz < -halfSide:
            dz += boxSide
        d2 = dx**2 + dy**2 + dz**2
        if d2 < (cutoff*boxSide)**2:
            g[int(sqrt(d2)/dr)] += 2
Notice: coors is a nested array, created using loadtxt() on the data file.
I basically recycled a function that I used in another program, written in C.
I am not using itertools.combinations() because I have noticed that the program runs slightly slower if I use it, for some reason (one iteration takes around 111 s with it, while it runs in around 106 s with this implementation).
This function takes around 106 s to run, which is terrible considering that I have to analyze some 500 configuration files.
Now, my question is: is there a general way to make this kind of loop run faster using Python? I guess that the corresponding C code would be faster, but I would like to stick to Python because it is so much easier to use.
I would like to stress that even though I'm certainly looking for a particular solution to my problem, I would like to know how to iterate more efficiently when using Python in general.
PS Please try to explain the code in your answers as much as possible (if you write any) because I'm a newbie in Python.
Update
First, I want to say that I know that, since there is a cutoff, I could write a more efficient algorithm if I divide the box in smaller boxes and only compute the distances in neighboring boxes, but for now I would only like to speed up this particular algorithm.
I also want to say that using Cython (I followed this tutorial) I managed to speed up everything a little bit, as expected (77 s, where before it took 106).
If memory is not an issue (and it probably isn't given that the actual amount of data won't be different from what you are doing now), you can use numpy to do the math for you and put everything into an NxN array (around 800MB at 8 bytes/float).
Given the operations your code is trying to do, I do not think you need any loops outside numpy:
import numpy

g = numpy.zeros((Nbins,))
coors = numpy.array(coors)

# compute distance and apply PBC
dx = numpy.subtract.outer(coors[:, 0], coors[:, 0])
dx[dx < -halfSide] += boxSide
dx[dx > halfSide] -= boxSide

dy = numpy.subtract.outer(coors[:, 1], coors[:, 1])
dy[dy < -halfSide] += boxSide
dy[dy > halfSide] -= boxSide

dz = numpy.subtract.outer(coors[:, 2], coors[:, 2])
dz[dz < -halfSide] += boxSide
dz[dz > halfSide] -= boxSide

d2 = dx**2 + dy**2 + dz**2

# Avoid counting diagonal elements: inf would do as well as nan
numpy.fill_diagonal(d2, numpy.nan)

# This has the same length as the number of times the
# if-statement in the loops would have triggered
cond = numpy.sqrt(d2[d2 < (cutoff*boxSide)**2]) / dr

# See http://stackoverflow.com/a/24100418/2988730
numpy.add.at(g, cond.astype(numpy.int_), 2)
The point is to always stay away from looping in Python when dealing with large amounts of data. Python loops are inefficient. They perform lots of book-keeping operations that slow down the math code.
Libraries such as numpy and dynd provide operations that run the loops you want "under the hood", usually bypassing the bookkeeping with a C implementation. The advantage is that you can do the Python-level book-keeping once for each enormous array and then get on with processing the raw numbers instead of dealing with Python wrapper objects (numbers are full-blown objects in Python; there are no primitive types).
In this particular example, I have recast your code to use a sequence of array-level operations that do the same thing that your original code did. This requires restructuring how you think about the steps a little, but should save a lot on runtime.
You've got an interesting problem there. There are some common guidelines for high-performance Python; these are written for Python 2 but should carry over for the most part to Python 3.
Profile your code. Using %timeit and %%timeit on jupyter/ipython is quick and convenient for interactive sessions, but cProfile and line_profiler are valuable for finding bottlenecks.
This a link to a short essay that covers the basics from the Python documentation that I found helpful: https://www.python.org/doc/essays/list2str
Numpy is a great package for vectorized operations. Note that numpy vectors are generally slower to work with than small-medium size lists but the memory savings are huge. Speed gain is substantial over lists when doing multi-dimensional arrays. Also if you start observing cache misses and page-faults with pure Python, then the numpy benefit will be even greater.
I have been using Cython recently with a fair amount of success, where numpy/scipy don't quite cut it with the builtin functions.
Check out scipy which has a huge library of functions for scientific computing. There's a lot to explore; eg., the scipy.spatial.pdist function computes fast nC2 pairwise distances. In the testrun below, 10k items pairwise distances completed in 375ms. 100k items will probably break my machine though without refactoring.
import numpy as np
from scipy.spatial.distance import pdist
xyz_list = np.random.rand(10000, 3)
xyz_list
Out[2]:
array([[ 0.95763306, 0.47458207, 0.24051024],
[ 0.48429121, 0.12201472, 0.80701931],
[ 0.26035835, 0.76394588, 0.7832222 ],
...,
[ 0.07835084, 0.8775841 , 0.20906537],
[ 0.73248369, 0.60908474, 0.57163023],
[ 0.68945879, 0.19393467, 0.23771904]])
In [10]: %timeit xyz_pairwise = pdist(xyz_list)
1 loop, best of 3: 375 ms per loop
In [12]: xyz_pairwise = pdist(xyz_list)
In [13]: len(xyz_pairwise)
Out[13]: 49995000
Happy exploring!
I am new to python and my problem is the following:
I have defined a function func(a,b) that return a value, given two input values.
Now I have my data stored in lists or numpy arrays A, B and would like to use func for every combination. (A and B have over one million entries.)
ATM I use this snippet:
for p in A:
    for k in B:
        value = func(p, k)
This takes really, really a lot of time.
So I was thinking that maybe something like this would work:
C = map(func, zip(A, B))
But this method only works pairwise... Any ideas?
Thanks for any help
First issue
You need to calculate the output of f for many pairs of values. The "standard" way to speed up this kind of loop (calculation) is to make your function f accept (NumPy) arrays as input and do the calculation on the whole array at once (i.e., no looping as seen from Python). Check any NumPy tutorial for an introduction.
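For example, a small sketch of what an array-aware f might look like (the formula is a made-up placeholder):

import numpy as np

def func_vectorized(a, b):
    # made-up placeholder; operates elementwise on whole arrays at once
    return a * b + np.sqrt(np.abs(a - b))

A = np.random.rand(1000)
B = np.random.rand(1000)
values = func_vectorized(A, B)   # 1000 results, no Python-level loop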
Second issue
If A and B have over a million entries each, there are over one trillion combinations. For 64-bit numbers, that means you'll need 7.3 TiB of space just to store the result of your calculation. Do you have enough hard drive space to just store the result?
Third issue
If A and B were much smaller, in your particular case you'd be able to do this:
values = f(*meshgrid(A, B))
meshgrid returns the cartesian product of A and B, so it's simply a way to generate the points that have to be evaluated.
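For instance, on two tiny arrays (a toy illustration, not the million-entry case):

import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([10.0, 20.0])
AA, BB = np.meshgrid(A, B)   # both have shape (2, 3): every combination of A and B
# values = f(AA, BB) would then evaluate f on all six pairs at once,
# provided f works elementwise on arrays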
Summary
You need to use NumPy effectively to avoid Python loops. (Or if all else fails or they can't easily be vectorized, write those loops in a compiled language, for instance by using Cython)
Working with terabytes of data is hard. Do you really need that much data?
Any solution that calls a function f 1e12 times in a loop is bound to be slow, especially in CPython (which is the default Python implementation; if you're not sure which implementation you're using and you're using NumPy, you're almost certainly on CPython too).
I suppose itertools.product does what you need:
from itertools import product
pro = product(A,B)
C = map(lambda x: func(*x), pro)
Since it is a generator, it doesn't require additional memory.
One million times one million is one trillion. Calling f one trillion times will take a while.
Unless you have a way of reducing the number of values to compute, you can't do better than the above.
If you use NumPy, you should definitely look at the np.vectorize function, which is designed for this kind of problem...
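A small sketch of np.vectorize with a placeholder func (keep in mind that np.vectorize is essentially a convenience wrapper around a Python loop, so it mainly helps with broadcasting, not raw speed):

import numpy as np

def func(a, b):                    # placeholder scalar function
    return abs(a - b)

vfunc = np.vectorize(func)
A = np.array([1.0, 2.0, 3.0])
B = np.array([0.5, 2.5])
C = vfunc(A[:, None], B[None, :])  # broadcasting: all len(A) x len(B) combinations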