Minimizing an array and value in Python

I have a vector of floats (coming from an operation on an array) and a float value (which is actually an element of the array, but that's unimportant), and I need to find the smallest float out of them all.
I'd love to be able to find the minimum between them in one line in a 'Pythony' way.
MinVec = N[i,:] + N[:,j]
Answer = min(min(MinVec),N[i,j])
Clearly I'm performing two minimisation calls, and I'd love to be able to replace this with one call. Perhaps I could eliminate the vector MinVec as well.
As an aside, this is for a short program in Dynamic Programming.
TIA.
EDIT: My apologies, I didn't specify I was using numpy. The variable N is an array.

You can append the value, then minimize. I'm not sure what the relative time considerations of the two approaches are, though - I wouldn't necessarily assume this is faster:
Answer = min(np.append(MinVec, N[i, j]))
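If the extra allocation from np.append is a concern, a one-line alternative (just a sketch, still assuming MinVec is a numpy array) is to reduce the vector first and compare the result with the scalar:
# Sketch: avoids building a combined array; MinVec.min() reduces the vector,
# then Python's built-in min compares that scalar with N[i, j].
Answer = min(MinVec.min(), N[i, j])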

This is the same idea as the answer above but without using numpy. Note that list.append returns None, so concatenate first and then take the minimum:
Answer = min(list(MinVec) + [N[i, j]])

Related

the solution of linear combination with integer variables = fixed value

I plan to use Python for the solution of the following task.
There is an equation:
E = n[1]*W[1] + n[2]*W[2] + ... + n[N]*W[N]
The W[i] and E are known, fixed values; the n[i] are integer variables.
I need to find all combinations of n[i] and write them.
How can I do it using numpy in Python?
This looks like a linear Diophantine equation.
There is no support for this in numpy/scipy, and the usual suspect, integer programming (which can be used to solve this), is also not available within scipy!
The general case is NP-hard!
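For small N and bounded n[i], a brute-force enumeration is at least a starting point. A minimal sketch (the values of W, E and the bound n_max below are made-up assumptions for illustration):
from itertools import product

W = [2, 3, 5]   # known coefficients (example values)
E = 13          # known fixed value (example)
n_max = 10      # assumed upper bound on each n[i]

# Enumerate all bounded non-negative combinations and keep the exact hits.
solutions = [n for n in product(range(n_max + 1), repeat=len(W))
             if sum(ni * wi for ni, wi in zip(n, W)) == E]
print(solutions)
This scales exponentially in N, which is exactly the NP-hardness mentioned above.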

for every point in a list, compute the mean distance to all other points

I have a numpy array points of shape [N,2] which contains the (x,y) coordinates of N points. I'd like to compute the mean distance of every point to all other points using an existing function (which we'll call cmp_dist and which I just use as a black box).
First a verbose solution in "normal" python to illustrate what I want to do (written off the top of my head):
mean_dist = []
for i, (x0, y0) in enumerate(points):
    dist = []
    for j, (x1, y1) in enumerate(points):
        if i == j:
            continue
        dist.append(cmp_dist(x0, y0, x1, y1))
    mean_dist.append(np.array(dist).mean())
I already found a "better" solution using list comprehensions (assuming list comprehensions are usually better) which seems to work just fine:
mean_dist = [np.array([cmp_dist(x0, y0, x1, y1) for j, (x1, y1) in enumerate(points) if not i == j]).mean()
             for i, (x0, y0) in enumerate(points)]
However, I'm sure there's a much better solution for this in pure numpy, hopefully some function that allows doing an operation for every element using all other elements.
How can I write this code in pure numpy/scipy?
I tried to find something myself, but this is quite hard to google without knowing how such operations are called (my respective math classes are quite a while back).
Edit: Not a duplicate of Fastest pairwise distance metric in python
The author of that question has a 1D array r and is satisfied with what scipy.spatial.distance.pdist(r, 'cityblock') returns (an array containing the distances between all points). However, pdist returns a flat array, that is, it is not clear which of the distances belong to which point (see my answer).
(Although, as explained in that answer, pdist is what I was ultimately looking for, it doesn't solve the problem as I've specified it in the question.)
Based on #ali_m's comment to the question ("Take a look at scipy.spatial.distance.pdist"), I found a "pure" numpy/scipy solution:
from scipy.spatial.distance import cdist
...
fct = lambda p0,p1: great_circle_distance(p0[0],p0[1],p1[0],p1[1])
mean_dist = np.sort(cdist(points,points,fct))[:,1:].mean(1)
That's definitely an improvement over my list comprehension "solution".
What I don't really like about this, though, is that I have to sort and slice the array to remove the 0.0 values that result from computing the distance between identical points (basically that's my way of removing the diagonal entries of the matrix I get back from cdist).
Note two things about the above solution:
I'm using cdist, not pdist as suggested by #ali_m.
I'm getting back an array of the same size as points, which contains the mean distance from every point to all other points, just as specified in the original question.
pdist unfortunately just returns the pairwise distances in a flat (condensed) array, that is, the values are unlinked from the points they refer to, and that link is necessary for the problem as I've described it in the original question.
However, since in the actual problem at hand I only need the mean over the means of all points (which I did not mention in the question), pdist serves me just fine:
from scipy.spatial.distance import pdist
...
fct = lambda p0,p1: great_circle_distance(p0[0],p0[1],p1[0],p1[1])
mean_dist_overall = pdist(points,fct).mean()
This would for sure be the definitive answer if I had asked for the mean of the means, but I purposely asked for the array of means for all points. Because I think there's still room for improvement in the above cdist solution, I won't accept this as THE answer.
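As a possible refinement of the cdist version above (a sketch, not tested against the original data): squareform can expand the condensed pdist output into the full N x N matrix, and since the diagonal is exactly zero, the per-point mean over all other points falls out of a plain sum without any sorting or slicing:
from scipy.spatial.distance import pdist, squareform

d = squareform(pdist(points, fct))             # full N x N distance matrix, zero diagonal
mean_dist = d.sum(axis=1) / (len(points) - 1)  # mean distance to the other N-1 points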

How to find a float number in a list?

I am trying to create a list of points from a road network. Here, I try to put their coordinates into a list of [x, y] items stored as floats. As a new point from the network is picked, it should be checked against the existing points in the list. If it exists, the same index will be given to the feature of the network; otherwise a new point will be added to the list and the new index will be given to the feature.
I know that a float number will be saved differently from integers, but for exactly the same float numbers, I still cannot use:
if new_point in list_of_points:
    # do something
and I should use:
for point in list_of_points:
    if abs(point.x - new_point.x) < 0.01 and abs(point.y - new_point.y) < 0.01:
        # do something
The points are supposed to be exactly the same, as I snap them using the ArcGIS software, and when I check the coordinates in the software they are exactly the same.
I asked this question because:
1. I think using "in" would make my code tidier and also faster, while using a for loop is a clumsy way of coding this situation.
2. I want to know: does this mean that even exactly the same float numbers are stored differently?
It's never a good idea to check for equality between two floating point numbers. However, there are built-in functions to do a comparison like that. From numpy you can use allclose. For example,
>>> np.allclose( (1.0,2.0), (1.00000001,2.0000001) )
True
This checks if the two array like inputs are element-wise equal within a certain tolerance. You can adjust the relative and absolute tolerances with keyword arguments.
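For instance, to rely on a purely absolute tolerance (a sketch of the rtol/atol keywords; the numbers are made up):
>>> np.allclose((1.0, 2.0), (1.004, 2.0), rtol=0, atol=1e-2)
True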
Any given Python implementation should always store a given floating point number in the same, deterministic, non-random way within itself. I do not believe you can take the same floating point number, input it twice, and have it stored in two different ways. But I'm also reluctant to believe that you're going to get exact duplicates of coordinates out of a geographic program like ArcGIS, especially if the resolution is very small. There are many ways that floating point math can defy your expectations, so you shouldn't ever expect identical floats. And between different machines and different versions, you get even more possibilities for error.
If you're worried about the elegance of your code, you can just create a function to abstract out the for loop.
def coord_in(coord, coord_list):
    for other_coord in coord_list:
        if abs(coord.x - other_coord.x) < 0.00001 and abs(coord.y - other_coord.y) < 0.00001:
            return True
    return False
For a large number of points, numpy will generally be faster (and perhaps more elegant). If you have separated the x and y coords into (float) arrays arrx and arry:
numpy.any((arrx - point.x)**2 + (arry - point.y)**2 < tol**2)
will return True if point is within distance tol of an existing point.
2: exactly the same literal (e.g., "2.3") will be stored as exactly the same float representation for a given platform and data type, but in general it depends on the bit-ness, endian-ness and perhaps the compiler used to build Python.
To be certain when comparing numbers, you should at least round to the precision of the least precise number, or (better) do the kind of thing you are doing here.
>>> 1==1.00000000000000000000000000000000001
True
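A small sketch of the rounding approach mentioned above (the values are illustrative):
>>> round(2.3000000001, 5) == round(2.3, 5)
True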
Old thread, but it helped me develop my own solution using a list comprehension, because of course it's not a good idea to compare two floats using ==. The following returns a list of indices of all elements of the input list that are reasonably close to the value we're looking for.
def findFloats(listOfFloats, value):
    return [i for i, number in enumerate(listOfFloats)
            if abs(number - value) < 0.00001]
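For example (a quick illustrative call):
>>> findFloats([0.1, 0.2, 0.30000000004, 0.4], 0.3)
[2]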

How to Optimize the Python Code

All,
I am computing some feature values using the following Python code. But because the input sizes are big, it is very time-consuming. Please help me optimize the code.
leaving_volume=len([x for x in pickup_ids if x not in dropoff_ids])
arriving_volume=len([x for x in dropoff_ids if x not in pickup_ids])
transition_volume=len([x for x in dropoff_ids if x in pickup_ids])
union_ids=list(set(pickup_ids + dropoff_ids))
busstop_ids=[x for x in union_ids if self.geoitems[x].fare>0]
busstop_density=np.sum([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
busstop_ids=[x for x in union_ids if self.geoitems[x].balance>0]
smartcard_balance=np.sum([self.geoitems[x].balance for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
Hi, all,
Here is my revised version. I ran this code on my GPS trace data. It is faster.
intersect_ids=set(pickup_ids).intersection( set(dropoff_ids) )
union_ids=list(set(pickup_ids + dropoff_ids))
leaving_ids=set(pickup_ids)-intersect_ids
leaving_volume=len(leaving_ids)
arriving_ids=set(dropoff_ids)-intersect_ids
arriving_volume=len(arriving_ids)
transition_volume=len(intersect_ids)
busstop_density = np.mean([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in union_ids if self.geoitems[x].fare > 0])
if not busstop_density > 0:
    busstop_density = 0
smartcard_balance = np.mean([self.geoitems[x].balance for x in union_ids if self.geoitems[x].balance > 0])
if not smartcard_balance > 0:
    smartcard_balance = 0
Many thanks for the help.
Just a few things I noticed, as some Python efficiency trivia:
if x not in dropoff_ids
Checking for membership using the in operator is more efficient on a set than a list. But iterating with for through a list is probably more efficient than on a set. So if you want your first two lines to be as efficient as possible you should have both types of data structure around beforehand.
list(set(pickup_ids + dropoff_ids))
It's more efficient to create your sets before you combine data, rather than creating a long list and constructing a set from it. Luckily you probably already have the set versions around now (see the first comment)!
Above all you need to ask yourself the question:
Is the time I save by constructing extra data structures worth the time it takes to construct them?
Next one:
np.sum([...])
I've been trained by Python to think of constructing a list and then applying a function that theoretically only requires a generator as a code smell. I'm not sure if this applies in numpy, since from what I remember it's not completely straightforward to pull data from a generator and put it in a numpy structure.
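For what it's worth, numpy can consume a generator via np.fromiter, so the intermediate list can be skipped (a small sketch with made-up values):
import numpy as np

# Build the array straight from a generator expression instead of a list.
total = np.fromiter((x * x for x in range(10)), dtype=float).sum()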
It looks like this is just a small fragment of your code. If you're really concerned about efficiency I'd recommend making use of numpy arrays rather than lists, and trying to stick within numpy's built-in data structures and function as much as possible. They are likely more highly optimized for raw data crunching in C than the built-in Python functions.
If you're really, really concerned about efficiency then you should probably be doing this data analysis straight-up in C. Especially if you don't have much more code than what you've presented here it might be pretty easy to translate over.
I can only support what machine yerning wrote in this post. If you are thinking of switching to numpy, your variables pickup_ids and dropoff_ids should be numpy arrays (maybe they already are; otherwise do):
dropoff_ids = np.array( dropoff_ids, dtype='i' )
pickup_ids = np.array( pickup_ids, dtype='i' )
then you can make use of the function np.in1d(), which will give you a True/False array that you can just sum over to get the total number of True entries.
leaving_volume = (~np.in1d(pickup_ids, dropoff_ids)).sum()
transition_volume = np.in1d(dropoff_ids, pickup_ids).sum()
arriving_volume = (~np.in1d(dropoff_ids, pickup_ids)).sum()
somehow I have the feeling that transition_volume = len(dropoff_ids) - arriving_volume, but I'm not 100% sure right now.
Another function that could be useful to you is np.unique() if you want to get rid of duplicate entries which in a way will turn your array into a set.
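A minimal sketch of that (assuming duplicate ids should only be counted once, as in the set-based version above):
pickup_u = np.unique(pickup_ids)     # set-like array of distinct pickup ids
dropoff_u = np.unique(dropoff_ids)
transition_volume = np.in1d(pickup_u, dropoff_u).sum()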

any faster alternative?

cost=0
for i in range(12):
    cost=cost+math.pow(float(float(q[i])-float(w[i])),2)
cost=(math.sqrt(cost))
Is there any faster alternative to this? I need to improve my entire code, so I'm trying to improve each statement's performance.
Thank you.
In addition to the general optimization remarks that are already made (and to which I subscribe), there is a more "optimized" way of doing what you want: you manipulate arrays of values and combine them mathematically. This is a job for the very useful and widely used NumPy package!
Here is how you would do it:
q_array = numpy.array(q, dtype=float)
w_array = numpy.array(w, dtype=float)
cost = math.sqrt(((q_array-w_array)**2).sum())
(If your arrays q and w already contain floats, you can remove the dtype=float.)
This is almost as fast as it can get, since NumPy's operations are optimized for arrays. It is also much more legible than a loop, because it is both simple and short.
Just a hint, but usually real performance improvements come when you evaluate the code at a function or even higher level.
During a good evaluation, you may find whole blocks of code that can be thrown away or rewritten to simplify the process.
Profilers are useful AFTER you've cleaned up crufty, not-very-legible code. Irrespective of whether it's to be run once or N zillion times, you should not write code like that.
Why are you doing float(q[i]) and float(w[i])? What type(s) are the elements of q and w?
If x and y are floats, then x - y will be a float too, so that's 3 apparently redundant occurrences of float() already.
Calling math.pow() instead of using the ** operator bears the overhead of lookups on 'math' and 'pow'.
Etc etc
See if the following code gives the same answers and reads better and is faster:
costsq = 0.0
for i in xrange(12):
    costsq += (q[i] - w[i]) ** 2
cost = math.sqrt(costsq)
After you've tested that and understood why the changes were made, you can apply the lessons to other Python code. Then if you have a lot more array or matrix work to do, consider using numpy.
Assuming q and w contain numbers, the conversions to float are not necessary; otherwise you should convert the lists to a usable representation earlier (and separately from your calculation).
Given that your function seems to only be doing the equivalent of this:
cost = sum( (qi-wi)**2 for qi,wi in zip(q[:12],w) ) ** 0.5
Perhaps this form would execute faster.
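A quick check of the one-liner with illustrative data (the q and w values here are made up):
q = [float(i) for i in range(12)]
w = [float(i) + 0.5 for i in range(12)]
cost = sum((qi - wi) ** 2 for qi, wi in zip(q[:12], w)) ** 0.5
print(cost)   # sqrt(12 * 0.25), roughly 1.732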
