I am trying to create a list of points from a road network. I put their coordinates into a list of [x, y] pairs stored as floats. As each new point is picked from the network, it should be checked against the existing points in the list: if it already exists, the same index is given to the network feature; otherwise the new point is appended to the list and its new index is given to the feature.
I know that a float number is stored differently from an integer, but for float numbers that should be exactly equal, I still cannot use:
if new_point in list_of_points:
    # do something
and I have to use instead:
for point in list_of_points:
    if abs(point.x - new_point.x) < 0.01 and abs(point.y - new_point.y) < 0.01:
        # do something
The points are supposed to be exactly the same, since I snap them in the ArcGIS software, and when I check the coordinates there they are identical.
I am asking this question because:
1- I think using "in" would make my code tidier and also faster, while the for-loop feels like a clumsy way to code this situation.
2- I want to know: does this mean that even exactly equal float numbers can be stored differently?
It's never a good idea to check two floating point numbers for exact equality. However, there are built-in functions to do an approximate comparison like that. From numpy you can use allclose. For example,
>>> np.allclose( (1.0,2.0), (1.00000001,2.0000001) )
True
This checks whether the two array-like inputs are element-wise equal within a certain tolerance. You can adjust the relative and absolute tolerances with the rtol and atol keyword arguments.
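For instance, tightening the tolerances makes the same comparison fail, which is a quick way to see what the defaults (rtol=1e-05, atol=1e-08) are doing:

```python
import numpy as np

# passes with the default tolerances (rtol=1e-05, atol=1e-08)
print(np.allclose([1.0, 2.0], [1.00000001, 2.0000001]))                      # True

# the same inputs fail once the tolerances are tightened
print(np.allclose([1.0, 2.0], [1.00000001, 2.0000001], rtol=1e-12, atol=0))  # False
```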
Any given Python implementation will always store a given floating point number in the same deterministic, non-random way. I do not believe you can take the same floating point number, input it twice, and have it stored in two different ways. But I'm also reluctant to believe that you will get exact duplicates of coordinates out of a geographic program like ArcGIS, especially if the resolution is very fine. There are many ways floating point math can defeat your expectations, so you should never expect identical floats; and across different machines and versions there is even more room for discrepancy.
If you're worried about the elegance of your code, you can just create a function to abstract out the for loop.
def coord_in(coord, coord_list):
    for other_coord in coord_list:
        if abs(coord.x - other_coord.x) < 0.00001 and abs(coord.y - other_coord.y) < 0.00001:
            return True
    return False
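As a side note, Python 3.5+ ships math.isclose in the standard library, so the same helper can be written without numpy. A sketch, with a made-up Coord named tuple standing in for the point objects:

```python
import math
from collections import namedtuple

Coord = namedtuple("Coord", ["x", "y"])  # stand-in for the point objects above

def coord_in(coord, coord_list, tol=1e-5):
    # math.isclose uses a relative tolerance by default;
    # abs_tol makes it behave like the absolute check in the loop above
    return any(
        math.isclose(coord.x, other.x, abs_tol=tol)
        and math.isclose(coord.y, other.y, abs_tol=tol)
        for other in coord_list
    )

pts = [Coord(1.0, 2.0), Coord(3.0, 4.0)]
print(coord_in(Coord(1.000001, 2.0), pts))  # True
print(coord_in(Coord(1.1, 2.0), pts))       # False
```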
For a large number of points, numpy will be faster (and perhaps more elegant). If you have separated the x and y coords into (float) arrays arrx and arry:
numpy.any((arrx-point.x)**2+(arry-point.y)**2<tol**2)
will return True if point is within distance tol of an existing point.
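A runnable sketch of this vectorised test, with made-up coordinate arrays (note that numpy.sometrue is a deprecated alias of numpy.any in recent NumPy versions):

```python
import numpy as np

# hypothetical stored coordinates
arrx = np.array([1.0, 3.0, 5.0])
arry = np.array([2.0, 4.0, 6.0])
tol = 0.01

def point_exists(px, py):
    # squared distance to every stored point, tested in one vectorised pass
    return bool(np.any((arrx - px) ** 2 + (arry - py) ** 2 < tol ** 2))

print(point_exists(3.0001, 4.0001))  # True: within tol of (3.0, 4.0)
print(point_exists(3.1, 4.1))        # False
```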
2: exactly the same literal (e.g., "2.3") will be stored as exactly the same float representation for a given platform and data type, but in general it depends on the bit-ness, endianness, and perhaps the compiler used to build Python.
To be safe when comparing numbers, you should at least round to the precision of the least precise number, or (better) do the kind of tolerance check you are doing here.
>>> 1==1.00000000000000000000000000000000001
True
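To see that the same literal really does parse to the identical representation, you can inspect the bit pattern directly (float.hex and struct are standard library):

```python
import struct

a = 2.3
b = 2.3

# identical literals give bit-for-bit identical doubles on a given platform
print(a.hex() == b.hex())                            # True
print(struct.pack("<d", a) == struct.pack("<d", b))  # True

# ...but arithmetic that "should" give the same value often doesn't
print(0.1 + 0.2 == 0.3)                              # False
```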
Old thread, but it helped me develop my own solution using a list comprehension. Because of course it's not a good idea to compare two floats using ==. The following returns a list of the indices of all elements of the input list that are reasonably close to the value we're looking for.
def findFloats(listOfFloats, value):
    return [i for i, number in enumerate(listOfFloats)
            if abs(number - value) < 0.00001]
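A quick usage example with made-up data:

```python
def findFloats(listOfFloats, value):
    # indices of all elements within 1e-5 of value
    return [i for i, number in enumerate(listOfFloats)
            if abs(number - value) < 0.00001]

vals = [0.1, 0.2, 0.3000000001, 0.3, 0.4]
print(findFloats(vals, 0.3))  # [2, 3]
```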
I wrote a C++ wrapper class around some functions in LAPACK. In order to test the class, I use the Python C extension API, where I call numpy, do the same operations, and compare the results by taking the difference.
For example, for the inverse of a matrix, I generate a random matrix in C++, then pass it as a string (with many, many digits, like 30 digits) to Python's terminal using PyRun_SimpleString, and assign the matrix as numpy.matrix(...,dtype=numpy.double) (or numpy.complex128). Then I use numpy.linalg.inv() to calculate the inverse of the same matrix. Finally, I take the difference between numpy's result and my result, and use numpy.isclose with a specific relative tolerance to see whether the results are close enough.
The problem: when I use C++ floats, the relative tolerance I need for the comparison to pass is about 1e-2! And even with that relative tolerance I get occasional statistical failures (with low probability).
Doubles are fine... I can use 1e-10 and it's statistically safe.
While I know that floats have an intrinsic precision of about 1e-6, I'm wondering why I have to go as low as 1e-2 to get the results to compare, and why it still fails sometimes!
Going down to 1e-2 got me wondering whether I'm thinking about this whole thing the wrong way. Is there something wrong with my approach?
Please ask for more details if you need them.
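For what it's worth, this kind of precision loss is usually conditioning rather than a bug: inverting a matrix amplifies the rounding error of the input by roughly its condition number, so single precision (machine epsilon about 1.2e-7) can lose several more digits. A small numpy sketch of the effect (random data, so the exact figures vary from run to run):

```python
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.standard_normal((50, 50))
a32 = a64.astype(np.float32)  # same matrix, rounded to single precision

inv64 = np.linalg.inv(a64)
inv32 = np.linalg.inv(a32).astype(np.float64)

# norm-wise relative error of the single-precision inverse,
# roughly eps_float32 (~1.2e-7) amplified by the condition number
rel_err = np.linalg.norm(inv32 - inv64) / np.linalg.norm(inv64)
print(f"condition number ~ {np.linalg.cond(a64):.1e}")
print(f"relative error of float32 inverse ~ {rel_err:.1e}")
```

For an ill-conditioned matrix the amplification can easily push the error toward 1e-2, which matches the tolerance you are seeing.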
Update 1: Eric requested an example of the Python calls. Here is an example:
//create my matrices
Matrix<T> mat_d = RandomMatrix<T>(...);
auto mat_d_i = mat_d.getInverse();
//I store everything in the dict 'data'
PyRun_SimpleString(std::string("data={}").c_str());
//original matrix
//mat_d.asString(32,'[',']',',') returns the matrix in the format [[1,2],[3,4]]; the 32 means 32 digits per number
PyRun_SimpleString(std::string("data['a']=np.matrix(" + mat_d.asString(32,'[',']',',') + ",dtype=np.complex128)").c_str());
//pass the inverted matrix to Python
PyRun_SimpleString(std::string("data['b_c']=np.matrix(" + mat_d_i.asString(32,'[',']',',') + ",dtype=np.complex128)").c_str());
//inverse in numpy
PyRun_SimpleString(std::string("data['b_p']=np.linalg.inv(data['a'])").c_str());
//flatten the matrices to make comparing them easier (make them 1-dimensional)
PyRun_SimpleString("data['fb_p']=((data['b_p']).flatten().tolist())[0]");
PyRun_SimpleString("data['fb_c']=((data['b_c']).flatten().tolist())[0]");
//make the comparison. The function compare_floats(f1,f2,t) calls numpy.isclose(f1,f2,rtol=t)
//prec is an integer that takes its value from a template function, where I choose the precision I want based on type
PyRun_SimpleString(std::string("res=list(set([compare_floats(data['fb_p'][i],data['fb_c'][i],1e-"+ std::to_string(prec) +") for i in range(len(data['fb_p']))]))[0]").c_str());
//the set above eliminates repeated True and False. If all results are True, we expect that res=[True], otherwise, the test failed somewhere
PyRun_SimpleString(std::string("res = ((len(res) == 1) and res[0])").c_str());
//Now if res is True, then success
Comments in the code describe the procedure step-by-step.
If I have a variable x that holds a sequence of numbers (floats), how can I calculate the difference between all adjacent numbers (e.g. x[1] - x[0], x[2] - x[1], and so on until the last term)?
Look at what you've written down in your question. The answer is there staring at you.
[x[i+1]-x[i] for i in range(len(x)-1)]
One of the nicest things about Python is that it has declarative features. You can often get what you want just by describing it; you don't always have to spell out the recipe explicitly.
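If you're already using numpy, numpy.diff computes exactly these adjacent differences:

```python
import numpy as np

x = [1.0, 4.0, 9.0, 16.0]

# list-comprehension version
print([x[i + 1] - x[i] for i in range(len(x) - 1)])  # [3.0, 5.0, 7.0]

# numpy equivalent: same values, returned as an array
print(np.diff(x))
```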
I have a vector of floats (coming from an operation on an array) and a float value (which is actually an element of the array, but that's unimportant), and I need to find the smallest float out of them all.
I'd love to be able to find the minimum between them in one line in a 'Pythony' way.
MinVec = N[i,:] + N[:,j]
Answer = min(min(MinVec),N[i,j])
Clearly I'm performing two minimisation calls, and I'd love to be able to replace this with one call. Perhaps I could eliminate the vector MinVec as well.
As an aside, this is for a short program in Dynamic Programming.
TIA.
EDIT: My apologies, I didn't specify I was using numpy. The variable N is an array.
You can append the value, then minimize. I'm not sure what the relative time considerations of the two approaches are, though - I wouldn't necessarily assume this is faster:
Answer = min(np.append(MinVec, N[i, j]))
This is the same thing as the answer above but without using numpy. Note that list.append returns None, so you cannot call min on its result; append first, then take the min:
MinVec.append(N[i, j])
Answer = min(MinVec)
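Since N is a numpy array in the original question, another option avoids the copy that np.append makes: take the array minimum once, then do a single scalar comparison. A sketch with a small made-up N:

```python
import numpy as np

# hypothetical data
N = np.array([[4.0, 2.0, 7.0],
              [1.0, 9.0, 3.0],
              [8.0, 5.0, 6.0]])
i, j = 0, 2

MinVec = N[i, :] + N[:, j]
# one pass over the vector, then one scalar comparison; no copy is made
Answer = min(MinVec.min(), N[i, j])
print(Answer)  # 5.0
```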
Background: My Python program handles relatively large quantities of data, which can be generated in-program or imported. The data is then processed, and during one of these processes the data is deliberately copied, manipulated, cleaned of duplicates and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining this accuracy to at least 14dp is vital. However, mathematical operations can of course introduce slight variations in my floats, such that two values are identical to 14dp but vary ever so slightly at 16dp, meaning the built-in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out I may well be overlooking something simple! I am just interested to see what others come up with :)

Question: What is the most efficient way to remove very-near-duplicates from a potentially very large data set?

My Attempts: I have tried rounding the values themselves to 14dp, but this is of course not satisfactory as it leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'pythonic' as possible. My attempt involves finding the indices of list entries that match to x dp, and then removing one of the matching entries. Thank you in advance for any advice! Please let me know if there's anything you wish to be clarified, or of course if I'm overlooking something very simple (I may be at a point where I'm over-thinking it).

Clarification on 'Duplicates': An example of one of my 'duplicate' entries: 603.73066958946424, 603.73066958946460; the solution would remove one of these values.

Note on decimal.Decimal: This could work if it were guaranteed that all imported data did not already contain near-duplicates (which it often does).
You really want to use NumPy if you're handling large quantities of data. Here's how I would do it :
Import NumPy:
import numpy as np
Generate 8000 high-precision floats (128 bits will be enough for your purposes, but note that I'm converting the 64-bit output of random to 128 just to fake the precision; use your real data here):
a = np.float128(np.random.random((8000,)))
Find the indexes of the unique elements in the rounded array:
_, unique = np.unique(a.round(decimals=14), return_index=True)
And take those indexes from the original (non-rounded) array:
no_duplicates = a[unique]
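Putting the steps together on a tiny made-up array (plain float64 here; 0.1 + 0.2 differs from 0.3 only far past the 14th decimal, so it makes a convenient 'near-duplicate'):

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3, so set() keeps both
a = np.array([0.3, 0.1 + 0.2, 1.5, 2.5])
print(len(set(a.tolist())))  # 4

# round only for the comparison; keep the original full-precision values
_, unique = np.unique(a.round(decimals=14), return_index=True)
no_duplicates = a[unique]
print(len(no_duplicates))  # 3
```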
Why don't you create a dict that maps the 14dp values to the corresponding full 16dp values:
d = collections.defaultdict(list)
for x in l:
    d[round(x, 14)].append(x)
Now if you just want "unique" (by your definition) values, you can do
unique = [v[0] for v in d.values()]
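For instance, with the same kind of near-duplicate data (made up here), the dict groups the two values that agree to 14dp and keeps one full-precision representative of each group:

```python
import collections

l = [0.3, 0.1 + 0.2, 1.5, 2.5]  # 0.1 + 0.2 != 0.3 exactly

d = collections.defaultdict(list)
for x in l:
    d[round(x, 14)].append(x)

unique = [v[0] for v in d.values()]
print(len(unique))  # 3
print(len(l))       # 4: the original list kept both near-duplicates
```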
I have the following code:
import random
rand1 = random.Random()
rand2 = random.Random()
rand1.seed(0)
rand2.seed(0)
rand1.jumpahead(1)
rand2.jumpahead(2)
x = [rand1.random() for _ in range(0,5)]
y = [rand2.random() for _ in range(0,5)]
According to the documentation of the jumpahead() function, I expected x and y to be (pseudo)independent sequences. But the output that I get is:
x: [0.038378463064751012, 0.79353887395667977, 0.13619161852307016, 0.82978789012683285, 0.44296031215986331]
y: [0.98374801970498793, 0.79353887395667977, 0.13619161852307016, 0.82978789012683285, 0.44296031215986331]
If you notice, the 2nd through 5th numbers are the same. This happens every time I run the code.
Am I missing something here?
rand1.seed(0)
rand2.seed(0)
You initialize them with the same seed, so you get the same (non-)randomness. Use some value like the current Unix timestamp to seed them and you will get better values. But note that if you initialize two RNGs with the current time at the same moment, you will of course get the same "random" values from both.
Update: Just noticed the jumpahead() stuff: have a look at How should I use random.jumpahead in Python - it seems to answer your question.
I think there is a bug here; Python's documentation does not make this as clear as it should.
The difference between your two arguments to jumpahead() is 1, which means you are only guaranteed to get 1 unique value (which is what happens). If you want more differing values, you need arguments that are further apart.
EDIT: Further Explanation
Originally, as the name suggests, jumpahead merely jumped ahead in the sequence. It's clear in that case why jumping 1 or 2 places ahead in the sequence would not produce independent results. As it turns out, jumping ahead in most random number generators is inefficient. For that reason, Python only approximates jumping ahead. Because it's only approximate, Python can implement a more efficient algorithm. However, since the method is only "pretending" to jump ahead, passing two similar integers will not result in very different sequences.
To get different sequences you need the integers passed in to be far apart. In particular, if you want to read a million random numbers, you need to separate your jumpahead arguments by a million.
As a final note, if you have two random number generators, you only need to jumpahead on one of them. You can (and should) leave the other in its original state.