I'm trying to determine whether a target float can be constructed by adding values from a list of floats. Since the values are experimental, there is a second list of "error room" values: the target may be constructed from a combination of values from the float list, plus 0 or 1 of each value in the error room list, plus some additional error margin. It may also turn out that the target cannot be constructed from these parameters at all.
The lengths of both lists, the error margin, and the maximum size of a combination are all user defined.
I thought an efficient way to solve this would be to have one function generate every possible combination from the given parameters and store it in an array, then have another function check each target float against those combinations and print out how a matching combination was obtained, if one exists.
For example, given a numlist [132.0423, 162.0528, 176.0321] (3-10 values total),
an errorlist [2.01454, 18.0105546] (0-4 values total),
an error room of 2, and a maximum combination size of 1000,
the float 153.0755519 can be constructed as 132.0423 + 2.01454 + 18.0105546, i.e. numlist[0] + errorlist[0] + errorlist[1], and be within the error room.
I have no idea how to go about solving such a problem. Perhaps using dynamic programming? I was thinking it would be computationally efficient to create the combinations array once via a separate function, then continuously pass it into a comparison function.
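Here is a rough sketch of the brute-force enumeration I have in mind (untested; it assumes each numlist value is used at most once, and the function names are just placeholders):
from itertools import combinations, product

def build_combinations(numlist, errorlist, max_size):
    # map each reachable sum to how it was obtained; later duplicates
    # of the same sum simply overwrite earlier ones in this sketch
    combos = {}
    for r in range(1, len(numlist) + 1):
        for nums in combinations(numlist, r):
            # each errorlist value is added 0 or 1 times
            for mask in product((0, 1), repeat=len(errorlist)):
                errs = tuple(e for e, used in zip(errorlist, mask) if used)
                total = sum(nums) + sum(errs)
                if total <= max_size:
                    combos[total] = (nums, errs)
    return combos

def find_matches(target, combos, error_room):
    # return every stored combination within error_room of the target
    return [(total, parts) for total, parts in combos.items()
            if abs(total - target) <= error_room]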
The background: the large floats are fragment masses from a mass spectrometer output, and my team is attempting to determine which fragments came from our initial protein, which can only fragment into the pieces defined in numlist but can occasionally lose a small functional group (water, alcohol, hydrogen, etc.).
Hello friends!
Summary:
I have an ee.FeatureCollection containing around 8,500 ee.Point objects. I would like to calculate the distance of these points to a given coordinate, let's say (0.0, 0.0).
For this I use the function geopy.distance.distance() (ref: https://geopy.readthedocs.io/en/latest/#module-geopy.distance). As input, the function takes two coordinates in the form of two tuples, each containing two floats.
Problem: To convert the coordinates from an ee.List to floats, I always use the getInfo() function. I know this is a callback and it is very time intensive, but I don't know another way to extract them. Long story short: extracting the data as ee.Number takes less than a second, but extracting it as floats takes more than an hour. Is there any trick to fix this?
Code:
import ee

ee.Initialize()

fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100') # ee.FeatureCollection
list_containing_points = fc_containing_points.toList(fc_containing_points.size()) # ee.List
fc_containing_points_length = fc_containing_points.size() # ee.Number
for index in range(fc_containing_points_length.getInfo()): # need to convert ee.Number to int
    point_tmp = list_containing_points.get(index) # ee.ComputedObject
    point = ee.Feature(point_tmp) # cast ee.ComputedObject to ee.Feature
    coords = point.geometry().coordinates() # ee.List containing 2 ee.Numbers
    # when I run the loop only up to this point, I get all the data
    # I want as ee.Number in under 1 second
    coords_as_tuple_of_floats = (coords.getInfo()[1], coords.getInfo()[0]) # tuple containing 2 floats
    # when I add this last line, the loop takes hours
PS: This is my first question, please be patient with me.
I would use .map instead of your loop. This stays server side until you export the table (or possibly call .getInfo on the whole thing):
fc_containing_points = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')
fc_with_distance = fc_containing_points.map(
    lambda feature: feature.set(
        "distance_to_point",
        feature.distance(ee.Feature(ee.Geometry.Point([0.0, 0.0])))))
# Then export using ee.batch.Export.table.toXXX or call getInfo
(An alternative might be to use ee.Image.paint to convert the target point to an image, then use ee.Image.distance to calculate the distance to the point (as an image), then use reduceRegions over the feature collection with all the points. But 1) you can only calculate distance out to a certain maximum distance, and 2) I don't think it would be any faster.)
To comment on your code: you are probably aware that loops (especially client-side loops) are frowned upon in GEE, primarily for the performance reasons you've run into, but also note that any time you call .getInfo on a server-side object it incurs a performance cost. So this line
coords_as_tuple_of_floats = (coords.getInfo()[1],coords.getInfo()[0])
would take roughly double the time of this:
coords_client = coords.getInfo()
coords_as_tuple_of_floats = (coords_client[1],coords_client[0])
Finally, you could always just export your entire feature collection to a shapefile (using ee.batch.Export.table.... as above) and do all the operations locally using geopy.
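For example, a rough sketch of that route (the task description, file format, and the local geopy usage are my assumptions, not part of the original code):
import ee
import geopy.distance

ee.Initialize()
fc = ee.FeatureCollection('projects/ee-philadamhiwi/assets/Flensburg_100')

# start a one-off export task; description and format are placeholders
task = ee.batch.Export.table.toDrive(
    collection=fc,
    description='flensburg_points',
    fileFormat='SHP')
task.start()

# after downloading the shapefile, read it locally (e.g. with geopandas)
# and compute each distance cheaply; geopy expects (lat, lon) tuples:
# geopy.distance.distance((lat, lon), (0.0, 0.0)).km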
So I want to create functions for the median, mean and mode of a list.
The list must be a user input. How would I go about this? Thanks.
You do not have to create functions for median, mean, and mode, because they are already implemented and can be called directly from the NumPy and SciPy libraries in Python. Implementing these functions yourself would mean "reinventing the wheel", could lead to errors, and would take time. Feel free to use libraries, because in most cases they are tested and safe to use. For example:
import numpy as np
from scipy import stats
mylist = [0,1,2,3,3,4,5,6]
median = np.median(mylist)
mean = np.mean(mylist)
mode = int(stats.mode(mylist)[0])
To get user input you should use input(). See https://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/io.html
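For instance, one way to read a list of numbers from the user (a sketch; the prompt text is arbitrary):
raw = input("Enter numbers separated by spaces: ")
mylist = [float(x) for x in raw.split()]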
If this is supposed to be your homework, I'll give you some hints:
mean: iterate through the list, calculate the sum of the elements, and divide by the element count.
median: first, sort the list elements in increasing order. Then find out whether the list length is even or odd. If odd, return the center element. If even, take the center element and the element next to it and return their average.
mode: first, create a 'helper' list containing the distinct elements of the input list. Then create a function with one parameter: a value whose occurrences in the input list are to be counted. Run this function in a for loop over the distinct elements, saving each result as a tuple of (element value, element count). You should end up with a list of tuples. Finally, select the tuple with the maximum element count and return the corresponding element value.
Please note that these are just fast hints that can be useful in order to create your own implementation based on the right algorithm you prefer. This could be a good exercise to get started with algorithms and data structures, I hope you'll not skip it:) Good luck!
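If you want to check your own version afterwards, here is one way the hints above could translate into code (just a sketch, not the only correct approach):
def mean(lst):
    total = 0
    for x in lst:
        total += x
    return total / len(lst)

def median(lst):
    s = sorted(lst)            # sort in increasing order
    mid = len(s) // 2
    if len(s) % 2 == 1:        # odd length: return the center element
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2   # even length: average the two center elements

def mode(lst):
    # one (count, value) tuple for each distinct element
    counts = [(lst.count(x), x) for x in set(lst)]
    return max(counts)[1]      # the value with the maximum count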
I'm writing a Python 3.4 script that does a large calculation for me. This calculation involves calculating many many binomial coefficients, and using each of them many times in sums and multiplications with other numbers. Each time a bc (binomial coefficient) is needed in the calculation, it checks whether the bc has already been calculated. If so, it returns this already calculated value. Otherwise, it calculates it and stores it for later look-up. Currently, my function bc(n,k), which calculates the bc "n choose k", looks as follows:
bcvalues = {}

def bc(n, k):
    k = min(k, n - k)  # take advantage of symmetry
    if (n, k) in bcvalues:  # check whether value has already been calculated
        return bcvalues[(n, k)]  # if so, return that already calculated value
    if k == 0 or n <= 1:  # base case
        return 1
    result = bc(n - 1, k) + bc(n - 1, k - 1)  # use the Pascal's triangle recurrence
    bcvalues[(n, k)] = result  # store the value for later look-up
    return result
My look-up table is a dictionary with the (n,k) tuple as the key and bc(n,k) as the value. It satisfies all the
Strict requirements
Can be filled / extended to an arbitrary size at runtime (before the calculation runs, I have no idea how many bc's it needs to calculate, but it's a lot of bc's)
The values can be arbitrarily large (either int (the Python 3 one) or the gmpy2 type mpz, I'm not sure yet). This is important as the values can become very very large
It can be indexed by two natural numbers n and k
The bc's for some tuples (n,k) can be skipped (e.g. there may be an entry for (100,50) but no entry for (100,49))
However, I'm not sure whether it is "the" optimal solution (if there is one) in terms of the
Performance requirements (in the order of importance)
Fast look-up / read-out
Low memory-usage (in tests, my dictionary already occupied several GBs; I may eventually rent computing power on large-memory machines)
Fast writing into the look-up table
In very small input size tests that I've just run, the function bc was called 16 million times, and this number is likely to grow a lot for input sizes that I'm actually interested in. Therefore, performance matters.
My current solution (dictionary) has the advantage that at the end of a computation run, I can serialize the look-up table (using pickle), so that when I perform a new run with higher input values, I can unpickle it and have all the bc's at hand that have been calculated in previous runs. This is a strong bonus point:
Bonus point
The look-up table can easily be serialized
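For reference, the round-trip is just (a sketch of what I do now):
import pickle

# at the end of a run
with open('bcvalues.pickle', 'wb') as f:
    pickle.dump(bcvalues, f)

# at the start of the next run
with open('bcvalues.pickle', 'rb') as f:
    bcvalues = pickle.load(f)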
My question
What, besides dictionary, could be a candidate for matching these criteria?
I thought of writing a function that maps tuples (n,k) of the triangle bijectively to natural numbers and then use a list for the look-up table. How promising is this? Other ideas?
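For example, the bijection could be the usual triangular indexing (a sketch; tri_index is a made-up name):
def tri_index(n, k):
    # row n of Pascal's triangle starts at position n*(n+1)//2,
    # and entry k sits at offset k within the row
    return n * (n + 1) // 2 + k

# bcvalues could then be a list indexed by tri_index(n, k),
# with None standing in for skipped entries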
Disclaimer: I maintain gmpy2.
Once you start working with integers longer than 50 to 100 decimal digits, you should be using gmpy2.mpz.
A dictionary seems like the best choice. Indexing a list is slightly faster than a dictionary lookup but the overhead of mapping (n,k) to an index value makes it slower on my system.
There may be a way to decrease the memory usage. You calculate binomial coefficients recursively and save all the intermediate values, so bcvalues will get very large. If you don't need all the binomial coefficients for smaller values of n and k, then you might try using gmpy2.comb to calculate each binomial coefficient directly, without saving all the intermediate values.
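For instance, a minimal sketch of that approach, caching only the values actually requested:
import gmpy2

bcvalues = {}

def bc(n, k):
    k = min(k, n - k)  # take advantage of symmetry
    if (n, k) not in bcvalues:
        bcvalues[(n, k)] = gmpy2.comb(n, k)  # exact arbitrary-precision result
    return bcvalues[(n, k)]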
I am trying to create a list of points from a road network. I put their coordinates into a list of [x, y] items stored as floats. As each new point is picked from the network, it should be checked against the existing points in the list: if it already exists, the same index is assigned to the network feature; otherwise the new point is added to the list and a new index is assigned to the feature.
I know that a float number is stored differently from an integer, but for float numbers that are exactly the same, I still cannot use:
if new_point in list_of_points:
    # do something
and I should use:
for point in list_of_points:
    if abs(point.x - new_point.x) < 0.01 and abs(point.y - new_point.y) < 0.01:
        # do something
The points are supposed to be exactly the same, as I snap them using the ArcGIS software, and when I check the coordinates in the software they are exactly the same.
I asked this question for two reasons:
1- I think using "in" can make my code tidier and also faster, while using a for loop is a clumsy way of coding this situation.
2- I want to know: does this mean that even exactly the same float numbers are stored differently?
It's never a good idea to check for equality between two floating point numbers. However, there are built-in functions to do a comparison like that. From numpy you can use allclose. For example,
>>> np.allclose( (1.0,2.0), (1.00000001,2.0000001) )
True
This checks if the two array like inputs are element-wise equal within a certain tolerance. You can adjust the relative and absolute tolerances with keyword arguments.
Any given Python implementation should always store a given floating point number in the same, deterministic, non-random way within itself. I do not believe you can take the same floating point number, input it twice, and have it stored in two different ways. But I'm also reluctant to believe that you're going to get exact duplicates of coordinates out of a geographic program like ArcGIS, especially if the resolution is very small. There are many ways that floating point math can mess with your expectations, so you shouldn't ever expect identical floats. And between different machines and different versions, you get even more possibilities for error.
If you're worried about the elegance of your code, you can just create a function to abstract out the for loop.
def coord_in(coord, coord_list):
    for other_coord in coord_list:
        if abs(coord.x - other_coord.x) < 0.00001 and abs(coord.y - other_coord.y) < 0.00001:
            return True
    return False
For a large number of points, numpy will always be faster (and perhaps more elegant). If you have separated the x and y coords into (float) arrays arrx and arry:
numpy.any((arrx - point.x)**2 + (arry - point.y)**2 < tol**2)
will return True if point is within distance tol of an existing point.
2: exactly the same literal (e.g., "2.3") will be stored as exactly the same float representation for a given platform and data type, but in general it depends on the bit-ness, endian-ness, and perhaps the compiler used to build Python.
To be certain when comparing numbers, you should at least round to the precision of the least precise number, or (better) do the kind of thing you are doing here.
>>> 1==1.00000000000000000000000000000000001
True
Old thread, but it helped me develop my own solution using a list comprehension. Because of course it's not a good idea to compare two floats using ==. The following returns a list of the indices of all elements of the input list that are reasonably close to the value we're looking for.
def findFloats(listOfFloats, value):
return [i for i, number in enumerate(listOfFloats)
if abs(number-value) < 0.00001]
I have a few functions that return an array of data corresponding to ranges of parameters.
Example: for a 2D array a, the value a_{ij} corresponds to the parameter set (param1_i, param2_j). How do I return the result and keep the parameter-value correspondence?
Calling the function for each and every param1_i, param2_j and returning one value at a time would take ages (it is far more efficient to do it in one go).
Breaking the function into (many) smaller functions would make usage difficult (the point is to get the values for a whole range of parameters; one value on its own is completely useless).
The best I can come up with is to make a new numpy dtype, for example for a 2D array:
tagged2d = np.dtype([('vals', float), ('params', float, (2,))])
so that a['vals'][i,j] contains the values and a['params'][i,j] the corresponding parameters.
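For instance, usage would look something like this (the shapes and values are made up):
import numpy as np

tagged2d = np.dtype([('vals', float), ('params', float, (2,))])
a = np.zeros((3, 4), dtype=tagged2d)
a['params'][1, 2] = (0.1, 0.2)  # the parameter pair behind cell (1, 2)
a['vals'][1, 2] = 42.0          # the value computed for that pair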
Any thoughts? Maybe I should just return 2 arrays, one with values, other with parameter tuples?
I recommend your last suggestion... just return two arrays {'values': a, 'params':params}.
There are a few reasons for this.
Primarily, your other solution (using dtype and recarrays) tangles too many things together. For example, what about quantities derived from a that correspond to the same parameters... do you make a new recarray and a new copy of the parameters for that? Even something as simple as 2*a becoming the salient quantity will require that you make difficult decisions.
Recarrays have limitations and this is so easily solved in other ways that it's not worth accepting those limitations.
If you want an easier interrelation between the returned terms, you could put the items in a class. For example, you could have a method that takes a parameter pair and returns the corresponding result. That way you wouldn't be limited by the recarray, you could still construct whatever convenience relationships you like between the two, and you could easily make backward-compatible changes to the behavior, etc.
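A minimal sketch of that class-based idea (all names here are made up for illustration):
import numpy as np

class TaggedResult:
    def __init__(self, values, param1, param2):
        self.values = values               # shape (len(param1), len(param2))
        self.param1 = np.asarray(param1)   # first parameter axis
        self.param2 = np.asarray(param2)   # second parameter axis

    def at(self, p1, p2):
        # look up the value at the closest stored parameter pair
        i = int(np.argmin(np.abs(self.param1 - p1)))
        j = int(np.argmin(np.abs(self.param2 - p2)))
        return self.values[i, j]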