Efficient lookup table for collection of numpy arrays - python

I would like to know the most efficient way to create a lookup table for floats (and collections of floats) in Python. Since both sets and dicts need their keys to be hashable, I guess I can't use some sort of closeness check for proximity to already-inserted keys, can I? I have seen this answer, and it's not quite what I'm looking for: I don't want to put the burden of creating the right key on the user, and I also need to extend the idea to collections of floats.
For example, given the following code:
>>> import numpy as np
>>> a = {np.array([0.01, 0.005]): 1}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
>>> a = {tuple(np.array([0.01, 0.005])): 1}
>>> tuple(np.array([0.0100000000000001,0.0050002])) in a
False
I would like the last statement to return True. Coming from the C++ world, I would create a std::map and provide a comparison function that checks, with some user-defined tolerance, whether a value has already been added to the data structure. Of course, this question extends naturally to lookup tables of arrays (for example, numpy arrays). So what's the most efficient way to accomplish this?

Since you are interested in 3D points, you could think about using a data structure that is optimized for storing spatial data, such as a KD-tree. This is available in Scipy and allows the lookup of the point closest to a given coordinate. After you have looked up the nearest point, you can check whether it is within some tolerance to decide whether to accept the new point or not.
Usage should be something like this (untested, never used it myself):
from scipy.spatial import KDTree

points = ...  # points is [Nx3]
tree = KDTree(points)
new_point = ...  # array of length 3
distance, nearest_index = tree.query(new_point)
if distance > tolerance:  # no existing point is close enough, so add it
    points = np.vstack((points, new_point))
    tree = KDTree(points)  # regenerate tree from scratch
Note that a KD-tree is efficient for looking up a point in a static collection of points (the cost of a lookup is O(log(N))), but it is not optimized for repeatedly adding new points. The Scipy implementation even lacks a method to add new points, so you have to generate a new tree every time you insert one. Since this operation is probably O(N*log(N)), it might be faster to just do a brute-force calculation of all distances, which costs O(N). Note also that there is an alternative version, cKDTree, which might be implemented in C for speed; the documentation is not really clear on this.
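If insertions dominate, the brute-force alternative could look something like this (a sketch, assuming points is already a non-empty [Nx3] numpy array and add_if_new is just an illustrative name):
import numpy as np

def add_if_new(points, new_point, tolerance):
    # O(N) brute force: Euclidean distance from the new point to every stored point
    distances = np.linalg.norm(points - new_point, axis=1)
    if distances.min() > tolerance:  # no existing point is close enough
        points = np.vstack((points, new_point))
    return points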

Related

How to create list of numbers as user input and then finding mean, median and mode of the provided list

So I want to create functions for the median, mean and mode of a list.
The list must be a user input. How would I go about this? Thanks.
You do not have to create functions for median, mean and mode, because they are already implemented and can be called directly from the Numpy and Scipy libraries in Python. Implementing these functions yourself would mean "reinventing the wheel" and could introduce errors and cost time. Feel free to use libraries; in most cases they are tested and safe to use. For example:
import numpy as np
from scipy import stats

mylist = [0, 1, 2, 3, 3, 4, 5, 6]
median = np.median(mylist)
mean = np.mean(mylist)
mode = int(stats.mode(mylist)[0])  # stats.mode returns (mode values, counts); take the mode value
To get user input you should use input(). See https://anh.cs.luc.edu/python/hands-on/3.1/handsonHtml/io.html
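For example, one simple way to read a whitespace-separated list of numbers (assuming Python 3, where input() returns a string):
mylist = [float(x) for x in input("Enter numbers separated by spaces: ").split()]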
If this is supposed to be homework, I'll give you some hints:
mean: iterate through the list, calculate the sum of the elements and divide by the element count.
median: first, sort the list elements in increasing order. Then find out whether the list length is even or odd. If odd, return the center element. If even, take the two center elements and return their average.
mode: first, create a 'helper' list containing the distinct elements of the input list. Then write a function that takes one parameter, a value, and counts how many times it occurs in the input list. Call this function in a for loop over the distinct elements, saving each result as a tuple (element value, element count). At the end you have a list of tuples; select the tuple with the maximum element count and return the corresponding element value.
Please note that these are just quick hints to help you create your own implementation based on whichever algorithm you prefer. This could be a good exercise to get started with algorithms and data structures; I hope you won't skip it :) Good luck!
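For reference, here is one minimal sketch of what those hints could look like in plain Python (the names my_mean, my_median and my_mode are placeholders; try your own version before peeking):
def my_mean(lst):
    total = 0
    for x in lst:              # sum the elements...
        total += x
    return total / len(lst)   # ...and divide by the element count

def my_median(lst):
    s = sorted(lst)            # sort in increasing order
    mid = len(s) // 2
    if len(s) % 2 == 1:        # odd length: return the center element
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2  # even length: average the two center elements

def my_mode(lst):
    # (value, count) tuples for the distinct elements, then the value with max count
    counts = [(value, lst.count(value)) for value in set(lst)]
    return max(counts, key=lambda pair: pair[1])[0]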

How to create two dimensional set objects under pyomo.environ module

I tried to create an LP model by using pyomo.environ. However, I'm having a hard time creating sets. For my problem, I have to create two sets: one from a collection of nodes, and the other from several arcs between those nodes. I create a network with the Networkx module to store my nodes and arcs.
The node data is saved as (Longitude, Latitude) tuples. The arcs are saved as (nodeA, nodeB), where nodeA and nodeB are both coordinate tuples.
So, a node is something like:
(-97.97516252657978, 30.342243012086083)
And, an arc is something like:
((-97.97516252657978, 30.342243012086083),
(-97.976196300350608, 30.34247219922803))
The way I tried to create the sets is as follows:
# import pyomo.environ as pe
# create a model m
m = pe.ConcreteModel()
# network is an object I created by Networkx module
m.node_set = pe.Set(initialize= self.network.nodes())
m.arc_set = pe.Set(initialize= self.network.edges())
However, I kept getting an error message on arc_set.
ValueError: The value=(-97.97516252657978, 30.342243012086083,
-97.976196300350608, 30.34247219922803) does not have dimension=2,
which is needed for set=arc_set
I find it weird that somehow my arc_set turned into one tuple instead of two. I also tried converting my nodes and arcs to strings, but still got the error.
Could somebody give me a hint? Or how do I fix this bug?
Thanks!
Underneath the hood, Pyomo "flattens" all indexing sets. That is, it removes nested tuples so that each set member is a single tuple of scalar values. This is generally consistent with other algebraic modeling languages, and helps to make sure that we can consistently (and correctly) retrieve component members regardless of how the user attempted to query them.
In your case, Pyomo will want each member of the arc set to be a single 4-member tuple. There is a utility in PyUtilib that you can use to flatten your tuples when constructing the set:
from pyutilib.misc import flatten
m.arc_set = pe.Set(initialize=(tuple(flatten(x)) for x in self.network.edges()))
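For example, assuming flatten behaves as used here, a nested arc tuple becomes one flat 4-member tuple:
>>> arc = ((-97.975, 30.342), (-97.976, 30.342))
>>> tuple(flatten(arc))
(-97.975, 30.342, -97.976, 30.342)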
You can also perform some error checking, in this case to make sure that all edges start and end at known nodes:
from pyutilib.misc import flatten
m.node_set = pe.Set(initialize=self.network.nodes())
m.arc_set = pe.Set(
    within=m.node_set * m.node_set,
    initialize=(tuple(flatten(x)) for x in self.network.edges()),
)
This is particularly important for models like this, where you are using floating point numbers as indices and subtle round-off errors can produce indices that are nearly the same but not mathematically equal.
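One way to defend against such round-off (not part of Pyomo itself; quantize is a hypothetical helper) is to round the coordinates to a fixed precision before using them as indices, applied consistently to both nodes and arcs:
def quantize(node, ndigits=9):
    # round each coordinate so nearly-equal floats map to the same index
    return tuple(round(c, ndigits) for c in node)

m.node_set = pe.Set(initialize=(quantize(n) for n in self.network.nodes()))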
There has been some discussion among the developers to support both structured and flattened indices, but we have not quite reached consensus on how to best support it in a backwards compatible manner.

for every point in a list, compute the mean distance to all other points

I have a numpy array points of shape [N,2] which contains the (x,y) coordinates of N points. I'd like to compute the mean distance of every point to all other points using an existing function (which we'll call cmp_dist and which I just use as a black box).
First, a verbose solution in "normal" Python to illustrate what I want to do (written off the top of my head):
mean_dist = []
for i, (x0, y0) in enumerate(points):
    dist = []
    for j, (x1, y1) in enumerate(points):
        if i == j:
            continue
        dist.append(cmp_dist(x0, y0, x1, y1))
    mean_dist.append(np.array(dist).mean())
I already found a "better" solution using a list comprehension (assuming list comprehensions are usually better), which seems to work just fine:
mean_dist = [np.array([cmp_dist(x0,y0,x1,y1) for j,(x1,y1) in enumerate(points) if not i==j]).mean()
for i,(x0,y0) in enumerate(points)]
However, I'm sure there's a much better solution for this in pure numpy, hopefully some function that allows applying an operation to every element using all other elements.
How can I write this code in pure numpy/scipy?
I tried to find something myself, but this is quite hard to google without knowing what such operations are called (my respective math classes were quite a while back).
Edit: Not a duplicate of Fastest pairwise distance metric in python
The author of that question has a 1D array r and is satisfied with what scipy.spatial.distance.pdist(r, 'cityblock') returns (an array containing the distances between all points). However, pdist returns a flat array, that is, it is not clear which of the distances belongs to which point (see my answer).
(Although, as explained in that answer, pdist is what I was ultimately looking for, it doesn't solve the problem as I've specified it in the question.)
Based on @ali_m's comment on the question ("Take a look at scipy.spatial.distance.pdist"), I found a "pure" numpy/scipy solution:
from scipy.spatial.distance import cdist
...
fct = lambda p0,p1: great_circle_distance(p0[0],p0[1],p1[0],p1[1])
mean_dist = np.sort(cdist(points,points,fct))[:,1:].mean(1)
That's definitely an improvement over my list comprehension "solution".
What I don't really like about this, though, is that I have to sort and slice the array to remove the 0.0 values that result from computing the distance between identical points (basically, that's my way of removing the diagonal entries of the matrix I get back from cdist).
Note two things about the above solution:
I'm using cdist, not pdist as suggested by @ali_m.
I'm getting back an array of the same size as points, which contains the mean distance from every point to all other points, just as specified in the original question.
pdist unfortunately just returns all the pairwise distances in one flat array, that is, the distances are unlinked from the points they refer to, and that link is necessary for the problem as I've described it in the original question.
However, since in the actual problem at hand I only need the mean over the means of all points (which I did not mention in the question), pdist serves me just fine:
from scipy.spatial.distance import pdist
...
fct = lambda p0,p1: great_circle_distance(p0[0],p0[1],p1[0],p1[1])
mean_dist_overall = pdist(points,fct).mean()
This would for sure be the definitive answer if I had asked for the mean of the means, but I purposely asked for the array of means for all points. Because I think there's still room for improvement in the cdist solution above, I won't accept this as THE answer.
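As an aside, the sort-and-slice step in the cdist solution can be avoided: assuming fct(p, p) returns exactly 0, the diagonal contributes nothing to the row sums, so one can simply divide by N-1 (a sketch under that assumption):
d = cdist(points, points, fct)
# the diagonal (distance of each point to itself) is exactly 0,
# so it drops out of the sum and we can divide by N-1 directly
mean_dist = d.sum(axis=1) / (d.shape[0] - 1)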

Find max non-infinity element in pytables CArray

This must be easy, but I'm very new to PyTables. My application has dataset sizes so large they cannot be held in memory, so I use PyTables CArrays. However, I need to find the maximum element of an array that is not infinity. Naively, in numpy I'd do this:
max_element = numpy.max(array[array != numpy.inf])
Obviously that won't work in PyTables without reading the whole array into memory. I could loop through the CArray in windows that fit in memory, but it would surprise me if there weren't a max/min reduction operation. Is there an elegant mechanism to get the conditional maximum element of that array?
If your CArray is one dimensional, it is probably easier to stick it in a single-column Table. Then you have access to the where() method and can easily evaluate expressions like the following.
import numpy as np
from itertools import imap  # Python 2; in Python 3 the builtin map() works

inf = np.inf  # where() conditions take plain variables, not attribute access like np.inf
max_element = max(imap(lambda r: r['col'], tab.where('col != inf')))
This works because where() never reads in all the data at once; it returns an iterator, which is handed off to imap(), which is handed off to max(). Note that in Python 3 you don't need to import imap(); it becomes just the builtin map().
Not using a Table means that you need to use the Expr class and do more of the wiring yourself.
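If converting to a Table is not an option, a chunked brute-force reduction over the CArray is also straightforward. A minimal sketch (carr is a placeholder for your CArray; tune the chunk size to your memory budget):
import numpy as np

max_element = -np.inf
chunk_size = 1000000  # elements per read
for start in range(0, carr.shape[0], chunk_size):
    block = carr[start:start + chunk_size]  # only this slice is read into memory
    finite = block[block != np.inf]
    if finite.size:
        max_element = max(max_element, finite.max())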

Efficiently Removing Very-Near-Duplicates From Python List

Background: My Python program handles relatively large quantities of data, which can be generated in-program or imported. The data is then processed, and during one of these processes the data is deliberately copied, manipulated, cleaned of duplicates and then returned to the program for further use. The data I'm handling is very precise (up to 16 decimal places), and maintaining this accuracy to at least 14dp is vital. However, mathematical operations can of course return slight variations in my floats, such that two values are identical to 14dp but vary ever so slightly at 16dp, meaning the built-in set() function doesn't correctly remove such 'duplicates' (I used this method to prototype the idea, but it's not satisfactory for the finished program). I should also point out I may well be overlooking something simple! I am just interested to see what others come up with :)
Question: What is the most efficient way to remove very-near-duplicates from a potentially very large data set?
My Attempts: I have tried rounding the values themselves to 14dp, but this is of course not satisfactory, as it leads to larger errors down the line. I have a potential solution to this problem, but I am not convinced it is as efficient or 'pythonic' as possible. My attempt involves finding the indices of list entries that match to x dp and then removing one of the matching entries. Thank you in advance for any advice! Please let me know if there's anything you wish to be clarified, or of course if I'm overlooking something very simple (I may be at a point where I'm over-thinking it).
Clarification on 'Duplicates': An example of one of my 'duplicate' entries: 603.73066958946424, 603.73066958946460; the solution would remove one of these values.
Note on decimal.Decimal: This could work if it were guaranteed that the imported data did not already contain near-duplicates (which it often does).
You really want to use NumPy if you're handling large quantities of data. Here's how I would do it :
Import NumPy :
import numpy as np
Generate 8000 high-precision floats (128 bits will be enough for your purposes, but note that I'm converting the 64-bit output of random to 128 bits just to fake it; use your real data here):
a = np.float128(np.random.random((8000,)))
Find the indexes of the unique elements in the rounded array :
_, unique = np.unique(a.round(decimals=14), return_index=True)
And take those indexes from the original (non-rounded) array :
no_duplicates = a[unique]
Why don't you create a dict that maps the 14dp values to the corresponding full 16dp values:
import collections

d = collections.defaultdict(list)
for x in l:
    d[round(x, 14)].append(x)
Now if you just want "unique" (by your definition) values, you can do
unique = [v[0] for v in d.values()]
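For example (a quick interpreter check; 0.1 + 0.2 differs from 0.3 only past the 14th decimal place, so both land under the same key):
>>> l = [0.3, 0.1 + 0.2]          # 0.30000000000000004 vs 0.3
>>> d = collections.defaultdict(list)
>>> for x in l:
...     d[round(x, 14)].append(x)
...
>>> [v[0] for v in d.values()]
[0.3]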
