Checking closeness of coordinates in one group with another - python

I have two groups of coordinates:
{(x1,y1),..(xn,yn)}
{(w1,z1),..(wn,zn)}
and I would like to match each pair in group 2 to the pair in group 1 to which it is closest. My groups are large so the search needs to be efficient.
Any advice on setting this up would be appreciated. Moreover, if I instead had two groups, Group 1 = {(x1,y1,z1),..(xn,yn,zn)} and Group 2 = {(u1,v1,w1),..(un,vn,wn)}, how would my answer differ? Also, given that my groups are too big to hold in memory, any suggestions for overcoming this would be appreciated.

You can use a KD-tree: this data structure lets you find the nearest neighbor efficiently by drastically reducing the number of comparisons. The "KD" stands for "k-dimensional", meaning it handles data in an arbitrary number of dimensions (which answers your question about the 3-D case).
You can build a tree from one of the groups and then, for each element of the other group, query it for the nearest element. SciPy provides a kd-tree implementation.
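For example, a minimal sketch using SciPy's cKDTree; group1 and group2 are placeholder arrays standing in for your two groups (use shape (n, 3) for the 3-D case):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
group1 = rng.random((100_000, 2))        # placeholder for {(x1,y1),..,(xn,yn)}
group2 = rng.random((100_000, 2))        # placeholder for {(w1,z1),..,(wn,zn)}

tree = cKDTree(group1)                   # build the tree once on group 1
dist, idx = tree.query(group2, k=1)      # nearest group-1 point for every group-2 point
# group1[idx[i]] is the closest group-1 pair to group2[i], at distance dist[i]

If the full arrays do not fit in memory, one option is to build the tree on whichever group does fit and stream the other group through tree.query in chunks, writing out the indices as you go.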

Related

Find maximum total weight over set of pairs

I have a set of pairs of record IDs and for each pair a corresponding probability that these records actually belong to each other. Each pair is unique, but any given ID may be part of more than one pairing.
E.g.:
import pandas as pd
df = pd.DataFrame(
    {'ID_1': [1, 1, 1, 2],
     'ID_2': [2, 4, 3, 3],
     'w': [0.5, 0.5, 0.6, 0.7]}
)
df
   ID_1  ID_2    w
0     1     2  0.5
1     1     4  0.5
2     1     3  0.6
3     2     3  0.7
(Note that not every ID can be paired with every other ID, due to factors external to the problem. One could include those missing pairs and give them a probability of 0.)
How can I find the set of pairs in which each ID is assigned to at most one other ID (an ID is also allowed to remain unassigned) such that the overall likelihood of the pairs belonging together is maximized?
The dataframe I want to do this on is quite large, so setting this up as a full maximum-likelihood problem seems a bit over the top. I am not a computer scientist, but I assume there is an algorithm out there for this problem, ideally with a Python implementation.
The way I am doing it right now is greedy, which does not necessarily lead to the optimal solution: I start with the highest-weighted pair, put it into the final set, and drop all pairs that involve either of its IDs. I then continue with the next-highest remaining pair in the same manner until no pairs are left (see the sketch after this question).
(Apologies if this is actually the wrong forum for this kind of question.)
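For reference, a rough sketch of that greedy procedure, reusing the df defined above:

chosen = []
remaining = df.sort_values('w', ascending=False)
while not remaining.empty:
    best = remaining.iloc[0]                              # highest remaining weight
    i1, i2 = int(best['ID_1']), int(best['ID_2'])
    chosen.append((i1, i2, float(best['w'])))
    # drop every remaining pair that reuses either of the two IDs just taken
    remaining = remaining[~remaining['ID_1'].isin([i1, i2]) & ~remaining['ID_2'].isin([i1, i2])]
print(chosen)   # [(2, 3, 0.7), (1, 4, 0.5)] for the sample data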
One approach is to switch from the column-row model you have with the data frame to a graph model. There are several Python libraries that can do this, including NetworkX: https://pypi.org/project/networkx/
The idea is that each ID becomes a node in the graph and each pair becomes an edge carrying its probability as a weight. Once you have that data structure, you can take any given node and find its highest-weight edge, and you can run all sorts of edge-weight-based algorithms.
There is another Python library, https://github.com/pgmpy/pgmpy , which is built on NetworkX and is probability-aware. It might match what you need even more closely.
For this sort of query a graph library is far more efficient than trying to do it with row-column data structures.
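As a concrete sketch of that idea: NetworkX ships a maximum-weight matching routine, which solves exactly the "each ID used at most once, total weight maximized" problem optimally, unlike the greedy approach, which can miss the optimum:

import networkx as nx
import pandas as pd

df = pd.DataFrame({'ID_1': [1, 1, 1, 2],
                   'ID_2': [2, 4, 3, 3],
                   'w': [0.5, 0.5, 0.6, 0.7]})

G = nx.Graph()
G.add_weighted_edges_from(df[['ID_1', 'ID_2', 'w']].itertuples(index=False))
matching = nx.max_weight_matching(G)   # a set of matched (ID, ID) pairs
# For the sample data this selects the pairs (1, 4) and (2, 3), total weight 1.2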

How should I group these elements such that overall variance is minimized?

I have a set of elements, for example:
x= [250,255,273,180,400,309,257,368,349,248,401,178,149,189,46,277,293,149,298,223]
I want to group these into n groups A, B, C, ... such that the sum of all group variances is minimized. Each group need not have the same number of elements.
I would like an optimization approach in Python or R.
I would sort the numbers into increasing order and then use dynamic programming to work out where to place the boundaries between groups of contiguous elements. For example, if the only constraint is that every number must be in exactly one group, work from left to right: at each stage, for i = 1..n, work out the set of boundaries that produces the minimum total variance when the elements seen so far are split into i groups. For i = 1 there is no choice. For i > 1, consider every possible location for the boundary of the last group, look up the previously computed answer for the best allocation of the items before that boundary into i - 1 groups, and add that figure to the variance of the last group.
(I haven't done the algebra, but I believe that if you have groups A and B where mean(A) < mean(B) but there are elements a in A and b in B such that a > b, you can reduce the total variance by swapping them between groups. So the minimum variance must come from groups that are contiguous when the elements are written out in sorted order.)
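A rough sketch of that dynamic program, assuming the number of groups is given and "variance" means the population variance of each group (min_total_variance is a made-up name; prefix sums make each group's variance O(1) to evaluate):

import numpy as np

def min_total_variance(x, n_groups):
    xs = np.asarray(sorted(x), dtype=float)
    N = len(xs)
    pref = np.concatenate(([0.0], np.cumsum(xs)))        # prefix sums
    pref2 = np.concatenate(([0.0], np.cumsum(xs * xs)))  # prefix sums of squares

    def var(i, j):                                       # population variance of xs[i:j]
        m = j - i
        s = pref[j] - pref[i]
        return (pref2[j] - pref2[i]) / m - (s / m) ** 2

    INF = float('inf')
    # best[k][j]: minimal total variance when the first j sorted values form k groups
    best = [[INF] * (N + 1) for _ in range(n_groups + 1)]
    back = [[0] * (N + 1) for _ in range(n_groups + 1)]
    best[0][0] = 0.0
    for k in range(1, n_groups + 1):
        for j in range(k, N + 1):
            for i in range(k - 1, j):                    # i = start of the last group
                cand = best[k - 1][i] + var(i, j)
                if cand < best[k][j]:
                    best[k][j] = cand
                    back[k][j] = i
    groups, j = [], N                                    # walk back-pointers to recover groups
    for k in range(n_groups, 0, -1):
        i = back[k][j]
        groups.append(xs[i:j].tolist())
        j = i
    return best[n_groups][N], groups[::-1]

x = [250, 255, 273, 180, 400, 309, 257, 368, 349, 248,
     401, 178, 149, 189, 46, 277, 293, 149, 298, 223]
print(min_total_variance(x, 3))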

Finding the closest possible values from two dictionaries

Let's suppose you have two existing dictionaries A and B
If you have already chosen an initial item from each of A and B, with values A1 = 1.0 and B1 = 2.0 respectively, is there any way to find two different existing items in A and B whose values (call them A2 and B2) differ from A1 and B1 and also minimize (A2-A1)**2 + (B2-B1)**2?
The number of items in the dictionaries is not fixed and could exceed 100,000.
Edit - This is important: the keys of A and B are the same, but the values corresponding to those keys differ between A and B. A particular choice of key yields an ordered pair (A1, B1) that is different from any other possible ordered pair (A2, B2); different keys give different ordered pairs. For example, both A and B have the key (3,4), and it yields the value 1.0 in dict A and 2.0 in dict B. This one key is then compared against every other key to find the other ordered pair (i.e. both the key and the values of the items in A and B) that minimizes the squared differences between them.
You'll need a specialized data structure, not a standard Python dictionary. Look up quad-trees and kd-trees. You are effectively minimizing the Euclidean distance between two points (your objective function is just a square root away from Euclidean distance, with dictionary A storing the x-coordinates and B the y-coordinates). Computational-geometry people have been studying this for years.
Well, maybe I am misreading your question and making it harder than it is. Are you saying that you can pick any value from A and any value from B, regardless of whether their keys are the same? For instance, the pick from A could be the key:value pair (3,4): 2.0, and the pick from B could be (5,6): 3.0? Or does it have to be (3,4): 2.0 from A and (3,4): 6.0 from B? If the former, the problem is easy: just run through the values of A and find the one closest to A1, then run through the values of B and find the one closest to B1. If the latter, my first paragraph was the right answer.
Your comment says that the harder problem is the one you want to solve, so here is a little more. Sedgewick's slides explain how the static grid, the 2d-tree, and the quad-tree work. http://algs4.cs.princeton.edu/lectures/99GeometricSearch.pdf . Slides 15 through 29 explain mainly the 2d-tree, with 27 through 29 covering the solution to the nearest-neighbor problem. Since you have the constraint that the point the algorithm finds must share neither x- nor y-coordinate with the query point, you might have to implement the algorithm yourself or modify an existing implementation. One alternative strategy is to use a kNN data structure (k nearest neighbors, as opposed to the single nearest neighbor), experiment with k, and hope that your chosen k will always be large enough to find at least one neighbor that meets your constraint.
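A rough sketch of that kNN fallback using SciPy's cKDTree (closest_other and its k parameter are made-up names; the constraint check is an assumption about your problem, not part of any library API):

import numpy as np
from scipy.spatial import cKDTree

def closest_other(A, B, query_key, k=10):
    keys = list(A)
    pts = np.array([(A[key], B[key]) for key in keys])   # one 2-D point per shared key
    tree = cKDTree(pts)
    qx, qy = A[query_key], B[query_key]
    dists, idxs = tree.query((qx, qy), k=k)              # k nearest candidates
    for d, i in zip(np.atleast_1d(dists), np.atleast_1d(idxs)):
        if i >= len(keys):                               # fewer than k points in the tree
            break
        key = keys[i]
        if key != query_key and A[key] != qx and B[key] != qy:
            return key, d * d                            # squared distance (A2-A1)**2 + (B2-B1)**2
    return None                                          # no candidate met the constraint; raise k and retry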

Keeping Track of Dynamic Programming Steps

I'm teaching myself basic programming principles, and I'm stuck on a dynamic programming problem. Let's take the infamous Knapsack Problem:
Given a set of items, each with a weight and a value, determine the count of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.
Let's set the weight limit to 10, and let's give two lists: weights = [2,4,7] and values = [8,4,9] (I just made these up). I can write the code to give the maximum value given the constraint--that's not a problem. But what about if I want to know what values I actually ended up using? Not the total value--the individual values. So for this example, the maximum would be the objects with weights 2 and 7, for a total value of 8 + 9 = 17. I don't want my answer to read "17" though--I want an output of a list like: (8, 9). It might be easy for this problem, but the problem I'm working on uses lists that are much bigger and that have repeat numbers (for example, multiple objects have a value of 8).
Let me know if anyone can think of anything. As always, much love and appreciation to the community.
Consider each partial solution a node. Record in each node whatever items you used to reach it, and whichever node turns out to be the answer at the end will contain the set of items you used.
Each time you find a better solution for a state, you set its list of items to the list of items of the new optimal solution, and repeat for each state.
A basic array implementation can help you keep track of which item enabled a new DP state to get its value. For example, if your DP array is w[], you can keep another array p[]. Every time a state is generated for w[i], set p[i] to the item you used to reach w[i]. Then, to output the list of items used for w[n], output p[n] and move to index n - weightOf(p[n]), repeating until you reach 0, to output all the items.
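The answer above describes the one-dimensional table that suits the "count of each item" (unbounded) wording. Here is a rough sketch of the same bookkeeping idea for the 0/1 variant that matches the worked example (each item used at most once), walking the DP table backwards to recover the individual values:

def knapsack_01_with_items(weights, values, limit):
    n = len(weights)
    # best[i][c] = max value using only the first i items within capacity c
    best = [[0] * (limit + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        wt, val = weights[i - 1], values[i - 1]
        for c in range(limit + 1):
            best[i][c] = best[i - 1][c]                    # option 1: skip item i-1
            if wt <= c and best[i - 1][c - wt] + val > best[i][c]:
                best[i][c] = best[i - 1][c - wt] + val     # option 2: take item i-1
    # Walk the table backwards to recover which individual values were used.
    used, c = [], limit
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:                   # item i-1 must have been taken
            used.append(values[i - 1])
            c -= weights[i - 1]
    return best[n][limit], used[::-1]

print(knapsack_01_with_items([2, 4, 7], [8, 4, 9], 10))    # (17, [8, 9])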

Suppose I have 2 vectors. What algorithms can I use to compare them?

Company 1 has this vector:
['books','video','photography','food','toothpaste','burgers'] ... ...
Company 2 has this vector:
['video','processor','photography','LCD','power supply', 'books'] ... ...
Suppose this is a frequency distribution (I could make it a tuple but too much to type).
As you can see, these vectors overlap. "video" and "photography" seem to be "similar" between the two vectors because they sit in similar positions. And "books" is obviously a strong point for company 1.
Ordering and positioning does matter, as this is a frequency distribution.
What algorithms could you use to play around with this? What algorithms could you use that could provide valuable data for these companies, using these vectors?
I am new to text-mining and information-retrieval. Could someone guide me about those topics in relation to this question?
If position is very important, as you emphasize, then the crucial metric will be based on the difference in positions between the same items in the two vectors (you can, for example, sum the absolute values of the differences, or their squares). The big issue that needs to be solved is how much to weigh an item that is present (say as the N-th one) in one vector and completely absent from the other. Is that a relatively minor issue -- as if the missing item were actually present right after the real ones, for example -- or a really, really big deal? That's impossible to say without more understanding of the actual application area. You can try various ways to deal with this issue and see what results they give on example cases you care about!
For example, suppose "a missing item is roughly the same as if it were present, right after the actual ones". Then you can preprocess each input vector into a dict mapping item to position (a crucial optimization if you have to compare many pairs of input vectors!):
def makedict(avector):
    return dict((item, i) for i, item in enumerate(avector))
and then, to compare two such dicts:
def comparedicts(d1, d2):
    allitems = set(d1) | set(d2)
    distances = [d1.get(x, len(d1)) - d2.get(x, len(d2)) for x in allitems]
    return sum(d * d for d in distances)
(or, abs(d) instead of the squaring in the last statement). To make missing items weigh more (make dicts, i.e. vectors, be considered further away), you could use twice the lengths instead of just the lengths, or some large constant such as 100, in an otherwise similarly structured program.
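For instance, applied to the two example company vectors using the functions above:

a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']
print(comparedicts(makedict(a), makedict(b)))   # 78: the sum of squared position differences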
I would suggest a book called Programming Collective Intelligence. It's a very nice book on how to retrieve information from simple data like this, and it includes code examples (in Python :)
Edit:
Just replying to gbjbaanb: This is Python!
>>> a = ['books','video','photography','food','toothpaste','burgers']
>>> b = ['video','processor','photography','LCD','power supply', 'books']
>>> a = set(a)
>>> b = set(b)
>>> a.intersection(b)
set(['photography', 'books', 'video'])
>>> b.intersection(a)
set(['photography', 'books', 'video'])
>>> b.difference(a)
set(['LCD', 'power supply', 'processor'])
>>> a.difference(b)
set(['food', 'toothpaste', 'burgers'])
Take a look at Hamming Distance
As mbg mentioned, the Hamming distance is a good start. It basically assigns a bitmask over every possible item, recording whether it is contained in the company's offering.
E.g. toothpaste is 1 for company A but 0 for company B. You then count the bits that differ between the companies. The Jaccard coefficient is closely related to this (see the example below).
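For example, treating each company's list as a set of items:

a = {'books', 'video', 'photography', 'food', 'toothpaste', 'burgers'}
b = {'video', 'processor', 'photography', 'LCD', 'power supply', 'books'}

vocab = sorted(a | b)
bits_a = [int(item in a) for item in vocab]              # presence/absence bitmask per company
bits_b = [int(item in b) for item in vocab]
hamming = sum(x != y for x, y in zip(bits_a, bits_b))    # 6 positions differ
jaccard = len(a & b) / len(a | b)                        # 3 shared out of 9 total items = 1/3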
Hamming distance will actually not be able to capture similarity between things like "video" and "photography". Obviously, a company that sells one also sells the other with higher probability than a company that sells toothpaste.
For this, you can use techniques like LSI (which is also used for dimensionality reduction) or factorial codes (e.g. neural-network approaches such as Restricted Boltzmann Machines, autoencoders or Predictability Minimization) to get more compact representations, which you can then compare using the Euclidean distance.
Pick the rank of each entry (a higher rank is better) and take the sum of geometric means over the matching entries:
for two vectors, sum(sqrt(x * y)) over the matched items.
The ranks for each vector should sum to the same total (preferably 1); that way you can also compare more than two vectors (see the sketch below).
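A rough sketch of that idea with made-up, normalised rank weights (this sum over the shared items is essentially the Bhattacharyya coefficient between the two rank distributions):

import math

ranks_a = {'books': 0.30, 'video': 0.25, 'photography': 0.20,
           'food': 0.10, 'toothpaste': 0.10, 'burgers': 0.05}
ranks_b = {'video': 0.30, 'processor': 0.25, 'photography': 0.20,
           'LCD': 0.10, 'power supply': 0.10, 'books': 0.05}

# Sum of geometric means over the items both companies carry.
similarity = sum(math.sqrt(ranks_a[k] * ranks_b[k])
                 for k in ranks_a.keys() & ranks_b.keys())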
If you apply ikkebr's method you can find how similar a is to b.
In that case, assuming b maps each item to its frequency, just use something like
sum(b[item] for item in set(b).intersection(a))
You could use C++'s std::set_intersection algorithm. The two vectors must be sorted first (use std::sort), then pass in the four input iterators plus an output iterator and the common elements get copied into the destination collection. There are a few other set algorithms that operate similarly.
