Let's suppose you have two existing dictionaries A and B.
If you have already chosen an initial item from each of A and B, with values A1 = 1.0 and B1 = 2.0 respectively, is there any way to find two other existing items in A and B whose values (call them A2 and B2) differ from A1 and B1, and that also minimize the value (A2-A1)**2 + (B2-B1)**2?
The number of items in the dictionaries is not fixed and could exceed 100,000.
Edit - This is important: the keys for A and B are the same, but the values corresponding to those keys in A and B differ. A particular choice of key yields an ordered pair (A1, B1) that is different from any other possible ordered pair (A2, B2); different keys give different ordered pairs. For example, both A and B will have the key (3,4), which yields a value of 1.0 in dict A and 2.0 in dict B. This one key is then compared against every other key to find the other ordered pair (i.e. both the key and the values of the items in A and B) that minimizes the squared differences between them.
You'll need a specialized data structure, not a standard Python dictionary; look up quad-trees and kd-trees. You are effectively minimizing the Euclidean distance between two points: your objective function is just a square root away from Euclidean distance, with dictionary A storing the x-coordinates and B the y-coordinates. Computational-geometry people have been studying this problem for years.
Well, maybe I am misreading your question and making it harder than it is. Are you saying that you can pick any value from A and any value from B, regardless of whether their keys are the same? For instance, the pick from A could be K:V (3,4):2.0, and the pick from B could be (5,6):3.0? Or does it have to be (3,4):2.0 from A and (3,4):6.0 from B? If the former, the problem is easy: just run through the values from A and find the closest to A1; then run through the values from B and find the closest to B1. If the latter, my first paragraph was the right answer.
Your comment says that the harder problem is the one you want to solve, so here is a little more. Sedgewick's slides explain how the static grid, the 2d-tree, and the quad-tree work. http://algs4.cs.princeton.edu/lectures/99GeometricSearch.pdf . Slides 15 through 29 explain mainly the 2d-tree, with 27 through 29 covering the solution to the nearest-neighbor problem. Since you have the constraint that the point the algorithm finds must share neither x- nor y-coordinate with the query point, you might have to implement the algorithm yourself or modify an existing implementation. One alternative strategy is to use a kNN data structure (k nearest neighbors, as opposed to the single nearest neighbor), experiment with k, and hope that your chosen k will always be large enough to find at least one neighbor that meets your constraint.
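If you go the kNN route with an existing library, a rough sketch with SciPy's cKDTree might look like the following. The toy dictionaries, the choice of query key, and the value of k are illustrative only; you filter the returned candidates against your constraint yourself.
# Build one 2-D point (A[k], B[k]) per shared key, then query k nearest
# neighbors and keep the first one that differs from (A1, B1) in both values.
import numpy as np
from scipy.spatial import cKDTree

A = {(3, 4): 1.0, (5, 6): 2.5, (1, 2): 3.0, (7, 8): 1.5}
B = {(3, 4): 2.0, (5, 6): 2.5, (1, 2): 0.5, (7, 8): 4.0}

keys = list(A)
points = np.array([(A[k], B[k]) for k in keys])    # (x, y) = (A-value, B-value)
tree = cKDTree(points)

query_key = (3, 4)                                 # the item you already picked
x1, y1 = A[query_key], B[query_key]

dists, idxs = tree.query((x1, y1), k=min(10, len(keys)))   # experiment with k
best = None
for dist, i in zip(dists, idxs):
    k = keys[i]
    if k != query_key and A[k] != x1 and B[k] != y1:
        best = k                                   # nearest admissible (A2, B2)
        break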
I have a set of pairs of record IDs and for each pair a corresponding probability that these records actually belong to each other. Each pair is unique, but any given ID may be part of more than one pairing.
E.g.:
import pandas as pd
df = pd.DataFrame(
    {'ID_1': [1, 1, 1, 2],
     'ID_2': [2, 4, 3, 3],
     'w': [0.5, 0.5, 0.6, 0.7]}
)
df
   ID_1  ID_2    w
0     1     2  0.5
1     1     4  0.5
2     1     3  0.6
3     2     3  0.7
(Note that not every ID has to be assigned to every other ID due to factors external to the problem. One could include those pairs and give them a probability of 0.)
How can I find the set of pairs in which each ID is assigned to another ID at most once (but an ID is allowed to remain unassigned), such that the overall likelihood of the pairs belonging to each other is maximized?
The dataframe I want to do this on is quite large, so setting this up as a maximum likelihood problem seems a bit over the top. I am not a computer scientist, but I thought there is probably an existing algorithm to solve this problem - ideally with a Python implementation.
The way I am doing it right now is somewhat greedy, which does not necessarily lead to the optimal solution. I start with the highest-ranked pair, put it into the final set, and drop all pairs that involve either of its IDs. I then continue with the next highest-ranked pair from the updated set in the same manner until no pairs are left.
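For concreteness, this is a minimal sketch of that greedy procedure, assuming the example df defined above; it is not guaranteed to be optimal in general.
# Greedy: take pairs in descending weight order, skipping any pair whose
# IDs were already used.
chosen, used = [], set()
for row in df.sort_values('w', ascending=False).itertuples(index=False):
    if row.ID_1 not in used and row.ID_2 not in used:
        chosen.append((row.ID_1, row.ID_2, row.w))
        used.update((row.ID_1, row.ID_2))
# chosen -> [(2, 3, 0.7), (1, 4, 0.5)]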
(Apologies if this is actually the wrong forum for this kind of question.)
One approach is to switch from the row-column model you have with the data frames to a graph model. There are several Python libraries that can do this, including NetworkX: https://pypi.org/project/networkx/
The idea is that each of your IDs becomes a node in the graph, and each pair becomes an edge carrying its weight. Once you have that data structure, you can take any given node and find its highest-weight edge, and you can run all sorts of edge-weight-based path and matching algorithms.
There is another Python library, https://github.com/pgmpy/pgmpy, built on top of NetworkX, that is even probability-aware. It might match what you need even more closely.
For this sort of query a graph library is oodles more efficient than trying to do it with row-column data structures.
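For what it's worth, NetworkX also ships a maximum-weight matching routine (nx.max_weight_matching), which, if I'm reading the problem right, is exactly the "each ID used at most once, total weight maximized" task. A minimal sketch, assuming the example df from the question:
# IDs become nodes, pairs become weighted edges; max_weight_matching then
# returns a set of pairs in which each ID appears at most once and the
# total weight is maximized.
import networkx as nx

G = nx.Graph()
for row in df.itertuples(index=False):
    G.add_edge(row.ID_1, row.ID_2, weight=row.w)

matching = nx.max_weight_matching(G)                 # e.g. {(2, 3), (1, 4)}
total = sum(G[u][v]['weight'] for u, v in matching)  # 1.2 for the example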
For example, suppose I have an (n, 2)-shaped tensor t whose elements are all drawn from a set S of random integers. I want to build another tensor d of shape (m, 2) whose individual elements also come from S, but whose rows (tuples) do not occur in t.
E.g.
S = [0, 1, 2, 3, 7]
t = [[0, 1],
     [7, 3],
     [3, 1]]
d = some_algorithm(S, t)
# d = [[2, 1],
#      [3, 2],
#      [7, 4]]
What is the most efficient way to do this in Python? Preferably with PyTorch or NumPy, but I can work with more general solutions as well.
In my naive attempt, I just use
import numpy as np

d = np.random.choice(S, (m, 2))
non_dupes = [row.tolist() not in t for row in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
My naive attempt would be a base-transformation function that reduces the problem to an integer-set problem.
Definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
Since S is a set, we can write each element of t as an M-digit number in base L, using the indices I(x) of its elements as digits.
For M = 2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
Since you can now represent t as an integer set (read: hash table), calculating the complement set d becomes rather trivial:
loop over all codes from 0 to L^M - 1 and ask your hash table whether each one is already in t (or already in d).
If the size of S is too big for full enumeration, you can instead just draw random codes and check them against the hash table to collect a subset of the complement of t.
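A minimal sketch of this encoding for M = 2 (the function names and the small example S and t are just for illustration):
# encode maps a pair to its base-L integer code; decode inverts it.
S = [0, 1, 2, 3, 7]
L = len(S)
index = {x: i for i, x in enumerate(S)}        # I(x)

def encode(pair):
    return index[pair[1]] * L + index[pair[0]]

def decode(n):
    return (S[n % L], S[(n // L) % L])

t = [(0, 1), (7, 3), (3, 1)]
seen = {encode(p) for p in t}                  # t as an integer hash set

# enumerate (or randomly sample) codes 0 .. L**2 - 1 that are not in t
d = [decode(n) for n in range(L**2) if n not in seen]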
does this help you?
If |t| + |d| << |S|^2, then the probability that a randomly drawn tuple has already been chosen (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that, by redrawing elements until you hit a new one, you need O((1/(1-C)) * |d|) draws (on average) to produce all of d, which is O(|d|) if C is indeed a constant.
Checking if an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup is O(1) constant time. You could also use a Bloom filter instead of storing the actual elements you have already seen; it will make some errors, saying an element has been "seen" even though it was not, but never the other way around - so you will still get all elements of d as unique.
Sorting t in place and using binary search. This adds O(|t| log |t|) pre-processing and O(log |t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2)-time solution could be to run a Fisher-Yates shuffle on all the available choices and take the first |d| elements that do not appear in t.
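A minimal sketch of the redraw-until-new approach with a hash set (the function name and the small example are illustrative):
# Draw random pairs from S x S, skipping anything already in t or already
# produced; assumes |t| + m is much smaller than |S|**2 so redraws are rare.
import random

def sample_new_pairs(S, t, m):
    seen = set(map(tuple, t))          # O(1) membership tests
    d = []
    while len(d) < m:
        pair = (random.choice(S), random.choice(S))
        if pair not in seen:
            seen.add(pair)             # also keeps d itself duplicate-free
            d.append(pair)
    return d

d = sample_new_pairs([0, 1, 2, 3, 7], [[0, 1], [7, 3], [3, 1]], 4)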
I have two groups of coordinates:
{(x1,y1),..(xn,yn)}
{(w1,z1),..(wn,zn)}
and I would like to match each pair in group 2 to the pair in group 1 to which it is closest. My groups are large so the search needs to be efficient.
Any advice on setting this up would be appreciated. Moreover, if I instead had two groups with Group 1 = {(x1,y1,z1),..(xn,yn,zn)} and Group 2 = {(u1,v1,w1),..(un,vn,wn)}, how would the answer differ? Also, considering that my groups are too big to hold in memory, any suggestions for overcoming this issue would be appreciated.
You can use a kd-tree: this data structure lets you find the nearest neighbor efficiently, significantly reducing the number of comparisons. The "kd" stands for "k-dimensional", meaning it can handle data in an arbitrary number of dimensions, which answers your question about three-dimensional points.
You can build the tree from one of the lists and then, for each element of the other list, query it for the nearest element. SciPy provides a kd-tree implementation.
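A minimal sketch with SciPy's cKDTree (the array names and sizes are just placeholders):
# Build the tree on group 1, then query it once with all of group 2.
import numpy as np
from scipy.spatial import cKDTree

group1 = np.random.rand(100000, 2)     # {(x_i, y_i)}
group2 = np.random.rand(100000, 2)     # {(w_i, z_i)}

tree = cKDTree(group1)
dists, idx = tree.query(group2)        # idx[i]: index in group1 nearest to group2[i]
# the same code works unchanged for 3-D points; just use (n, 3) arrays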
So I have a dictionary with key: value(tuple) entries, something like {"name": (4, 5), ...}, where
(4, 5) represents two categories (cat1, cat2). Given a maximum total for the second category, I would like to find the optimal combination of dictionary entries such that the first category is maximized or minimized.
For example, if maxCat2 = 15, I want to find some combination of entries from the dictionary such that, when I add their cat2 values together, I stay under 15. There may be many such combinations. Of these possibilities, I would like to pick the one whose total cat1 value is larger than that of any other possibility.
I thought about writing an algorithm to enumerate every combination of entries in the dictionary, check whether each one meets the maxCat2 criterion, and then see which of those gives me the largest total cat1 value. With just 20 entries that already means checking 2^20 combinations, and it grows exponentially. Is there anything that I can do to avoid this? Thanks.
As Jochen Ritzel pointed out, this can be seen as an instance of the knapsack problem.
Typically, you have a set of objects that each have both a "weight" (the "second category" in your example) and a "value" (or a "cost", if it is a minimization problem).
The problem consists of picking a subset of the objects such that the sum of their "values" is maximized/minimized, subject to the constraint that the sum of their weights cannot exceed a specified maximum.
Though the problem is intractable (NP-hard) in general, there is a pseudo-polynomial time solution using dynamic programming or memoization: with n objects and a maximum total weight of w, it takes O(n*w) time.
Very broadly, the idea is to define a table of values C[i][j], where
C[i][j] denotes the maximum sum (of "values") attainable considering only the first i objects, with the total weight of the chosen subset not exceeding j.
There are two possible choices when calculating C[i][j]:
either element i is included in the subset, and then
C[i][j] = value[i] + C[i-1][j - weight[i]]
or element i is not in the chosen subset, so
C[i][j] = C[i-1][j]
The maximum of the two is picked.
If n is the number of elements and w is the maximum total weight, the answer ends up in C[n][w].
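A minimal sketch of that DP in Python, using the {"name": (cat1, cat2), ...} layout from the question; the helper name and the example entries are made up, and the cat2 values (weights) are assumed to be non-negative integers:
def best_subset(items, max_cat2):
    names = list(items)
    # C[j] = best total cat1 achievable with a cat2 budget of j
    C = [0] * (max_cat2 + 1)
    keep = [[False] * (max_cat2 + 1) for _ in names]
    for i, name in enumerate(names):
        value, weight = items[name]
        for j in range(max_cat2, weight - 1, -1):   # iterate budgets downward
            if C[j - weight] + value > C[j]:
                C[j] = C[j - weight] + value
                keep[i][j] = True
    # walk back through keep to recover which entries were chosen
    chosen, j = [], max_cat2
    for i in range(len(names) - 1, -1, -1):
        if keep[i][j]:
            chosen.append(names[i])
            j -= items[names[i]][1]
    return C[max_cat2], chosen

total, picked = best_subset({"a": (10, 5), "b": (6, 4), "c": (7, 6)}, 15)
# total = 23, picked = ['c', 'b', 'a'] for this made-up example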
Company 1 has this vector:
['books','video','photography','food','toothpaste','burgers'] ... ...
Company 2 has this vector:
['video','processor','photography','LCD','power supply', 'books'] ... ...
Suppose this is a frequency distribution (I could make it a tuple but too much to type).
As you can see...these vectors have things that overlap. "video" and "photography" seem to be "similar" between two vectors due to the fact that they are in similar positions. And..."books" is obviously a strong point for company 1.
Ordering and positioning does matter, as this is a frequency distribution.
What algorithms could you use to play around with this? What algorithms could you use that could provide valuable data for these companies, using these vectors?
I am new to text-mining and information-retrieval. Could someone guide me about those topics in relation to this question?
If position is very important, as you emphasize, then the crucial metric will be based on the difference in positions of the same items in the two vectors (you can, for example, sum the absolute values of the differences, or their squares). The big issue that needs to be solved is how much to weigh an item that is present (say it's the N-th one) in one vector and completely absent from the other. Is that a relatively minor issue -- as if the missing item were actually present right after the actual ones, for example -- or a really, really big deal? That's impossible to say without more understanding of the actual application area. You can try various ways to deal with this issue and see what results they give on example cases you care about!
For example, suppose "a missing item is roughly the same as if it were present, right after the actual ones". Then, you can preprocess each input vector into a dict mapping item to position (crucial optimization if you have to compare many pairs of input vectors!):
def makedict(avector):
    return dict((item, i) for i, item in enumerate(avector))
and then, to compare two such dicts:
def comparedicts(d1, d2):
    allitems = set(d1) | set(d2)
    distances = [d1.get(x, len(d1)) - d2.get(x, len(d2)) for x in allitems]
    return sum(d * d for d in distances)
(or use abs(d) instead of the squaring in the last statement). To make missing items weigh more (i.e. make the dicts/vectors be considered further apart), you could use twice the lengths instead of just the lengths, or some large constant such as 100, in an otherwise similarly structured program.
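For instance, a variant of comparedicts with that penalty made explicit might look like this (the function name, the keyword arguments, and the default of twice the length are just one possible choice):
def comparedicts_penalized(d1, d2, penalty1=None, penalty2=None):
    # missing items are treated as sitting at position `penalty` instead of len(d)
    p1 = 2 * len(d1) if penalty1 is None else penalty1
    p2 = 2 * len(d2) if penalty2 is None else penalty2
    allitems = set(d1) | set(d2)
    distances = [d1.get(x, p1) - d2.get(x, p2) for x in allitems]
    return sum(d * d for d in distances)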
I would suggest a book called Programming Collective Intelligence. It's a very nice book on how to retrieve information from simple data like this, and it includes code examples (in Python :)
Edit:
Just replying to gbjbaanb: This is Python!
>>> a = ['books','video','photography','food','toothpaste','burgers']
>>> b = ['video','processor','photography','LCD','power supply', 'books']
>>> a = set(a)
>>> b = set(b)
>>> a.intersection(b)
set(['photography', 'books', 'video'])
>>> b.intersection(a)
set(['photography', 'books', 'video'])
>>> b.difference(a)
set(['LCD', 'power supply', 'processor'])
>>> a.difference(b)
set(['food', 'toothpaste', 'burgers'])
Take a look at Hamming Distance
As mbg mentioned, the Hamming distance is a good start. It basically assigns a bit for every possible item, indicating whether that item is carried by the company.
E.g. toothpaste is 1 for company A, but 0 for company B. You then count the bits that differ between the companies. The Jaccard coefficient is related to this.
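A minimal sketch of both measures on the two example vectors (redefined here as lists so the snippet is self-contained):
# One bit per possible item: 1 if the company carries it, 0 otherwise.
a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']

items = sorted(set(a) | set(b))
bits_a = [int(x in a) for x in items]
bits_b = [int(x in b) for x in items]

hamming = sum(x != y for x, y in zip(bits_a, bits_b))        # 6 differing bits
jaccard = len(set(a) & set(b)) / len(set(a) | set(b))        # 3 / 9, about 0.33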
Hamming distance will not actually be able to capture similarity between things like "video" and "photography". Obviously, a company that sells one also sells the other with higher probability than a company that sells toothpaste.
For this, you can use things like LSI (it's also used for dimensionality reduction) or factorial codes (e.g. neural-network approaches such as restricted Boltzmann machines, autoencoders, or predictability minimization) to get more compact representations, which you can then compare using the Euclidean distance.
Pick the rank of each entry (higher rank is better) and take the sum of geometric means over the matching entries.
For two vectors:
sum(sqrt(vector_multiply(x, y)))   # multiply matching entries
The sum of ranks over each vector should be the same for every vector (preferably 1).
That way you can compare more than two vectors.
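One possible reading of this suggestion in Python (the normalization scheme and the function names are my own choices): ranks are turned into weights that sum to 1 per vector, and matched items contribute the geometric mean of their weights.
# rank_weights: first position gets the highest weight; weights sum to 1.
from math import sqrt

def rank_weights(vector):
    n = len(vector)
    total = n * (n + 1) / 2
    return {item: (n - i) / total for i, item in enumerate(vector)}

def similarity(v1, v2):
    r1, r2 = rank_weights(v1), rank_weights(v2)
    # items missing from either vector contribute 0
    return sum(sqrt(r1[x] * r2[x]) for x in set(r1) & set(r2))

a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']
print(similarity(a, b))   # 1.0 would mean identical rankings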
If you apply ikkebr's method you can find how similar a is to b.
In that case just use
sum(b[x] for x in b.intersection(a))
i.e. treat b as a frequency mapping and sum its frequencies over the items shared with a.
You could use the set_intersection algorithm (this is C++'s std::set_intersection). The two vectors must be sorted first (call sort on them), then you pass in four iterators and get back a collection containing the common elements. There are a few other algorithms that operate similarly.