I have a set of pairs of record IDs and for each pair a corresponding probability that these records actually belong to each other. Each pair is unique, but any given ID may be part of more than one pairing.
E.g.:
import pandas as pd
df = pd.DataFrame(
    {'ID_1': [1, 1, 1, 2],
     'ID_2': [2, 4, 3, 3],
     'w': [0.5, 0.5, 0.6, 0.7]}
)
df
   ID_1  ID_2    w
0     1     2  0.5
1     1     4  0.5
2     1     3  0.6
3     2     3  0.7
(Note that not every ID has to be assigned to every other ID due to factors external to the problem. One could include those pairs and give them a probability of 0.)
How can I find the set of pairs in which each ID is assigned to at most one other ID (an ID is allowed to not be assigned at all) such that the overall likelihood of the pairs belonging together is maximized?
The dataframe I want to do this on is quite large, so setting this up as a full maximum likelihood problem seems a bit over the top. I am not a computer scientist, but I suspect there is an algorithm out there for exactly this problem, ideally with an existing Python implementation.
The way I am doing it right now is greedy, which does not necessarily lead to the optimal solution: I start with the highest-weighted pair, put it into the final set, and drop all remaining pairs that involve either of its IDs. I then continue with the next highest-weighted pair in the updated set in the same manner until no pairs are left.
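For reference, a minimal sketch of that greedy procedure on the example dataframe (not guaranteed to be optimal in general):
import pandas as pd

df = pd.DataFrame({'ID_1': [1, 1, 1, 2],
                   'ID_2': [2, 4, 3, 3],
                   'w':    [0.5, 0.5, 0.6, 0.7]})

def greedy_pairs(df):
    # Walk the pairs from highest to lowest weight, keeping a pair only if
    # neither of its IDs has been used already.
    chosen, used = [], set()
    for row in df.sort_values('w', ascending=False).itertuples():
        if row.ID_1 not in used and row.ID_2 not in used:
            chosen.append((int(row.ID_1), int(row.ID_2), float(row.w)))
            used.update((row.ID_1, row.ID_2))
    return chosen

print(greedy_pairs(df))   # [(2, 3, 0.7), (1, 4, 0.5)] -> total weight 1.2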
(Apologies if this is actually the wrong forum for this kind of question.)
One approach is to switch from the row-column model you have with data frames to a graph model. There are several Python libraries that can do this, including NetworkX. https://pypi.org/project/networkx/
The idea is that each of your IDs becomes a node in the graph, and each pair becomes an edge carrying the corresponding weight. Once you have that data structure, you can take any given node and find its highest-weight edge, and you can run all sorts of edge-weight-based algorithms.
There is another Python library, https://github.com/pgmpy/pgmpy , which is built on NetworkX and is even probability-aware. It might match what you need even more closely.
For this sort of query a graph library is oodles more efficient than trying to do it with row-column data structures.
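As a concrete sketch of that graph model (IDs as nodes, pairs as weighted edges): NetworkX also ships a maximum weight matching routine, which is exactly the "each ID used at most once, total weight maximized" problem. This uses the example dataframe from the question; recent NetworkX versions return the matching as a set of pairs.
import networkx as nx
import pandas as pd

df = pd.DataFrame({'ID_1': [1, 1, 1, 2],
                   'ID_2': [2, 4, 3, 3],
                   'w':    [0.5, 0.5, 0.6, 0.7]})

# IDs become nodes, pairs become edges weighted by the probability.
G = nx.Graph()
for id1, id2, w in zip(df['ID_1'], df['ID_2'], df['w']):
    G.add_edge(int(id1), int(id2), weight=float(w))

# Maximum weight matching: each node appears in at most one chosen edge,
# and the summed weight of the chosen edges is maximized.
matching = nx.max_weight_matching(G)
print(matching)   # e.g. {(2, 3), (4, 1)} -> total weight 1.2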
Related
I have two groups of coordinates:
{(x1,y1),..(xn,yn)}
{(w1,z1),..(wn,zn)}
and I would like to match each pair in group 2 to the pair in group 1 to which it is closest. My groups are large so the search needs to be efficient.
Any advice on setting this up would be appreciated. Moreover, if I instead had 2 groups with Group 1 = {(x1,y1,z1),..(xn,yn,zn)} and Group 2 = {(u1,v1, w1),..(un,vn,wn)}, how will my answer differ? Also, considering that my groups are too big to store on a computer, then, any suggestions for overcoming this issue would be appreciated.
You can use a KDTree: this data structure lets you find the nearest neighbor efficiently, significantly reducing the number of comparisons. The "KD" stands for "k-dimensional", meaning it can handle data in an arbitrary number of dimensions (to answer your question about three dimensions).
You can build a tree from one of the lists and then, for each element of the other list, query for its nearest element. SciPy provides a k-d tree implementation.
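A minimal sketch using scipy.spatial.cKDTree, with made-up random coordinates (swap in arrays of shape (n, 3) for the three-dimensional case):
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
group1 = rng.random((1000, 2))   # {(x_i, y_i)}
group2 = rng.random((1000, 2))   # {(w_i, z_i)}

tree = cKDTree(group1)                 # build the tree once on group 1
dist, idx = tree.query(group2, k=1)    # nearest group-1 point for every group-2 point

# group2[i] is closest to group1[idx[i]], at distance dist[i]
print(idx[:5], dist[:5])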
I have 50 products. For each product, I want to identify the following four related products using similarity measures.
1 related the most
2 partially related
1 not related
I want to compare the ranked list generated by my model (predicted) with the ranked list specified by the domain experts (ground truth).
Through reading, I found that I may use rank-correlation approaches such as Kendall's tau or Spearman's rho to compare the ranked lists. However, I am not sure whether these approaches are suitable, as my number of samples is low (4). Please correct me if I am wrong.
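For what it's worth, a minimal sketch of what those rank correlations look like on two length-4 lists, using scipy.stats (the expert ranking here is made up, and with only 4 items the associated p-values mean very little):
from scipy.stats import kendalltau, spearmanr

y_pred = [3, 2, 1, 0]          # model's ranking of the four related products
y_true = [3, 1, 2, 0]          # hypothetical expert ranking of the same products

tau, p_tau = kendalltau(y_true, y_pred)
rho, p_rho = spearmanr(y_true, y_pred)
print(tau, rho)                # roughly 0.67 and 0.8 for these made-up ranks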
Another approach is to use Jaccard similarity (set intersection) to quantify the similarity between two ranked lists. Then I may plot a histogram of the values in setbased_list (see below).
from sklearn.metrics import jaccard_similarity_score   # removed in newer scikit-learn in favour of jaccard_score

setbased_list = []
for index, row in evaluate.iterrows():
    d = row['Id']                                       # product id (not used below)
    y_pred = [3, 2, 1, 0]                               # model ranking
    y_true = [row['A'], row['B'], row['C'], row['D']]   # expert ranking
    sim = jaccard_similarity_score(y_true, y_pred)
    setbased_list.append(sim)
Is my approach to the problem above correct?
What are other approaches that I may use if I want to take into consideration the positions of elements in the list (weight-based)?
From the way you have described the problem, it sounds as if you might as well just assign an arbitrary score for each item on your list - e.g. 3 points for the same item at the same rank as on the 'training' list, 1 point for the same item but at a different rank, or something like that.
I'm not clear on the role of the 'not related' item though - are the other 45 items all equally 'not related' to the target item and if so does it matter which one you choose? Perhaps you need to take points away from the score if the 'not related' item appears in one of the 'related' positions? That subtlety might not be captured by a standard nonparametric correlation measure.
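A small sketch of such an ad-hoc score; the 3/1 point values and the -1 penalty are arbitrary choices, and I'm assuming each list holds the four product IDs in ranked order with the 'not related' item in the last slot:
def list_score(predicted, truth):
    # predicted/truth: four product IDs in ranked order, 'not related' item last
    score = 0
    for pos, item in enumerate(predicted):
        if item == truth[pos]:
            score += 3                     # same item, same position
        elif item in truth:
            score += 1                     # same item, different position
    # penalize the 'not related' item showing up in a 'related' slot
    if truth[-1] in predicted[:-1]:
        score -= 1
    return score

print(list_score(['p7', 'p3', 'p9', 'p1'], ['p7', 'p9', 'p3', 'p1']))  # 3 + 1 + 1 + 3 = 8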
If it's important that you use a standard, statistically based measure for some reason then you are probably better off asking on Cross Validated.
Let's suppose you have two existing dictionaries A and B.
Suppose you have already chosen an initial item from each, with values A1 = 1.0 and B1 = 2.0 respectively. Is there any way to find two other existing items in A and B whose values (call them A2 and B2) differ from A1 and B1, and that minimize the value (A2-A1)**2 + (B2-B1)**2?
The number of items in the dictionary is unfixed and could exceed 100,000.
Edit - this is important: the keys for A and B are the same, but the values corresponding to those keys in A and B are different. A particular choice of key yields an ordered pair (A1, B1) that is different from any other possible ordered pair (A2, B2); different keys have different ordered pairs. For example, both A and B will have the key (3,4), which yields the value 1.0 in dict A and 2.0 in dict B. This one key is then compared against every other possible key to find the other ordered pair (i.e. both the key and the values of the items in A and B) that minimizes the squared differences between them.
You'll need a specialized data structure, not a standard Python dictionary. Look up quad-tree or k-d tree. You are effectively minimizing the Euclidean distance between two points (your objective function is just a square root away from Euclidean distance, and your dictionary A is storing the x-coordinates, B the y-coordinates). Computational-geometry people have been studying this for years.
Well, maybe I am misreading your question and making it harder than it is. Are you saying that you can pick any value from A and any value from B, regardless of whether their keys are the same? For instance, the pick from A could be K:V (3,4):2.0, and the pick from B could be (5,6):3.0? Or does it have to be (3,4):2.0 from A and (3,4):6.0 from B? If the former, the problem is easy: just run through the values from A and find the closest to A1; then run through the values from B and find the closest to B1. If the latter, my first paragraph was the right answer.
Your comment says that the harder problem is the one you want to solve, so here is a little more. Sedgewick's slides explain how the static grid, the 2d-tree, and the quad-tree work. http://algs4.cs.princeton.edu/lectures/99GeometricSearch.pdf . Slides 15 through 29 explain mainly the 2d-tree, with 27 through 29 covering the solution to the nearest-neighbor problem. Since you have the constraint that the point the algorithm finds must share neither x- nor y-coordinate with the query point, you might have to implement the algorithm yourself or modify an existing implementation. One alternative strategy is to use a kNN data structure (k nearest neighbors, as opposed to the single nearest neighbor), experiment with k, and hope that your chosen k will always be large enough to find at least one neighbor that meets your constraint.
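A rough sketch of that kNN fallback with scipy.spatial.cKDTree; the dictionaries here are made-up stand-ins for A and B, and k=10 is an arbitrary starting guess:
import numpy as np
from scipy.spatial import cKDTree

# A and B share keys; A holds the x-coordinates, B the y-coordinates (toy data).
A = {('p', i): float(i) for i in range(1000)}
B = {('p', i): float((7 * i) % 1000) for i in range(1000)}

keys = list(A)
pts = np.array([(A[k], B[k]) for k in keys])
tree = cKDTree(pts)

def closest_with_constraint(key, k=10):
    # Nearest other point that shares neither x- nor y-coordinate with `key`'s point.
    x1, y1 = A[key], B[key]
    dists, idxs = tree.query((x1, y1), k=k)
    for dist, i in zip(dists, idxs):
        if pts[i][0] != x1 and pts[i][1] != y1:
            return keys[i], float(dist)    # first acceptable neighbor in distance order
    return None, None                      # nothing qualified: retry with a larger k

print(closest_with_constraint(('p', 3)))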
So I have a dictionary with key: value(tuple) entries, something like {"name": (4,5), ...}, where (4,5) represents two categories (cat1, cat2). Given a maximum allowed total for the second category, I would like to find the optimal combination of dictionary entries such that the first category's total is maximized or minimized.
For example, if maxCat2 = 15, I want to find some combination of entries from the dictionary such that, when I add the cat2 values of the chosen entries together, I stay under 15. There may be many such combinations. Of these possibilities, I would like to pick the one whose summed cat1 values are larger than those of any other possibility.
I thought about writing an algorithm to enumerate every subset of the entries in the dictionary, check whether each one meets the maxCat2 criterion, and then see which of those gives me the largest total cat1 value. With 20 entries that is already 2^20 subsets (and 20! orderings if I were to enumerate permutations), which is a very large number. Is there anything that I can do to avoid this? Thanks.
As Jochen Ritzel pointed out, this can be seen as an instance of the knapsack problem.
Typically, you have a set of objects that have both "weight" (the "second category", in your example) and "value" (or "cost", if it is a minimization problem).
The problem consists of picking a subset of the objects such that the sum of their "values" is maximized/minimized, subject to the constraint that the sum of their weights cannot exceed a specified maximum.
Though the problem is NP-hard in general, if the maximum total weight W is an integer, there is a pseudo-polynomial time solution, O(nW), using dynamic programming or memoization.
Very broadly, the idea is to define a table of values C(i, j), where
C(i, j) denotes the maximum sum of "values" attainable considering only the first i objects, with the total weight of the chosen subset not exceeding j.
There are two possible choices when computing C(i, j):
either element i is included in the subset, and then
C(i, j) = value_i + C(i-1, j - weight_i)
or element i is not in the chosen subset, so
C(i, j) = C(i-1, j)
The maximum of the two is taken.
If n is the number of elements and W is the maximum total weight, then the answer ends up in C(n, W).
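A minimal sketch of that recurrence in Python, assuming integer cat2 weights (the example dictionary at the bottom is made up):
def knapsack(items, max_weight):
    # items: dict mapping name -> (value, weight). Returns (best value, chosen names).
    names = list(items)
    n = len(names)
    # C[i][j]: best value using the first i items with total weight <= j
    C = [[0] * (max_weight + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = items[names[i - 1]]
        for j in range(max_weight + 1):
            C[i][j] = C[i - 1][j]                          # skip item i
            if weight <= j:                                # or take it, if it fits
                C[i][j] = max(C[i][j], value + C[i - 1][j - weight])
    # Walk back through the table to recover which items were taken.
    chosen, j = [], max_weight
    for i in range(n, 0, -1):
        if C[i][j] != C[i - 1][j]:
            value, weight = items[names[i - 1]]
            chosen.append(names[i - 1])
            j -= weight
    return C[n][max_weight], chosen

data = {"a": (4, 5), "b": (10, 9), "c": (3, 2), "d": (7, 6)}   # name: (cat1, cat2)
print(knapsack(data, 15))   # (17, ['d', 'b']) -> cat1 total 17 with cat2 total 15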
Company 1 has this vector:
['books','video','photography','food','toothpaste','burgers'] ... ...
Company 2 has this vector:
['video','processor','photography','LCD','power supply', 'books'] ... ...
Suppose this is a frequency distribution (I could make it a tuple but too much to type).
As you can see...these vectors have things that overlap. "video" and "photography" seem to be "similar" between two vectors due to the fact that they are in similar positions. And..."books" is obviously a strong point for company 1.
Ordering and positioning does matter, as this is a frequency distribution.
What algorithms could you use to play around with this? What algorithms could you use that could provide valuable data for these companies, using these vectors?
I am new to text-mining and information-retrieval. Could someone guide me about those topics in relation to this question?
If position is very important, as you emphasize, then the crucial metric will be based on the difference of positions between the same items in the different vectors (you can, for example, sum the absolute values of the differences, or their squares). The big issue that needs to be solved is: how much to weigh an item that's present (say it's the N-th one) in one vector and completely absent in the other. Is that a relatively minor issue -- as if the missing item were actually present right after the actual ones, for example -- or a really, really big deal? That's impossible to say without more understanding of the actual application area. You can try various ways to deal with this issue and see what results they give on example cases you care about!
For example, suppose "a missing item is roughly the same as if it were present, right after the actual ones". Then, you can preprocess each input vector into a dict mapping item to position (crucial optimization if you have to compare many pairs of input vectors!):
def makedict(avector):
    # map each item to its position in the vector
    return dict((item, i) for i, item in enumerate(avector))
and then, to compare two such dicts:
def comparedicts(d1, d2):
    allitems = set(d1) | set(d2)
    # a missing item is treated as if it sat just past the end of its vector
    distances = [d1.get(x, len(d1)) - d2.get(x, len(d2)) for x in allitems]
    return sum(d * d for d in distances)
(or, abs(d) instead of the squaring in the last statement). To make missing items weigh more (make dicts, i.e. vectors, be considered further away), you could use twice the lengths instead of just the lengths, or some large constant such as 100, in an otherwise similarly structured program.
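For instance, applying the two functions above to the example vectors from the question:
company1 = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
company2 = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']

d1, d2 = makedict(company1), makedict(company2)
print(comparedicts(d1, d2))   # 78 -- a lower score means the assortments are more alike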
I would suggest a book called Programming Collective Intelligence. It's a very nice book on how to retrieve information from simple data like this. Code examples are included (in Python :)
Edit:
Just replying to gbjbaanb: This is Python!
>>> a = ['books','video','photography','food','toothpaste','burgers']
>>> b = ['video','processor','photography','LCD','power supply', 'books']
>>> a = set(a)
>>> b = set(b)
>>> a.intersection(b)
set(['photography', 'books', 'video'])
>>> b.intersection(a)
set(['photography', 'books', 'video'])
>>> b.difference(a)
set(['LCD', 'power supply', 'processor'])
>>> a.difference(b)
set(['food', 'toothpaste', 'burgers'])
Take a look at Hamming Distance
As mbg mentioned, the Hamming distance is a good start. It basically assigns one bit per possible item, indicating whether a company carries it.
E.g. toothpaste is 1 for company A, but 0 for company B. You then count the bits that differ between the two companies. The Jaccard coefficient is closely related to this.
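A small sketch of that bit-vector view, using the union of the two example lists as the item universe:
a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']

universe = sorted(set(a) | set(b))
bits_a = [int(item in a) for item in universe]   # 1 if company A carries the item
bits_b = [int(item in b) for item in universe]

hamming = sum(x != y for x, y in zip(bits_a, bits_b))    # items carried by only one company
jaccard = len(set(a) & set(b)) / len(set(a) | set(b))    # shared items / all items

print(hamming, jaccard)   # 6 and 3/9 = 0.333...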
Hamming distance will actually not be able to capture similarity between things like "video" and "photography". Obviously, a company that sells one also sells the other with higher probability than a company that sells toothpaste.
For that, you can use techniques like LSI (which is also used for dimensionality reduction) or factorial codes (e.g. neural-network approaches such as Restricted Boltzmann Machines, autoencoders, or predictability minimization) to get more compact representations which you can then compare using the Euclidean distance.
Pick a rank score for each entry (higher rank is better) and take the sum of geometric means over the matching items:
for two vectors,
sum(sqrt(x_i * y_i))   # over the items present in both vectors
The rank scores within each vector should sum to the same total for every vector (preferably 1).
That way you can also compare more than 2 vectors.
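A rough reading of this suggestion in Python; the particular rank-to-score mapping (earlier position gets a higher score, scores normalized to sum to 1) is my own assumption:
from math import sqrt

def rank_scores(vector):
    # earlier position -> higher score, normalized so the scores sum to 1
    n = len(vector)
    total = n * (n + 1) / 2
    return {item: (n - i) / total for i, item in enumerate(vector)}

def similarity(v1, v2):
    s1, s2 = rank_scores(v1), rank_scores(v2)
    # sum of geometric means over the items the two vectors share
    return sum(sqrt(s1[x] * s2[x]) for x in set(s1) & set(s2))

a = ['books', 'video', 'photography', 'food', 'toothpaste', 'burgers']
b = ['video', 'processor', 'photography', 'LCD', 'power supply', 'books']
print(similarity(a, b))   # 1.0 would mean identical rankings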
If you apply ikkebr's method you can find how similar a is to b.
In that case just use
sum(b[item] for item in set(b).intersection(a))   # assuming b maps each item to its frequency
You could use the std::set_intersection algorithm. The two ranges must be sorted first (use std::sort); then pass in the begin/end iterators of both ranges plus an output iterator, and the common elements will be inserted into the output collection. There are a few other algorithms that operate similarly.