Trying to match an exact subset in Python

I'm trying to see if a set in python contains the elements of another set. I've tried to use set comparison but the problem is I need to be able to recognise only an exact match of elements. For example a subset of [3,3] will match a set of [3,1,2] when I want it to only match to [3,3,1], or any set variant with two threes.
I am iterating through all possible variants of a 3 element set using numbers 0-4, trying to see which ones contain the [3,3] set. Should I be using sets or is it better to use a list? Any ideas on how to do this?
Cheers

Sets cannot contain duplicate elements. You can use a list, or a dict where the value for each key is the number of times the key occurs in your multiset.
Something like:
d1 = {3: 2, 1: 1}
d2 = {3: 2}
# True when every key in d2 occurs at least as often in d1
all(d1.get(k, 0) - v >= 0 for (k, v) in d2.items())

Assuming that by "set" you mean list, something like this should work:
def contains(superset, subset):
    for elem in set(subset):
        if superset.count(elem) < subset.count(elem):
            return False
    return True
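For a self-contained check against the examples from the question (the helper is restated here so the snippet runs on its own):

```python
# Multiset containment check: does superset contain every element of
# subset at least as many times as it appears in subset?
def contains(superset, subset):
    for elem in set(subset):
        if superset.count(elem) < subset.count(elem):
            return False
    return True

print(contains([3, 3, 1], [3, 3]))  # True: two threes are present
print(contains([3, 1, 2], [3, 3]))  # False: only one three
```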

If you want to have duplicates in your sets (making them really multisets or "bags", rather than proper sets), you should use collections.Counter. It supports set operations (&, +, -) with appropriate multiset semantics.
Testing if one multiset a is a subset of another multiset b can be done with a == a & b:
from collections import Counter
a = Counter([3,3])
b = Counter([3,1,2])
print(a == a & b) # prints False, since a is not a subset of b
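Applied to the question's own numbers, wrapped in a small helper (a sketch; the `a == a & b` trick works on any Python 3 version, and Counter also gained a direct `<=` comparison in Python 3.10):

```python
from collections import Counter

def is_submultiset(sub, sup):
    """True when every element of `sub` appears in `sup` at least as often."""
    sub, sup = Counter(sub), Counter(sup)
    return sub == sub & sup  # intersection keeps the minimum count per key

print(is_submultiset([3, 3], [3, 3, 1]))  # True: both threes are present
print(is_submultiset([3, 3], [3, 1, 2]))  # False: only one three
```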

Related

Compare two lists with custom functions and custom comparison

I have two lists
a = [1,2,3]
b = [2,3,4,5]
and two custom functions F1(a[i]) and F2(b[j]), which take elements from lists a and b and return objects, say A and B. There is also a custom comparison function diff(A,B), which returns False if the objects are the same in some sense and True if they are different.
I would like to compare these two lists in a pythonic way to see if the lists are the same. In other words, that all objects A generated from list a have at least one equal object B generated from list b. For example if the outcome is the following then the lists are the same:
(diff(F1(1),F2(4)) or diff(F1(1),F2(5)) or diff(F1(2),F2(3)) or diff(F1(3),F2(2))) is False
For the function sorted there is a key function. Are there any comparison functions in Python that can take a custom function in this situation?
The only solution that I see is to loop through all elements in a and loop through all elements in b to check them element by element. But then if I want to extend the functionality this would require quite some development. If there is a way to use standard Python functions this would be very much appreciated.
all(any(diff(A, B) is False for B in [F2(j) for j in b]) for A in [F1(i) for i in a])
Honestly I just typed this quickly using my Python auto-pilot but it would look something like this I guess.
Probably the best you can get is to first convert all entries of both lists, then iterate over them and check whether each converted element of a has a match in b, and whether that holds for all of them. So something like
af = [F1(i) for i in a]
bf = [F2(i) for i in b]
is_equal = all(any(not diff(x, y) for y in bf) for x in af)
all stops as soon as it finds the first value that is falsy, and any stops as soon as it finds the first value that is truthy.
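A runnable sketch of that pattern, with stand-in F1/F2/diff (these concrete functions are invented for illustration; the question leaves them abstract):

```python
# Stand-in conversion and comparison functions -- purely illustrative.
F1 = lambda x: x * 10          # pretend this builds object A from a's element
F2 = lambda x: (x - 1) * 10    # pretend this builds object B from b's element
diff = lambda A, B: A != B     # False when the objects count as "the same"

a = [1, 2, 3]
b = [2, 3, 4, 5]

# Convert once up front, then check every converted element of a
# against the converted elements of b.
af = [F1(i) for i in a]
bf = [F2(j) for j in b]
is_equal = all(any(not diff(x, y) for y in bf) for x in af)
print(is_equal)  # True with these stand-ins, since F1(k) == F2(k + 1)
```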

numpy.unique gives wrong output for list of sets

I have a list of sets given by,
sets1 = [{1},{2},{1}]
When I find the unique elements in this list using numpy's unique, I get
np.unique(sets1)
Out[18]: array([{1}, {2}, {1}], dtype=object)
As can be seen, the result is wrong, as {1} is repeated in the output.
When I change the order in the input by making similar elements adjacent, this doesn't happen.
sets2 = [{1},{1},{2}]
np.unique(sets2)
Out[21]: array([{1}, {2}], dtype=object)
Why does this occur? Or is there something wrong in the way I have done?
What happens here is that np.unique first sorts its input (internally it relies on a helper, _unique1d, that calls the .sort() method) and then drops adjacent duplicates.
Now, sorting a list of sets that each contain a single integer will not, in general, order the list by the value of the integer in each set, because sets compare with subset semantics rather than by their contents. So we will have (and that is not what we want):
sets = [{1}, {2}, {1}]
sets.sort()
print(sets)
# > [{1}, {2}, {1}]
# i.e. the list has not been "sorted" like we want it to
Now, as you have pointed out, if the list of sets is already ordered in the way you want, np.unique will work (since you would have sorted the list beforehand).
One specific solution (though, please be aware that it will only work for a list of sets that each contain a single integer) would then be:
np.unique(sorted(sets, key=lambda x: next(iter(x))))
That is because set is an unhashable type, and two equal set literals are still two distinct objects:
{1} is {1} # False: two separate objects
You can use collections.Counter if you convert each set to a tuple, like below
from collections import Counter
sets1 = [{1},{2},{1}]
Counter([tuple(a) for a in sets1]) # Counter({(1,): 2, (2,): 1})
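Another way to deduplicate a list of sets without numpy (a sketch: frozenset is the hashable twin of set, so it sidesteps the unhashability problem; unlike tuple(s), it also works for multi-element sets regardless of iteration order):

```python
sets1 = [{1}, {2}, {1}]

# frozenset is immutable and hashable, so the sets can be used as dict keys;
# dict.fromkeys keeps first-seen order (guaranteed since Python 3.7).
unique = [set(fs) for fs in dict.fromkeys(map(frozenset, sets1))]
print(unique)  # [{1}, {2}]
```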

Why is the output "cbe" rather than "bce" in this python program?

getDifference=lambda string1, string2: reduce((lambda character1, character2: character1+character2), (set(string1)-set(string2)))
print getDifference('abcde','adf')
In the first line, I defined a lambda expression that finds the difference between string1 and string2. I assume the output should be "bce", but it is "cbe", why?
A set is an unordered collection of unique elements - so the order of the characters is not kept through the sets operation. Check here for more:
http://docs.python.org/2/tutorial/datastructures.html#sets
Python sets don't preserve ordering. To preserve the order, you can use a list comprehension:
def diff(s1, s2):
    return "".join([c for c in s1 if c not in s2])

diff('abcde','adf') # 'bce'
Many set implementations use hashing, which mixes up the order of the elements as a side effect. Ordered set implementations, such as tree-based sets, do exist, however.

Pythonic way of saying "if all of the elements in list 1 also exist in list 2"

I want to return true from the if statement only if all of the elements from list 1 also exist in list 2 (list 2 is a superset of list 1). What is the most pythonic way of writing this?
You can use set operations:
if set(list1) <= set(list2):
    # ...
Note that the comparison itself is fast, but converting the lists to sets might not (depends on the size of the lists).
Converting to a set also removes any duplicate. So if you have duplicate elements and want to ensure that they are also duplicates in the other list, using sets will not work.
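When duplicates do matter, collections.Counter gives a containment check with multiset semantics (a sketch; `contains_all` is a name invented here for illustration):

```python
from collections import Counter

def contains_all(list1, list2):
    """True when list2 has every element of list1, counting duplicates."""
    need = Counter(list1)
    have = Counter(list2)
    return all(have[k] >= v for k, v in need.items())

print(contains_all([1, 1, 2], [1, 1, 2, 3]))  # True
print(contains_all([1, 1, 2], [1, 2, 3]))     # False: only one 1 in list2
```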
You can use built-in all() function:
if all(x in sLVals for x in fLVals):
    # do something
If you are working with sets anyway, you can also take a look at the difference method, which as far as I know is quite fast:
if set(fLVals).difference(sLVals):
    # there is a difference
else:
    # no difference
Either set(L2).issuperset(L1) or all(x in L2 for x in L1).
This one came straight out of good folks at MIT:
from functools import reduce  # reduce is a builtin in Python 2, but lives in functools in Python 3
from operator import and_
reduce(and_, [x in b for x in a])
I tried to find the "readings.pdf" they had posted for the 6.01 class about a year ago...but I can't find it anymore.
Head to my profile and send me an email, and I'll send you the .pdf where I got this example. It's a very good book, but it doesn't seem to be a part of the class anymore.

Compare DB row values efficiently

I want to loop through a database of documents and calculate a pairwise comparison score.
A simplistic, naive method would nest a loop within another loop. This would result in the program comparing documents twice and also comparing each document to itself.
Is there a name for the algorithm for doing this task efficiently?
Is there a name for this approach?
Thanks.
Assume all items have a number ItemNumber
Simple solution -- always have the second item's ItemNumber greater than the first item's.
eg
for (firstitem = 1 to maxitemnumber)
    for (seconditem = firstitem+1 to maxitemnumber)
        compare(firstitem, seconditem)
visual note: if you think of the compare as a matrix (item number of one on one axis item of the other on the other axis) this looks at one of the triangles.
........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.
I don't think its complicated enough to qualify for a name.
You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.
Unique pairings:
SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id
Potentially there are a lot of ways in which the comparison operation can be used to generate data summaries and therefore identify potentially similar items - for single words the soundex is an obvious choice - however you don't say what your comparison metric is.
C.
You can keep track of which documents you have already compared, e.g. (with numbers ;))
compared = set()
for i in [1, 2, 3]:
    for j in [1, 2, 3]:
        pair = frozenset((i, j))
        if i != j and pair not in compared:
            compared.add(pair)
            compare(i, j)
Another idea would be to create the combinations of documents first and iterate over them. But in order to generate them, you have to iterate over both lists, and then you iterate over the result list again, so I don't think that it has any advantage.
Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:
docs = [1, 2, 3]
n = len(docs)
for i in range(n):
    for j in range(i + 1, n):
        compare(docs[i], docs[j])
Something like this?
src = [1,2,3]
for i, x in enumerate(src):
    for y in src[i + 1:]:
        compare(x, y)
Or you might wish to generate a list of pairs instead:
pairs = [(x, y) for i, x in enumerate(src) for y in src[i + 1:]]
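The standard-library spelling of this pattern is itertools.combinations, which yields each unordered pair exactly once and never pairs an item with itself:

```python
from itertools import combinations

src = [1, 2, 3]
# combinations(src, 2) walks the upper triangle of the comparison matrix.
pairs = list(combinations(src, 2))
print(pairs)  # [(1, 2), (1, 3), (2, 3)]
```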
