Comparing sets that contain nan in python - python

I'm trying to compare two sets in python that contain nan but struggling to do so because {float('nan')} != {float('nan')}. For example:
s1 = {float('nan'), 1}
s2 = {float('nan'), 1, 2}
assert set.issubset(s1, s2)
And I get an assertion error. How can I handle this?

One approach: identity is tested before equality (see here in the docs, for example), so it'd work if you use the same nan:
>>> nan = float("nan")
>>> s1 = {nan, 1}
>>> s2 = {nan, 1, 2}
>>> set.issubset(s1, s2)
True
even though
>>> s1 = {float("nan"), 1}
>>> s2 = {float("nan"), 1, 2}
>>> set.issubset(s1, s2)
False
Working with nans is awkward enough that I'd try to avoid putting them in sets and switch to a different canonical form. But you could always just make sure it's the same one:
>>> import math
>>> def one_nan(x, nan=float("nan")):
...     return nan if math.isnan(x) else x
...
>>> set.issubset(set(map(one_nan, s1)), set(map(one_nan, s2)))
True
or a thousand variants on the same. (I sometimes use x != x as a shortcut for nan-detection but it's probably a good idea to be explicit here.)

You could also write a simple function for this. Note that float('nan') == float('nan') is False for nan; to check if any element is nan, we just have to compare it with itself.
def is_subset(s1, s2):
    no_nan_set = lambda s: {x for x in s if x == x}
    s1_nan, s2_nan = no_nan_set(s1), no_nan_set(s2)
    if s1_nan != s1 and s2_nan != s2:
        return s1_nan.issubset(s2_nan)
    elif s1_nan == s1 and s2_nan == s2:
        return s1.issubset(s2)
    else:
        return False
You can simplify the if-elif-else block
def is_subset(s1, s2):
    no_nan_set = lambda s: {x for x in s if x == x}
    s1_nan, s2_nan = no_nan_set(s1), no_nan_set(s2)
    return (s1_nan != s1 and s2_nan != s2 and s1_nan.issubset(s2_nan)) \
        or (s1_nan == s1 and s2_nan == s2 and s1.issubset(s2))
Note that this works correctly when either of your sets has two or more nans (because float('nan') != float('nan')), and likewise when the ids of the nans are different. It also works when neither set contains nan; if only one of the sets contains nan, the function returns False.
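One case worth checking against the functions above: when only the second set contains nan (for example {1} against {nan, 1, 2}), both versions return False. If you would rather treat that as a subset, a minimal sketch is below. The names nan_aware_issubset and _is_nan are made up for illustration, and the sketch assumes that all float nans should count as one and the same element.

```python
import math


def _is_nan(x):
    # Hypothetical helper: this sketch only looks at float nans;
    # other types (str, int, ...) are never treated as nan.
    return isinstance(x, float) and math.isnan(x)


def nan_aware_issubset(s1, s2):
    """Subset test that treats every nan as the same element (an assumption)."""
    s1_clean = {x for x in s1 if not _is_nan(x)}
    s2_clean = {x for x in s2 if not _is_nan(x)}
    s1_has_nan = len(s1_clean) != len(s1)
    s2_has_nan = len(s2_clean) != len(s2)
    # A nan in s1 must be matched by a nan in s2; a nan only in s2 is harmless.
    if s1_has_nan and not s2_has_nan:
        return False
    return s1_clean <= s2_clean
```

Whether a nan that appears only in the first set should fail the test is a design choice; here it does, which matches the behaviour of the functions above.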

Create temporary sets with all nan values removed, and compare those instead. Afterwards, handle the nan comparison separately. For example, you could check if both of the original sets contain nan.
Even if you could perform the comparison for your sets without the assertion exception, float('nan') == float('nan') will return False so there is little value gained from this set comparison (it will invalidate the rest of the comparison). You can check this behavior by printing set.issubset.
s1 = frozenset({float('nan'), 1})
s2 = frozenset({float('nan'), 1, 2})
print(frozenset.issubset(s1, s2))
which prints False.
You can generate the temporary sets as follows (this needs import math):
s3 = {value for value in s1 if not math.isnan(value)}
(repeat for each temporary set as needed)

Related

How to return False when using issubset and an empty set

When I have two sets e.g.
s1 = set()
s2 = set(['somestring'])
and I do
print(s1.issubset(s2))
it returns True; so apparently, an empty set is always a subset of another set.
For my analysis, it should actually return False and I am wondering about the best way to do this. I can write a function like this:
def check_set(s1, s2):
    if s1 and s1.issubset(s2):
        return True
    return False
which then indeed returns False for the example above. Is there any better way of doing this?
I would do that like this:
s1 <= s2 if s1 else False
It should be faster, because it uses the built-in operators supported by sets rather than using more expensive function calls and attribute lookups. It's logically equivalent.
You can take advantage of how Python evaluates the truthiness of an object plus how it short-circuits boolean and expressions with:
bool(s1) and s1 <= s2
Essentially this means: if s1 is something not empty AND it's a subset of s2
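As a quick check of the short-circuit form, here is a small sketch; the function name is_nonempty_subset is hypothetical.

```python
def is_nonempty_subset(s1, s2):
    # bool(s1) is False for an empty set, so `and` short-circuits
    # and the subset test is never evaluated in that case.
    return bool(s1) and s1 <= s2


print(is_nonempty_subset(set(), {"somestring"}))           # False
print(is_nonempty_subset({"somestring"}, {"somestring"}))  # True
print(is_nonempty_subset({"other"}, {"somestring"}))       # False
```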
Instead of using an if you can force the result to be a bool by doing this:
def check_set(s1, s2):
    return bool(s1 and s1.issubset(s2))
Why not just return the value? That way, you avoid having to write return True or return False.
def check_set(s1, s2):
    return bool(s1 and s1.issubset(s2))
Instead of using an empty set, you could use a set with an empty value:
s1 = set([''])  # or: s1 = set([None])
Then your print statement would work as you expected.

Why is (numpy.nan, 1) == (numpy.nan, 1)?

While numpy.nan is not equal to numpy.nan, and (float('nan'), 1) is not equal to (float('nan'), 1),
(numpy.nan, 1) == (numpy.nan, 1)
What could be the reason?
Does Python first check to see if the ids are identical?
If identity is checked first when comparing items of a tuple, then why isn't it checked when objects are compared directly?
When you do numpy.nan == numpy.nan it's numpy that is deciding whether the condition is true or not. When you compare tuples python is just checking if the tuples have the same objects which they do. You can make numpy have the decision by turning the tuples into numpy arrays.
>>> numpy.array((1, numpy.nan)) == numpy.array((1, numpy.nan))
array([ True, False], dtype=bool)
The reason is that when you use == with numpy objects, you're calling numpy's __eq__() method, which specifically says that nan != nan: mathematically speaking, nan is undetermined (it could be anything), so it makes sense that nan != nan. But when you use == with tuples, you call the tuple's __eq__() method, which doesn't care about mathematics and only cares whether the Python objects are the same or not. In the case of (float('nan'), 1) == (float('nan'), 1) it returns False because each call to float('nan') creates a new object at a different memory location, as you can check by evaluating float('nan') is float('nan').
Container objects are free to define what equality means for them, and for most that means one thing is really, really important:
for x in container:
    assert x in container
So containers typically do an id check before an __eq__ check.
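The effect of the identity check is easy to see with nan. This is a small sketch using only built-in types; the variable name nan is just for illustration.

```python
nan = float("nan")

print(nan == nan)             # False: nan never compares equal to itself
print(nan in [nan])           # True: the list finds the same object by identity
print(float("nan") in [nan])  # False: a different object, and == fails too
print((nan, 1) == (nan, 1))   # True: the same object sits in both tuples
print((float("nan"), 1) == (float("nan"), 1))  # False: two distinct objects
```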
When comparing two objects in a tuple, Python first checks to see if they are the same object.
Note that numpy.nan is numpy.nan, but float('nan') is not float('nan').
In Objects/tupleobject.c, the comparison is carried out like this:
for (i = 0; i < vlen && i < wlen; i++) {
    int k = PyObject_RichCompareBool(vt->ob_item[i],
                                     wt->ob_item[i], Py_EQ);
    if (k < 0)
        return NULL;
    if (!k)
        break;
}
And in PyObject_RichCompareBool, you can see the check for equality:
if (v == w) {
    if (op == Py_EQ)
        return 1;
    else if (op == Py_NE)
        return 0;
}
You can verify this with the following example:
class A(object):
    def __eq__(self, other):
        print("Checking equality with __eq__")
        return True

a1 = A()
a2 = A()
If you try (a1, 1) == (a1, 1) nothing gets printed, while (a1, 1) == (a2, 1) would use __eq__ and print out the message.
Now try a1 == a1 and see if it surprises you ;P
Tuples do check first with identity and then with equality if identity doesn't match.
(float('nan'),) == (float('nan'),)
is False simply because a different object instance is created... if you do instead:
x = float('nan')
print((x,) == (x,))
you will get True: although x == x is False, x is x is True, and the identity check comes first.
numpy.nan is a single shared instance, and that's why nan's usual inequality "doesn't work" inside the tuple.
As a wild guess this "shortcut" of checking identity first is done for performance reasons.

Difference between: IF IN and IF == python

I wanted to know which condition is better to use for the following code:
Here are my two lists:
Matrix = ['kys_q1a1','kys_q1a2','kys_q1a3','kys_q1a4','kys_q1a5','kys_q1a6']
fixedlist = ['kys_q1a2', 'kys_q1a5']
Option 1:
for i, topmember in enumerate(Matrix):
    for fixedcol in fixedlist:
        if topmember in fixedcol:
            print(i)
OR
Option 2:
for i, topmember in enumerate(Matrix):
    for fixedcol in fixedlist:
        if topmember == fixedcol:
            print(i)
I understand that the comparison operator is matching strings, but isn't 'in' doing the same?
Thanks
topmember in fixedcol
tests if the string topmember is contained within fixedcol.
topmember == fixedcol
tests if the string topmember is equal to fixedcol.
So, 'a' in 'ab' would evaluate True. But 'a' == 'ab' would evaluate False.
I wanted to know which condition is better to use.
Since the two variants perform different operations, we cannot answer that. You need to choose the option that does the operation that you require.
Your code could be simplified quite a bit. The second option could be reduced to:
for i, topmember in enumerate(Matrix):
    if topmember in fixedlist:
        print(i)
You could also use a list comprehension to find the matching indices:
[i for i, x in enumerate(Matrix) if x in fixedlist]
If you just have to print the indices rather than store them in a list you can write it like this:
print('\n'.join(str(i) for i, x in enumerate(Matrix) if x in fixedlist))
It's a matter of taste whether you prefer the dense list comprehension one-liner, or the rather more verbose version above.
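If fixedlist grows large, one option is to convert it to a set first, since membership tests on a set take constant time on average while a list is scanned element by element. This is only a sketch; the names matrix, fixed, and indices are chosen for illustration.

```python
matrix = ['kys_q1a1', 'kys_q1a2', 'kys_q1a3', 'kys_q1a4', 'kys_q1a5', 'kys_q1a6']
fixedlist = ['kys_q1a2', 'kys_q1a5']

# Build the set once, then test each name against it.
fixed = set(fixedlist)
indices = [i for i, name in enumerate(matrix) if name in fixed]
print(indices)  # [1, 4]
```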
The in operator is used for membership testing, and the == operator is used for equality testing.
We generally use in for membership testing on sequence objects. It can also test dictionaries, sets, tuples, lists, strings, etc., but it behaves differently depending on the object type.
Dictionary:
It checks whether the key exists.
>>> d = {'key' : 'value'}
>>> 'key' in d
True
>>> 'k' in d
False
>>>
Set:
Under the hood it checks whether the key exists; the set implementation is the same as a dictionary with dummy values.
>>> s = set(range(10))
>>> 1 in s
True
>>>
List and Tuple:
For the list and tuple types, x in y is true if and only if there exists an index i such that x == y[i] is true.
>>> l = range(10)
>>> 3 in l
True
>>>
String:
It checks whether the substring is present inside the string, e.g. x in y is true if and only if x is a substring of y. An equivalent test is y.find(x) != -1.
User-defined data type:
For user-defined classes which define the __contains__() method, x in y is true if and only if y.__contains__(x) is true.
class Person(object):
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __contains__(self, arg):
        # Membership here means "is arg the name of an attribute?"
        return arg in self.__dict__

obj_p = Person('Jeff', 90)
print('Jeff', 'Jeff' in obj_p)
print('age', 'age' in obj_p)
print('name', 'name' in obj_p)
I hope this clarifies how in is used.
Let's rewrite your snippet:
>>> Matrix = ['kys_q1a1','kys_q1a2','kys_q1a3','kys_q1a4','kys_q1a5','kys_q1a6']
>>> fixedlist = ['kys_q1a2', 'kys_q1a5']
>>> for i in fixedlist:
...     print(i, i in Matrix)
...
kys_q1a2 True
kys_q1a5 True
>>>
And finally, let's see some equality tests with ==:
>>> 'a' == 'b'
False
>>> 'a' == 'a'
True
>>> 'a' == 'ab'
False
>>> '' in 'ab' # empty string is treated as a sub-string for any string
True
>>> '' == 'ab' # False as they are having different values
False
>>>
>>> 1 == 'ab'
False
>>> 1 == 1
True
>>>
Use == if you want to match the exact string.

Why is this string comparison returning False? [duplicate]

Possible Duplicate:
String comparison in Python: is vs. ==
algorithm = str(sys.argv[1])
print(algorithm)
print(algorithm is "first")
I'm running it from the command line with the argument first, so why does that code output:
first
False
From the Python documentation:
The operators is and is not test for object identity: x is y is true if and only if x and y are the same object.
This means it doesn't check if the values are the same, but rather checks if they are in the same memory location. For example:
>>> s1 = 'hello everybody'
>>> s2 = 'hello everybody'
>>> s3 = s1
Note the different memory locations:
>>> id(s1)
174699248
>>> id(s2)
174699408
But since s3 is equal to s1, the memory locations are the same:
>>> id(s3)
174699248
When you use the is operator:
>>> s1 is s2
False
>>> s3 is s1
True
>>> s3 is s2
False
But if you use the equality operator:
>>> s1 == s2
True
>>> s2 == s3
True
>>> s3 == s1
True
Edit: just to be confusing, there is an optimisation (in CPython anyway, I'm not sure if it exists in other implementations) which allows short strings to be compared with is:
>>> s4 = 'hello'
>>> s5 = 'hello'
>>> id(s4)
173899104
>>> id(s5)
173899104
>>> s4 is s5
True
Obviously, this is not something you want to rely on. Use the appropriate operator for the job: is if you want to compare identities, and == if you want to compare values.
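To make the difference concrete without relying on implementation details, here is a small sketch. The string built with join stands in for a value that only exists at runtime, such as a command-line argument; sys.intern is the standard-library call that returns the single shared copy of a string. Whether algorithm is expected would be True on its own is up to the interpreter, so the sketch does not test that.

```python
import sys

expected = "first"
algorithm = "".join(["fi", "rst"])  # built at runtime, like a command-line argument

# Value comparison: this is what the question needs.
print(algorithm == expected)  # True

# Identity only becomes reliable after explicitly interning both strings.
print(sys.intern(algorithm) is sys.intern(expected))  # True
```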
You want:
algorithm = str(sys.argv[1])
print(algorithm)
print(algorithm == "first")
is checks for object identity (think memory address).
In your case the objects have the same "value", but are not the same object.
Note that == is weaker than is.
This means that if is returns True, then == will usually also return True (nan is a notable exception), but the reverse is not always true.
Basically, is checks an object's address (identity), not its value. For value comparison, use the == operator.

How to check if characters in a string are alphabetically ordered

I have been trying this code but there is something wrong. I simply want to know if the first string is alphabetical.
def alp(s1):
    s2=sorted(s1)
    if s2 is s1:
        return True
    else:
        return False
This always prints False, and when I say print s1 or s2, it says "NameError: name 's1' is not defined".
is is identity testing which compares the object IDs, == is the equality testing:
In [1]: s1 = "Hello World"
In [2]: s2 = "Hello World"
In [3]: s1 == s2
Out[3]: True
In [4]: s1 is s2
Out[4]: False
Also note that sorted returns a list, so change it to:
if ''.join(s2) == s1:
Or
if ''.join(sorted(s1)) == s1:
You could see this answer and use something which works for any sequence:
all(s1[i] <= s1[i+1] for i in range(len(s1) - 1))
Example:
>>> def alp(s1):
...     return all(s1[i] <= s1[i+1] for i in range(len(s1) - 1))
...
>>> alp("test")
False
>>> alp("abcd")
True
I would do it using iter to nicely get the previous element:
def is_ordered(ss):
    ss_iterable = iter(ss)
    try:
        current_item = next(ss_iterable)
    except StopIteration:
        # Replace the next line to handle the empty string case as desired.
        # This is how *I* would do it, but others would prefer `return True`
        # as indicated in the comments :)
        # I suppose the question is "Is an empty sequence ordered or not?"
        raise ValueError("Undefined result. Cannot accept empty iterable")
    for next_item in ss_iterable:
        if next_item < current_item:
            return False
        current_item = next_item
    return True
This answer has complexity O(n) in the absolute worst case as opposed to the answers which rely on sort which is O(nlogn).
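The same linear-time idea can be written more compactly by pairing each element with its successor. This is only a sketch: the name is_sorted_pairwise is made up, it assumes the input supports slicing (a string, list, or tuple), and it returns True for an empty input, which is one of the two conventions discussed above.

```python
def is_sorted_pairwise(s):
    # zip stops at the shorter input, so each element is compared
    # with the one right after it and nothing else.
    return all(a <= b for a, b in zip(s, s[1:]))


print(is_sorted_pairwise("abcd"))  # True
print(is_sorted_pairwise("test"))  # False
print(is_sorted_pairwise(""))      # True (no out-of-order pair exists)
```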
Make sure that you are comparing strings with strings:
In [8]: s = 'abcdef'
In [9]: s == ''.join(sorted(s))
Out[9]: True
In [10]: s2 = 'zxyw'
In [11]: s2 == ''.join(sorted(s2))
Out[11]: False
If s1 or s2 is a string, sorted will return a list, and you will then be comparing a string to a list. In order to do the comparison you want, using ''.join() will take the list and join all the elements together, essentially creating a string representing the sorted elements.
sorted() returns a list, and you're trying to compare a list to a string, so convert that list to a string first:
In [21]: "abcd"=="".join(sorted("abcd"))
Out[21]: True
In [22]: "qwerty"=="".join(sorted("qwerty"))
Out[22]: False
# comparison of a list and a string is False
In [25]: "abcd"==sorted("abcd")
Out[25]: False
