Comparison of strings in Python [duplicate]

This question already has answers here:
Why does comparing strings using either '==' or 'is' sometimes produce a different result?
(15 answers)
Closed 9 years ago.
I noticed a Python script I was writing was acting squirrelly, and traced it to an infinite loop, where the loop condition was while line is not ''. Running through it in the debugger, it turned out that line was in fact ''. When I changed it to !='' rather than is not '', it worked fine.
Also, is it generally considered better to just use '==' by default, even when comparing int or Boolean values? I've always liked to use 'is' because I find it more aesthetically pleasing and pythonic (which is how I fell into this trap...), but I wonder if it's intended to just be reserved for when you care about finding two objects with the same id.

For all built-in Python objects (like
strings, lists, dicts, functions,
etc.), if x is y, then x==y is also
True.
Not always. NaN is a counterexample. But usually, identity (is) implies equality (==). The converse is not true: Two distinct objects can have the same value.
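A minimal sketch of both directions, using a float NaN and two separate lists; the variable names are only illustrative:

```python
# NaN: the very same object, yet not equal to itself.
nan = float("nan")
nan_same_object = nan is nan      # identity holds
nan_equal = nan == nan            # equality does not (IEEE 754 rule for NaN)

# Two distinct lists with the same contents: equal, but not identical.
first = [1, 2, 3]
second = [1, 2, 3]
lists_equal = first == second
lists_same_object = first is second

print(nan_same_object, nan_equal)       # True False
print(lists_equal, lists_same_object)   # True False
```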
Also, is it generally considered better to just use '==' by default, even
when comparing int or Boolean values?
You use == when comparing values and is when comparing identities.
When comparing ints (or immutable types in general), you pretty much always want the former. There's an optimization that allows small integers to be compared with is, but don't rely on it.
For boolean values, you shouldn't be doing comparisons at all. Instead of:
if x == True:
    # do something
write:
if x:
    # do something
For comparing against None, is None is preferred over == None.
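One reason for that preference can be sketched with a class that overrides __eq__. AlwaysEqual below is a hypothetical class made up for this illustration, not something from the question:

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

equals_none = (obj == None)   # dispatches to AlwaysEqual.__eq__
is_none = (obj is None)       # identity check, cannot be overridden

print(equals_none)  # True, although obj is clearly not None
print(is_none)      # False
```

Because is cannot be customized by the object being tested, is None gives the same answer regardless of how __eq__ is defined.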
I've always liked to use 'is' because
I find it more aesthetically pleasing
and pythonic (which is how I fell into
this trap...), but I wonder if it's
intended to just be reserved for when
you care about finding two objects
with the same id.
Yes, that's exactly what it's for.
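Applied to the loop from the question, a hedged sketch might look like the following. The io.StringIO object is only a stand-in for whatever the original script was reading from, which the question does not show:

```python
import io

# Stand-in for a file; readline() returns '' at end of input.
stream = io.StringIO("first\nsecond\n")

lines = []
line = stream.readline()
while line != "":            # value comparison; `while line:` also works
    lines.append(line.rstrip("\n"))
    line = stream.readline()

print(lines)  # ['first', 'second']
```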

I would like to show a small example of how is and == behave with immutable types. Try this:
>>> a = 19998989890
>>> b = 19998989889 + 1
>>> a is b
False
>>> a == b
True
is checks whether two names refer to the same object in memory; == compares their values. For example, you can see that CPython caches small integers:
>>> c = 1
>>> b = 1
>>> b is c
True
You should use == when comparing values and is when comparing identities. (Also, from an English point of view, "equals" is different from "is".)

The logic is not flawed. The statement
if x is y then x==y is also True
should never be read to mean
if x==y then x is y
It is a logical error on the part of the reader to assume that the converse of a logic statement is true. See http://en.wikipedia.org/wiki/Converse_(logic)

See This question
Your logic in reading
For all built-in Python objects (like
strings, lists, dicts, functions,
etc.), if x is y, then x==y is also
True.
is slightly flawed.
If is yields True then == will be True as well, but this does NOT apply in reverse: == may yield True while is yields False.

Related

use custom compare function in set equality python

I want to use a custom comparison function when building a set, so that I can take advantage of the efficiency of the set algorithm. Technically I could write a double for loop to compare the two lists (keep, original), but I thought that might not be efficient.
For example:
textlist = ["ravi is happy", "happy ravi is", "is happy ravi", "is ravi happy"]
set() should return only one of these elements, because the compare function would return True if the similarity between the compared items is >= threshold.
In Python. Thanks.
P.S.
The real trick is that I'd like to use my string_compare(t1,t2): Float to do the comparison rather than hashing and equality...
P.P.S.
C# has a similar function:
How to remove similar string from a list?

I think this is what you were looking for:
{' '.join(sorted(sentence.split())) for sentence in textlist}
This sorts the words within each sentence, so sentences that contain the same words in a different order become identical strings, and a Python set then keeps only one of them.
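Run against the list from the question, the idea can be sketched as follows. The helper name word_key and the first_seen dictionary are assumptions added for illustration:

```python
textlist = ["ravi is happy", "happy ravi is", "is happy ravi", "is ravi happy"]


def word_key(sentence):
    """Hypothetical helper: an order-independent key for a sentence."""
    return " ".join(sorted(sentence.split()))


normalized = {word_key(sentence) for sentence in textlist}
print(normalized)  # {'happy is ravi'}

# Keep the first original sentence seen for each key.
first_seen = {}
for sentence in textlist:
    first_seen.setdefault(word_key(sentence), sentence)
print(list(first_seen.values()))  # ['ravi is happy']
```

Note that this only merges sentences made of exactly the same words; it does not implement a similarity threshold like the string_compare function mentioned in the question.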

What is the most efficient way of comparing 2 strings in Python

I'm looking for the most efficient way to compare two strings, and I'm not sure which is better: == or in. Or is there some other way to do it that is more efficient than either of these?
Edit: I'm trying to check for equality

They do different things.
== tests for equality:
"tomato" == "tomato" # True
"potato" == "tomato" # False
"mat" == "tomato" # False
in tests for a substring, and can be considered a (probably) more efficient version of str.find() != -1:
"tomato" in "tomato" # True
"potato" in "tomato" # False
"mat" in "tomato" # True <-- this is different from above
In both cases, they're the most efficient ways available of doing what they do. If you're using them to check whether two strings are actually equal, then of course strA == strB is faster than (strA in strB) and (strB in strA).
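The three checks can be put side by side in a small sketch; the names haystack, needle and results are made up for this example:

```python
haystack = "tomato"
results = {}

for needle in ("tomato", "potato", "mat"):
    results[needle] = {
        "equal": needle == haystack,             # exact equality
        "contained": needle in haystack,         # substring test
        "found": haystack.find(needle) != -1,    # same answer as `in`
    }
    print(needle, results[needle])
```

Only "mat" gives different answers for equality and containment, which is why in is not a substitute for == when you mean equality.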

Please define "comparing".
If you want to know if 2 strings are equal, == is the simplest way.
If you want to know if 1 string contains another, in is the simplest way.
If you want to know how much they overlap, considering gaps, you need complicated algorithms. How about a thick book on algorithms? (This is similar to comparing genetic sequences. I think a book on Bioinformatics algorithms would be very useful too. Anyhow, this case is way too complicated for Stack Overflow.)
EDIT:
For equality stick with "==". It's in Python to do its job.

== in Python exists specifically for comparison, while "in" has a wider definition (containment, which includes equality as a special case). Generally, constructs with a precise and clear purpose are the best optimized for that job, because more general constructs are built on top of simple and direct ones. That should make == both the better choice and the less error-prone one when what you want is an equality check.

How is the string.join(str_list, '') implemented under the hood in Python?

I know that concatenating two strings using the += operator makes a new copy of the old string and then appends the new string to that, resulting in quadratic time complexity when it is done repeatedly in a loop.
This answer gives a nice time comparison between the += operation and string.join(str_list, ''). It looks like the join() method runs in linear time (correct me if I am wrong). Out of curiosity, I wanted to know how the string.join(str_list, '') method is implemented in Python since strings are immutable objects?

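It's implemented in C, so Python-level immutability is less of a constraint: the C code can fill in the new string's buffer before the object is ever visible to Python code. In CPython, join first computes the total length of the result, allocates it once, and then copies each piece into place, which is why it runs in linear time. You can find the appropriate source here: unicodeobject.c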
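The single-allocation idea can be sketched in pure Python with a bytearray standing in for the C buffer. This is only an illustration of the approach, not CPython's actual code, and join_bytes_sketch is a hypothetical name:

```python
def join_bytes_sketch(separator, parts):
    """Illustrative two-pass join over bytes objects (not CPython's code)."""
    parts = list(parts)
    if not parts:
        return b""

    # Pass 1: compute the final size so the buffer is allocated only once.
    total = sum(len(p) for p in parts) + len(separator) * (len(parts) - 1)
    buffer = bytearray(total)

    # Pass 2: copy every piece into its final position.
    pos = 0
    for index, part in enumerate(parts):
        if index:
            buffer[pos:pos + len(separator)] = separator
            pos += len(separator)
        buffer[pos:pos + len(part)] = part
        pos += len(part)
    return bytes(buffer)


print(join_bytes_sketch(b", ", [b"a", b"bc", b"d"]))  # b'a, bc, d'
```

Each input byte is copied once into the preallocated buffer, so the work grows linearly with the total size, in contrast to repeated += which recopies the accumulated result each time.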

How does str.startswith really work?

I've been playing for a bit with startswith() and I've discovered something interesting:
>>> tup = ('1', '2', '3')
>>> lis = ['1', '2', '3', '4']
>>> '1'.startswith(tup)
True
>>> '1'.startswith(lis)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: startswith first arg must be str or a tuple of str, not list
Now, the error is obvious and casting the list into a tuple will work just fine as it did in the first place:
>>> '1'.startswith(tuple(lis))
True
Now, my question is: why must the first argument be a str or a tuple of str prefixes, but not a list of str prefixes?
AFAIK, the Python code for startswith() might look like this:
def startswith(src, prefix):
    return src[:len(prefix)] == prefix
But that just confuses me more, because even with that in mind, it still shouldn't make any difference whether it is a list or a tuple. What am I missing?

There is technically no reason to accept other sequence types, no. The source code roughly does this:
if isinstance(prefix, tuple):
    for substring in prefix:
        if not isinstance(substring, str):
            raise TypeError(...)
        if tailmatch(...):
            return True
    return False
elif not isinstance(prefix, str):
    raise TypeError(...)
return tailmatch(...)
(where tailmatch(...) does the actual matching work).
So yes, any iterable would do for that for loop. But, all the other string test APIs (as well as isinstance() and issubclass()) that take multiple values also only accept tuples, and this tells you as a user of the API that it is safe to assume that the value won't be mutated. You can't mutate a tuple but the method could in theory mutate the list.
Also note that you usually test for a fixed number of prefixes or suffixes or classes (in the case of isinstance() and issubclass()); the implementation is not suited for a large number of elements. A tuple implies that you have a limited number of elements, while lists can be arbitrarily large.
Next, if any iterable or sequence type would be acceptable, then that would include strings; a single string is also a sequence. Should then a single string argument be treated as separate characters, or as a single prefix?
So in other words, the limitation self-documents that the sequence won't be mutated, is consistent with other APIs, carries an implication of a limited number of items to test against, and removes ambiguity as to how a single string argument should be treated.
Note that this was brought up before on the Python Ideas list; see this thread; Guido van Rossum's main argument there is that you either special case for single strings or for only accepting a tuple. He picked the latter and doesn't see a need to change this.
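Both the tuple-only rule and the single-string ambiguity can be seen in a short sketch. The URL and the prefix list below are hypothetical values chosen for illustration:

```python
url = "https://example.org/index.html"   # hypothetical value
prefixes = ["http://", "https://"]       # prefixes that happen to live in a list

# A list is rejected, while the same items as a tuple are accepted.
try:
    url.startswith(prefixes)
    list_accepted = True
except TypeError:
    list_accepted = False
tuple_result = url.startswith(tuple(prefixes))

# The ambiguity: is "xa" one prefix, or the characters 'x' and 'a'?
as_single_prefix = "abc".startswith("xa")        # one prefix, "xa"
as_characters = "abc".startswith(tuple("xa"))    # prefixes 'x' or 'a'

print(list_accepted, tuple_result)       # False True
print(as_single_prefix, as_characters)   # False True
```

If arbitrary iterables were accepted, the last two calls would have to mean the same thing, and one of the two results would be surprising.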

This was already suggested on Python-ideas a couple of years back (see: str.startswith taking any iterator instead of just tuple), and GvR had this to say:
The current behavior is intentional, and the ambiguity of strings
themselves being iterables is the main reason. Since startswith() is
almost always called with a literal or tuple of literals anyway, I see
little need to extend the semantics.
In addition to that, there seemed to be no real motivation as to why to do this.
The current approach keeps things simple and fast:
unicode_startswith (and endswith) check for a tuple argument and then for a string one. They then call tailmatch in the appropriate direction. This is, arguably, very easy to understand in its current state, even for strangers to C code.
Adding other cases will only lead to more bloated and complex code for little benefit while also requiring similar changes to any other parts of the unicode object.

On a similar note, here is an excerpt from a talk by core developer Raymond Hettinger discussing API design choices regarding certain string methods, including recent changes to the str.startswith signature. He only briefly mentions that str.startswith accepts a string or a tuple of strings and does not expand on it, but the talk is informative about the decisions and pain points that both core developers and contributors have dealt with leading up to the present API.

Python sys.stdout.write loop problems

I'm not sure of what's going on here, but I have some python code:
import sys
max_cols = 350
max_rows = 1
r1 = range(max_rows)
r2 = range(max_cols)
for y in r1:
    for x in r2:
        sys.stdout.write('something')
        if x is not max_cols-1:
            sys.stdout.write(',')
Now, this works fine for values of max_cols <= 257.
However, if you use >= 258, you end up with an extra ',' at the end.
(The idea here is obviously to generate a CSV file.)
Now, 256 is a CS number, so there's clearly something going on here that I'm unaware of, since everything works perfectly up until that point. This also happens when I try to write to a file using the same pattern.
Why does this happen?
Using Python 3.2.

is is not for checking equality but for checking identity. x is y is only true if both variables refer to the same object. As it happens, CPython reuses objects for small integers, but in general, the concept of identity is very different from the concept of equality. Use the correct operators, == and != for equality and inequality respectively, and it works.
Also note that the code can be made much simpler and more robust by just using the csv module. No need to reinvent the wheel.
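A minimal sketch of the csv approach, reusing max_cols and max_rows from the question. Writing into an io.StringIO buffer and setting lineterminator are choices made for this illustration; a real script would more likely write to an open file or sys.stdout:

```python
import csv
import io

max_cols = 350
max_rows = 1

buffer = io.StringIO()                      # stand-in for a file or sys.stdout
writer = csv.writer(buffer, lineterminator="\n")

for _ in range(max_rows):
    # The writer places separators between fields, never after the last one.
    writer.writerow(["something"] * max_cols)

output = buffer.getvalue()
print(output.count(","))  # 349
```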

The CPython implementation caches small numbers, so all instances of the number 12 are the same object. The is operator compares the identities of objects, not their values. What you wanted to do was use the != operator to compare the values.
CPython caches the integers from -5 through 256, which is why your loop worked as long as max_cols - 1 was at most 256.
Incidentally, whenever you bump into a pattern like this, where you have to drop the last separator from a list of delimited things, str.join is probably what you wanted.
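For the loop in the question, that could look like the sketch below; the rows list is an assumption added for illustration:

```python
max_cols = 350
max_rows = 1

# join() puts the separator only between items, so no index check is needed.
rows = [
    ",".join("something" for _ in range(max_cols))
    for _ in range(max_rows)
]

print(rows[0].count(","))      # 349
print(rows[0].endswith(","))   # False
```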
