I'm looking for the most efficient way to compare two strings, and I'm not sure which is better: == or in. Or is there some other way to do it that is more efficient than either of these?
Edit: I'm trying to check for equality
They do different things.
== tests for equality:
"tomato" == "tomato" # True
"potato" == "tomato" # False
"mat" == "tomato" # False
in tests for substring, and can be considered a (probably) more efficient version of str.find() != -1:
"tomato" in "tomato" # True
"potato" in "tomato" # False
"mat" in "tomato" # True <-- this is different than above
In both cases, they're the most efficient ways available of doing what they do. If you're using them to compare whether two strings are actually equal, then of course strA == strB is faster than (strA in strB) and (strB in strA).
Please define "comparing".
If you want to know if 2 strings are equal, == is the simplest way.
If you want to know if 1 string contains another, in is the simplest way.
If you want to know how much they overlap, considering gaps, you need complicated algorithms. How about a thick book on algorithms? (This is similar to comparing genetic sequences. I think a book on Bioinformatics algorithms would be very useful too. Anyhow, this case is way too complicated for Stack Overflow.)
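For a rough overlap score that does not require a full alignment algorithm, Python's standard library ships difflib. The sketch below is only one possible measure, and the helper name similarity is an assumption made for illustration; it is not a substitute for the sequence-alignment methods mentioned above.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity("tomato", "tomato"))  # identical strings score 1.0
print(similarity("potato", "tomato"))  # partial overlap scores strictly between 0 and 1
```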
EDIT:
For equality stick with "==". It's in Python to do its job.
== exists in Python specifically for comparison, while "in" has a broader meaning (containment, which includes equality as a special case). Constructs with a precise, narrow purpose are generally the best optimized for that job, because broader constructs are usually built on top of the simple, direct ones. That should make == both the better choice and the less error-prone one when you want equality.
Related
I know that if we would like to know whether string a is contained in b we can use:
a in b
When a equals b, the above expression still returns True. I would like an expression that returns False when a == b and returns True when a is a substring of b. So I used the following expression:
a in b and a != b
I just wonder whether there is a simpler expression in Python that works the same way.
Not sure how efficient direct string comparisons are in Python, but assuming they are O(n) this should be reasonably efficient and still readable:
len(a) < len(b) and a in b
Like the other answers, it's also O(n), but the length check should be O(1). It should also be better suited for edge cases (e.g. a being longer than b).
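To make the equivalence concrete, here is a minimal sketch that wraps the length-based check in a function and compares it with the asker's original expression. The function name is_proper_substring is made up for this example.

```python
def is_proper_substring(a: str, b: str) -> bool:
    """True when a occurs in b and a is strictly shorter than b."""
    # The length comparison is cheap and short-circuits before the substring scan.
    return len(a) < len(b) and a in b

print(is_proper_substring("mat", "tomato"))     # True
print(is_proper_substring("tomato", "tomato"))  # False

# Both formulations agree on these sample pairs.
pairs = [("mat", "tomato"), ("tomato", "tomato"), ("tomatoes", "tomato"), ("", "x"), ("", "")]
for a, b in pairs:
    assert is_proper_substring(a, b) == (a in b and a != b)
```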
This is going to be a short answer, but it is what it is. What you have is enough.
While there might be other alternatives, the way you did it is easy to understand, simple and more readable than anything else (of course, this is subjective). IMO a simpler expression in Python that works in the same way doesn't exist. That's just my personal perspective on this subject. I'm usually encouraged to follow KISS:
The KISS principle states that most systems work best if they are kept
simple rather than made complicated; therefore simplicity should be a
key goal in design and unnecessary complexity should be avoided.
As an example, the other answer is no different from the explicit and, and the chaining can be confusing when used with in and == because many people will read it as (a1 in b) != a1, at least at first glance.
b.find(a) > 0 or b.startswith(a) and len(b) > len(a)
I'm not saying this is the best answer. It's just another solution. The OP's statement is as good as any. The accepted answer is also good. But this one does work and demonstrates a different approach to the problem.
I want to use a custom compare function when building a set, so that I can take advantage of the efficiency of the set algorithm. Technically I could write a double for loop to compare the two lists (keep, original), but I thought that might not be efficient.
For example:
textlist = ["ravi is happy", "happy ravi is", "is happy ravi", "is ravi happy"]
set() should return only 1 of these elements, as the compare function would return True if the similarity between the compared items is >= threshold.
In python. Thanks.
P.S.
The real trick is that I'd like to use my string_compare(t1,t2): Float to do the comparison rather than hashing and equality...
P.P.S.
C# has similar function:
How to remove similar string from a list?
I think this is what you were looking for:
{' '.join(sorted(sentence.split())) for sentence in textlist}
This re-orders the string and therefore Python set will now work because we are comparing identical strings.
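If you also want to keep one of the original sentences rather than the re-ordered form, the same normalization can serve as a dictionary key. This is a sketch under the assumption that "similar" means "same words in any order"; it does not implement a threshold-based string_compare, and the name dedupe_by_words is invented for the example.

```python
def dedupe_by_words(sentences):
    """Keep the first sentence seen for each distinct collection of words."""
    seen = {}
    for sentence in sentences:
        # Sorting the words gives an order-independent canonical key.
        key = ' '.join(sorted(sentence.split()))
        seen.setdefault(key, sentence)
    return list(seen.values())

textlist = ["ravi is happy", "happy ravi is", "is happy ravi", "is ravi happy"]
print(dedupe_by_words(textlist))  # ['ravi is happy']
```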
I noticed a Python script I was writing was acting squirrelly, and traced it to an infinite loop, where the loop condition was while line is not ''. Running through it in the debugger, it turned out that line was in fact ''. When I changed it to !='' rather than is not '', it worked fine.
Also, is it generally considered better to just use '==' by default, even when comparing int or Boolean values? I've always liked to use 'is' because I find it more aesthetically pleasing and pythonic (which is how I fell into this trap...), but I wonder if it's intended to just be reserved for when you care about finding two objects with the same id.
For all built-in Python objects (like
strings, lists, dicts, functions,
etc.), if x is y, then x==y is also
True.
Not always. NaN is a counterexample. But usually, identity (is) implies equality (==). The converse is not true: Two distinct objects can have the same value.
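The NaN counterexample can be checked directly. A minimal demonstration:

```python
nan = float("nan")

print(nan is nan)  # True: both names refer to the same object
print(nan == nan)  # False: NaN never compares equal, not even to itself
```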
Also, is it generally considered better to just use '==' by default, even
when comparing int or Boolean values?
You use == when comparing values and is when comparing identities.
When comparing ints (or immutable types in general), you pretty much always want the former. There's an optimization that allows small integers to be compared with is, but don't rely on it.
For boolean values, you shouldn't be doing comparisons at all. Instead of:
if x == True:
    # do something
write:
if x:
    # do something
For comparing against None, is None is preferred over == None.
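One reason for that preference is that == can be overridden by a class, while is cannot. The class below is a deliberately contrived, hypothetical example to show the difference:

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""
    def __eq__(self, other):
        return True

obj = AlwaysEqual()

print(obj == None)  # True, although obj is clearly not None
print(obj is None)  # False: identity checks cannot be overridden
```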
I've always liked to use 'is' because
I find it more aesthetically pleasing
and pythonic (which is how I fell into
this trap...), but I wonder if it's
intended to just be reserved for when
you care about finding two objects
with the same id.
Yes, that's exactly what it's for.
I would like to show a small example of how is and == behave with immutable types. Try this:
>>> a = 19998989890
>>> b = 19998989889 + 1
>>> a is b
False
>>> a == b
True
is compares two objects in memory, == compares their values. For example, you can see that small integers are cached by Python:
>>> c = 1
>>> b = 1
>>> b is c
True
You should use == when comparing values and is when comparing identities. (Also, from an English point of view, "equals" is different from "is".)
The logic is not flawed. The statement
if x is y then x==y is also True
should never be read to mean
if x==y then x is y
It is a logical error on the part of the reader to assume that the converse of a logic statement is true. See http://en.wikipedia.org/wiki/Converse_(logic)
See This question
Your logic in reading
For all built-in Python objects (like
strings, lists, dicts, functions,
etc.), if x is y, then x==y is also
True.
is slightly flawed.
If is applies then == will be True, but it does NOT apply in reverse. == may yield True while is yields False.
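A short illustration with lists, where two separately created objects are guaranteed to be distinct:

```python
a = [1, 2, 3]
b = [1, 2, 3]  # a second, separately created list with the same contents
c = a          # another name for the same list object

print(a == b, a is b)  # True False
print(a == c, a is c)  # True True
```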
Assume I have a set of strings S and a query string q. I want to know if any member of S is a substring of q. (For the purpose of this question substring includes equality, e.g. "foo" is a substring of "foo".) For example assume the function that does what I want is called anySubstring:
S = ["foo", "baz"]
q = "foobar"
assert anySubstring(S, q) # "foo" is a substring of "foobar"
S = ["waldo", "baz"]
assert not anySubstring(S, q)
Is there any easy-to-implement algorithm to do this with time complexity sublinear in len(S)? It's ok if S has to be processed into some clever data structure first because I will be querying each S with a lot of q strings, so the amortized cost of this preprocessing might be reasonable.
EDIT: To clarify, I don't care which member of S is a substring of q, only whether at least one is. In other words, I only care about a boolean answer.
I think the Aho-Corasick algorithm does what you want. Another solution that is very simple to implement is the Karp-Rabin algorithm.
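As a rough sketch of the Aho-Corasick idea (not a tuned implementation), the code below builds the automaton once from S and then answers each boolean query in a single pass over q. The names build_automaton and any_substring, and the list-of-dicts representation, are choices made for this example.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and match flags."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link of each state
    match = [False]  # True if a pattern ends at this state or at a suffix of it
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                match.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        match[state] = True
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches from the longest proper suffix state.
            match[nxt] = match[nxt] or match[fail[nxt]]
            queue.append(nxt)
    return goto, fail, match

def any_substring(automaton, q):
    """Return True if any pattern in the automaton occurs in q."""
    goto, fail, match = automaton
    if match[0]:  # an empty pattern is a substring of every string
        return True
    state = 0
    for ch in q:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if match[state]:
            return True
    return False

automaton = build_automaton(["foo", "baz"])
print(any_substring(automaton, "foobar"))  # True
print(any_substring(automaton, "quux"))    # False
```

Because the automaton is built once per S, the preprocessing cost is amortized over all queries, which is the situation described in the question.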
So if the length of S is much less than the sum of the lengths of the potential substrings, your best option would be to build a suffix tree from S and then search in it. This is linear with respect to the length of S plus the total length of the candidate substrings. Of course there cannot be an algorithm with better complexity, as you have to pass through all the input at least once. In the opposite case, i.e. the length of S is more than the total length of the substrings, your best option would be Aho-Corasick.
Hope this helps.
Create a regular expression .*(S1|S2|...|Sn).* and construct its minimal DFA.
Run your query string through the DFA.
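In Python, the closest ready-made equivalent is a compiled alternation of the escaped strings. Note that this is only an approximation of the idea above: Python's re module is a backtracking engine and does not build a minimal DFA, so the complexity guarantee does not carry over. The helper name compile_any is an assumption for this sketch.

```python
import re

def compile_any(patterns):
    """Compile one alternation that matches any of the literal patterns."""
    # re.escape keeps characters such as '.' or '|' from being read as regex syntax.
    return re.compile("|".join(re.escape(p) for p in patterns))

matcher = compile_any(["foo", "baz"])
print(bool(matcher.search("foobar")))  # True
print(bool(matcher.search("quux")))    # False
```

Using search instead of match makes the leading and trailing .* unnecessary.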
I'm writing a system to read data coming from devices that track trucks.
The system will receive data from different types of equipment, so the trace strings it receives will differ depending on the equipment model.
So I need an idea of how to identify these strings so that each one gets the correct treatment. For example, one of the units sends the following string:
[0,0,13825,355255057406002,0,250814,142421,-2197354498319328,-4743040708824992,800,9200,0,0,0,0,0,12,0,31,0,107]
Another device sends its string this way:
SA200STT;459055;209;20140806;23:18:28;20702;-22.899244;-047.047640;000.044;000.00;11;1;68548721;12.60;100000;2;0016
So my question is, what is the best way for me to identify each of these strings?
The first step is to identify what is unique about each format. In the example you give, the first string starts with [ and ends with ], and the second version starts with the sequence "SA200STT". So, a first approximation is to match on that:
import re

def identify(s):
    # Format 1 is wrapped in square brackets.
    if re.match(r'^\[.*\]$', s):
        return "type 1"
    # Format 2 starts with the literal "SA200STT".
    elif re.match(r'^SA200STT.*$', s):
        return "type 2"
    else:
        return "unknown"

s1 = r'[0,0,13825,355255057406002,0,250814,142421,-2197354498319328,-4743040708824992,800,9200,0,0,0,0,0,12,0,31,0,107]'
s2 = r'SA200STT;459055;209;20140806;23:18:28;20702;-22.899244;-047.047640;000.044;000.00;11;1;68548721;12.60;100000;2;0016'
print("s1:", identify(s1))
print("s2:", identify(s2))
When I run the above I get:
s1: type 1
s2: type 2
I doubt that's the actual algorithm that you need, but that's the general idea. Figure out how you can tell each format apart, then make an expression that detects that.
A note about using regular expressions:
Regular expressions can be slow, and in general should be avoided if they can be avoided (not just for the speed issue, but because they can make your code hard to understand). If performance or readability is a concern, consider alternative solutions such as comparing the first N characters, or the last N characters.
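As a sketch of that alternative, the same classification can be written with str.startswith and str.endswith. The shortened sample strings below are abbreviations of the ones in the question, used only to keep the example compact.

```python
def identify(s: str) -> str:
    """Classify a trace string by its first and last characters."""
    if s.startswith("[") and s.endswith("]"):
        return "type 1"
    if s.startswith("SA200STT"):
        return "type 2"
    return "unknown"

print(identify("[0,0,13825]"))          # type 1
print(identify("SA200STT;459055;209"))  # type 2
print(identify("something else"))       # unknown
```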
It sounds pretty simple.
Just check some distinguishing characteristic of the data to recognize the format.
Depending on how complex each of your formats is, you can probably do this without using a regex.
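def parse(data):
    parse_format = get_parser(data)
    return parse_format(data)

def get_parser(data):
    # Return the parser function that matches the detected format.
    if is_format_a(data):
        return parse_format_a
    if is_format_b(data):
        return parse_format_b
    # etc.
    raise ValueError("unknown format")

def is_format_a(data):
    return data.startswith('[')

def is_format_b(data):
    # Assumed check (not in the original answer): format B is ';'-separated.
    return ';' in data

def parse_format_a(data):
    return data.strip('[]').split(',')

def parse_format_b(data):
    return data.split(';')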
Bryan Oakley gives a good solution. But, to use his own words: The first step is to identify what is unique about each format.
You just have to check which of the characters ; or , is present. In fact, checking whether one of them is present is enough, since they are mutually exclusive!
For instance:
s1 = "[0,0,13825,355255057406002,0,250814,142421,-2197354498319328,-4743040708824992,800,9200,0,0,0,0,0,12,0,31,0,107]"
s2 = r'SA200STT;459055;209;20140806;23:18:28;20702;-22.899244;-047.047640;000.044;000.00;11;1;68548721;12.60;100000;2;0016'
if ',' in s1:
    print("Type 1")
else:
    print("Type 2")
This seems to be the fastest way, since regular expressions are slow, and from your question I gather you will be reading from a device. Hence, you need speed.