Matching whole words using "in" in python - python

I've been searching for some time and still haven't found an answer. Maybe it has something to do with regular expressions, but I think there should be a simple answer that I am missing. It seems trivial to me, so here goes:
On the python interpreter I get:
"abc" in "abc123"
as True.
I want a command that returns False here. I want the entire word to be matched.
Thanks!

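in isn't how it's done. Use a regular expression with word boundaries (\b); re.search returns None, so the interpreter prints nothing, when there is no match:
>>> import re
>>> re.search(r'\babc\b', 'abc123')
>>> re.search(r'\babc\b', 'abc 123')
<_sre.SRE_Match object at 0x1146780>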

If you want to do a plain match of just one word, use ==:
'abc' == 'abc123' # False
If you're doing 'abc' in ['cde','fdabc','abc123'], that returns False anyway:
'abc' in ['cde','fdabc','abc123'] # False
The reason 'abc' in 'abc123' returns true, from the docs:
For the Unicode and string types, x in y is true if and only if x is a
substring of y. An equivalent test is y.find(x) != -1.
So for comparing against a single string, use ==. When testing membership in a collection of strings, in can be used (you could also do 'abc' in ['abc123']), since in behaves as your intuition expects when y is a list or other collection.

I might not understand your question, but it seems like what you want is "abc123" == "abc". This returns False, whereas "abc123" == "abc123" returns True.
Perhaps what you are looking for is matching on whole words but splitting on whitespace? That is, "abc" does not match "abc123", but it does match "abc def"? If that is the case, you want something like this:
def word_in(word, phrase):
    return word in phrase.split()
word_in("abc", "abc123") # False
word_in("abc", "abc def") # True

In my case, I used a small trick: the whole word should be surrounded by spaces. So if I want to find a word like "Kill", I search for " Kill ". This won't match a word like "Skill".
' kill ' in myString
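One caveat with the space trick: a word at the very start or end of the string has no space on one side, so it is missed. A minimal sketch of one way around that, padding the searched string as well. The helper name padded_word_in is made up for illustration, and only plain spaces count as separators here, so "kill." with a trailing period would still not match.

```python
def padded_word_in(word, text):
    # Pad both sides so that words at the start or end of text
    # are also surrounded by spaces.
    return f" {word} " in f" {text} "

print(padded_word_in("kill", "kill bill"))   # True: word at the start
print(padded_word_in("kill", "over kill"))   # True: word at the end
print(padded_word_in("kill", "skill set"))   # False: only part of "skill"
```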

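Perhaps a long shot, but it can be done in a simpler way.
def word_in(needle, haystack, case_sensitive=True):
    if not case_sensitive:
        needle, haystack = needle.lower(), haystack.lower()
    # Whole string, start, middle, or end of the haystack
    return (haystack == needle or haystack.startswith(needle + ' ')
            or ' ' + needle + ' ' in haystack or haystack.endswith(' ' + needle))
print(word_in('abc', 'abc123'))
print(word_in('abc', 'abc 123'))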
First example produces False, the other True

You can try to use find method and compare the result to -1:
>>> a = "abc123"
>>> a.find("abc")
0
>>> a.find("bcd")
-1

Related

Regular expressions in python specific string plus numeric [duplicate]

I'm having trouble finding the correct regular expression for the scenario below:
Lets say:
a = "this is a sample"
I want to match whole word - for example match "hi" should return False since "hi" is not a word and "is" should return True since there is no alpha character on the left and on the right side.
Try
re.search(r'\bis\b', your_string)
From the docs:
\b Matches the empty string, but only at the beginning or end of a word.
Note that the re module uses a naive definition of "word" as a "sequence of alphanumeric or underscore characters", where "alphanumeric" depends on locale or unicode options.
Also note that without the raw string prefix, \b is seen as "backspace" instead of regex word boundary.
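To make the raw-string point concrete, here is a small sketch (the variable name text is just for illustration). Without the r prefix, Python turns \b into a backspace character before the regex engine ever sees it, so the pattern silently matches nothing.

```python
import re

text = "this is a sample"

# Without the r prefix, "\b" is the single backspace character (ASCII 8).
print(len("\b"), len(r"\b"))                 # 1 2
print(re.search("\bis\b", text))             # None
print(bool(re.search(r"\bis\b", text)))      # True
print(bool(re.search(r"\bhi\b", text)))      # False
```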
Try using the "word boundary" assertion, \b, from the regex module, re:
>>> import re
>>> x="this is a sample"
>>> y="this isis a sample."
>>> regex=re.compile(r"\bis\b") # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
>>> regex.findall(y)
[]
>>> regex.findall(x)
['is']
From the documentation of the re module:
\b matches the empty string, but only at the beginning or end of a word
...
For example r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'
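The documentation's examples are easy to check for yourself. A quick sketch (the names pattern, candidates and results are mine, not from the docs):

```python
import re

pattern = re.compile(r"\bfoo\b")
candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]

# bool() turns the Match object (or None) into True/False
results = {c: bool(pattern.search(c)) for c in candidates}

for candidate, matched in results.items():
    print(repr(candidate), matched)
```

The first four come out True and the last two False, because in 'foobar' and 'foo3' a word character directly follows 'foo'.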
I think that the behavior desired by the OP was not completely achieved using the answers given. Specifically, the desired output of a boolean was not accomplished. The answers given do help illustrate the concept, and I think they are excellent. Perhaps I can illustrate what I mean by stating that I think that the OP used the examples used because of the following.
The string given was,
a = "this is a sample"
The OP then stated,
I want to match whole word - for example match "hi" should return False since "hi" is not a word ...
As I understand, the reference is to the search token, "hi" as it is found in the word, "this". If someone were to search the string, a for the word "hi", they should receive False as the response.
The OP continues,
... and "is" should return True since there is no alpha character on the left and on the right side.
In this case, the reference is to the search token "is" as it is found in the word "is". I hope this helps clarify things as to why we use word boundaries. The other answers have the behavior of "don't return a word unless that word is found by itself -- not inside of other words." The "word boundary" shorthand character class does this job nicely.
Only the word "is" has been used in examples up to this point. I think that these answers are correct, but I think that there is more of the question's fundamental meaning that needs to be addressed. The behavior of other search strings should be noted to understand the concept. In other words, we need to generalize the (excellent) answer by @georg using re.search(r"\bis\b", your_string). The same r"\bis\b" concept is also used in the answer by @OmPrakash, who started the generalizing discussion by showing
>>> y="this isis a sample."
>>> regex=re.compile(r"\bis\b") # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
>>> regex.findall(y)
[]
Let's say the method which should exhibit the behavior I've discussed is named
find_only_whole_word(search_string, input_string)
The following behavior should then be expected.
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
Once again, this is how I understand the OP's question. We have a step towards that behavior with the answer from @georg, but it's a little hard to interpret/implement. To wit:
>>> import re
>>> a = "this is a sample"
>>> re.search(r"\bis\b", a)
<_sre.SRE_Match object; span=(5, 7), match='is'>
>>> re.search(r"\bhi\b", a)
>>>
There is no output from the second command, because re.search returned None. The useful answer from @OmPrakash shows output, but not True or False.
Here's a more complete sampling of the behavior to be expected.
>>> find_only_whole_word("this", a)
True
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("a", a)
True
>>> find_only_whole_word("sample", a)
True
# Use "ample", part of the word, "sample": (s)ample
>>> find_only_whole_word("ample", a)
False
# (t)his
>>> find_only_whole_word("his", a)
False
# (sa)mpl(e)
>>> find_only_whole_word("mpl", a)
False
# Any random word
>>> find_only_whole_word("applesauce", a)
False
>>>
This can be accomplished by the following code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
##file find_only_whole_word.py
import re
def find_only_whole_word(search_string, input_string):
    # Create a raw string with word boundaries from the user's search_string
    raw_search_string = r"\b" + search_string + r"\b"
    match_output = re.search(raw_search_string, input_string)
    ##As noted by @OmPrakash, if you want to ignore case, uncomment
    ##the next two lines
    #match_output = re.search(raw_search_string, input_string,
    #                         flags=re.IGNORECASE)
    no_match_was_found = ( match_output is None )
    if no_match_was_found:
        return False
    else:
        return True
##endof: find_only_whole_word(search_string, input_string)
A simple demonstration follows. Run the Python interpreter from the same directory where you saved the file, find_only_whole_word.py.
>>> from find_only_whole_word import find_only_whole_word
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("cucumber", a)
False
# The excellent example from @OmPrakash
>>> find_only_whole_word("is", "this isis a sample")
False
>>>
The trouble with regex is that if the string you want to search for in another string contains regex characters, it gets complicated: any string with brackets will fail unless it is escaped.
This code will find a word
word="is"
srchedStr="this is a sample"
if srchedStr.find(" "+word+" ") >=0 or \
   srchedStr.endswith(" "+word):
    <do stuff>
The first part of the conditional searches for the text with a space on each side and the second part catches the end-of-string situation. Note that endswith returns a boolean, whereas find returns an integer. A word at the very start of the string is not caught by this check.
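If you would rather stay with regex, the special-character problem can be handled with re.escape. This is only a sketch under my own assumptions: the function name contains_whole_word is made up, and I swapped \b for lookarounds, because \b does not behave as expected when the search string itself starts or ends with punctuation such as a bracket.

```python
import re

def contains_whole_word(word, text):
    # re.escape neutralises characters such as ( ) [ ] . in word.
    # The lookarounds require that no word character touches the match;
    # unlike \b they also work when word starts or ends with punctuation.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text) is not None

print(contains_whole_word("is", "this is a sample"))   # True
print(contains_whole_word("hi", "this is a sample"))   # False
print(contains_whole_word("f(x)", "plot f(x) here"))   # True
print(contains_whole_word("is", "is this it"))         # True: start of string
```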

Check if a string has unique characters excluding whitespace

I'm practicing questions from Cracking the Coding Interview to become better and, just in case, be prepared. The first problem states: find if a string has all unique characters or not. I wrote this and it works perfectly:
def isunique(string):
    x = []
    for i in string:
        if i in x:
            return False
        else:
            x.append(i)
    return True
Now, my question is, what if I have all unique characters like in:
'I am J'
which would be pretty rare, but let's say it occurs by mere chance: how can I create an exception for the spaces? Is there a way to not count the space as a character, so the function returns True and not False?
Now, no matter how many spaces or special characters are in your string, only the word characters (\w) are counted:
import re
def isunique(string):
    pattern=r'\w'
    search=re.findall(pattern,string)
    string=search
    x = []
    for i in string:
        if i in x:
            return False
        else:
            x.append(i)
    return True
print(isunique('I am J'))
output:
True
without space words test case :
print(isunique('war'))
True
with space words test case:
print(isunique('w a r'))
True
repeating letters :
print(isunique('warrior'))
False
Create a list of characters you want to consider as non-characters and replace them in string. Then perform your function code.
As an alternative, a better way to check the uniqueness of characters is to compare the length of the final string with the length of the set built from it:
def isunique(my_string):
    nonchars = [' ', '.', ',']
    for nonchar in nonchars:
        my_string = my_string.replace(nonchar, '')
    return len(set(my_string)) == len(my_string)
Sample Run:
>>> isunique( 'I am J' )
True
As per Python's set() documentation:
Return a new set object, optionally with elements taken from iterable.
set is a built-in class. See set and Set Types — set, frozenset for
documentation about this class.
And... a pool of answers is never complete unless there is also a regex solution:
def is_unique(string):
    import re
    patt = re.compile(r"^.*?(.).*?(\1).*$")
    return not re.search(patt, string)
(I'll leave the whitespace handling as an exercise to the OP)
An elegant approach (YMMV), with collections.Counter.
from collections import Counter
def isunique(string):
    return Counter(string.replace(' ', '')).most_common(1)[0][-1] == 1
Alternatively, if your strings contain more than just whitespaces (tabs and newlines for instance), I'd recommend regex based substitution:
import re
string = re.sub(r'\s+', '', string, flags=re.M)
Simple solution
def isunique(string):
    return all(string.count(i)==1 for i in string if i!=' ')
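For completeness, here is a sketch that combines the ideas above: it skips any whitespace (not just spaces) and stops at the first repeated character, using a set so each lookup is cheap. The name is_unique_ignoring_whitespace is mine, and the comparison is case-sensitive, so 'A' and 'a' count as different characters.

```python
def is_unique_ignoring_whitespace(text):
    seen = set()
    for char in text:
        if char.isspace():
            continue            # spaces, tabs, newlines do not count
        if char in seen:
            return False        # stop at the first repeat
        seen.add(char)
    return True

print(is_unique_ignoring_whitespace('I am J'))    # True
print(is_unique_ignoring_whitespace('w a r'))     # True
print(is_unique_ignoring_whitespace('warrior'))   # False
```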

In Python, how to match a string in a word

Python code
str= "bcd"
word = "abcd1"
if pattern = re.search(str, word):
print pattern.group(1)
I want to search "bdc" in a word.. how do I do it?
>>> str= "bcd"
>>> word = "abcd1"
>>> str in word
True
Simple way
You can do it with the find() method of a string object.
str = "abc"
word = "abcd1"
index = word.find(str)
if index != -1:
    print(index)
The index is the position of the first character of the subsequence that you are looking for.
You can use word.index(str). It will return the position of str in word, or raise an exception if str is not found in word.
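The difference between find() and index() shows up only when the substring is missing. A short sketch (the variable position is just for illustration):

```python
word = "abcd1"

print(word.find("bcd"))    # 1
print(word.find("xyz"))    # -1

try:
    position = word.index("xyz")
except ValueError:
    position = None        # index() raises instead of returning -1
print(position)            # None
```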
But if you like/want to use regular expressions, your use of re.search() is correct. re.search returns a match if the pattern is found in the string, or None if it is not found. Since None evaluates to False, you can just do:
if re.search(str, word):
    pass  # found the pattern
If all you need is to know if the string is contained in word, you can also simply:
if str in word:
    whatever
else:
    somethingelse
You don't mention whether you need any information, like what position the result is in. If you just want a simple check, you can use the "If X in Y" approach:
In [1]: needle = "pie"
In [2]: haystack = "piece of string"
In [3]: if needle in haystack: print True
True
If you simply want to know if the string is in another word, you can do:
if 'bdc' in word:
# do something
If you need to do that with regex:
import re
pat = re.compile('bdc')
pat.search(word)
(Or just re.search('bdc', word) )
Also, you should most likely never use str as a variable name, as it shadows the built-in type str.

How to check if a given character is considered as 'special' by the Python regex engine?

Is there an easy way to verify that the given character has a special regex function?
Of course I can collect regex characters in a list like ['.', "[", "]", etc.] to check that, but I guess there is a more elegant way.
You could use re.escape. For example:
>>> re.escape("a") == "a"
True
>>> re.escape("[") == "["
False
The idea is that if a character is a special one, then re.escape returns the character with a backslash in front of it. Otherwise, it returns the character itself.
You can use re.escape within the all function as follows:
>>> def checker(st):
... return all(re.escape(i)==i for i in st)
...
>>> checker('aab]')
False
>>> checker('aab')
True
>>> checker('aa.b3')
False
Per the documentation, re.escape will (emphasis mine):
Return string with all non-alphanumerics backslashed; this is useful
if you want to match an arbitrary literal string that may have regular
expression metacharacters in it.
So it tells you whether a character could be a meaningful one, not whether it is. For example:
>>> re.escape('&') == '&'
False
This is useful for processing arbitrary strings, as it ensures that all special characters are escaped, but not for telling you which ones actually needed to be. The simplest approach, in my view, is the one dismissed in the question:
char in set(r'.^$*+?{}[]\|()')
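Wrapped up as a helper, that could look like the sketch below. The names REGEX_SPECIAL_CHARS and is_regex_special are made up for illustration, and the set holds only the metacharacters that are special outside a character class; inside [...] the rules differ (for example, - and ^ matter there).

```python
# Metacharacters that are special outside a character class
REGEX_SPECIAL_CHARS = frozenset(r".^$*+?{}[]\|()")

def is_regex_special(char):
    return char in REGEX_SPECIAL_CHARS

print(is_regex_special("["))    # True
print(is_regex_special("\\"))   # True
print(is_regex_special("a"))    # False
print(is_regex_special("&"))    # False
```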
Elegance lies in the eyes of the beholder, however (IMHO) this (below) is the most generic/"timeproof" way of checking if a character is considered to be special by the Python Regex engine -
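import re
def isFalsePositive(char):
    # Probe with a different literal; only a wildcard such as '.' matches it
    probe = 'b' if char == 'a' else 'a'
    m = re.match(char, probe)
    return m is not None and m.end() == 1
def isSpecial(char):
    try:
        m = re.match(char, char)
    except re.error:
        # The character alone is not a valid pattern, e.g. '[' or '*'
        return True
    if m is not None and m.end() == 1:
        if isFalsePositive(char):
            return True
        else:
            return False
    else:
        return True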
P.S. -
isFalsePositive() may be overkill to check the special case of '.' (dot). :-)

How to check if a string only contains letters?

I'm trying to check if a string only contains letters, not digits or symbols.
For example:
>>> only_letters("hello")
True
>>> only_letters("he7lo")
False
Simple:
if string.isalpha():
    print("It's all letters")
str.isalpha() is only true if all characters in the string are letters:
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise.
Demo:
>>> 'hello'.isalpha()
True
>>> '42hello'.isalpha()
False
>>> 'hel lo'.isalpha()
False
The str.isalpha() method works, i.e.:
if my_string.isalpha():
    print('it is letters')
For people finding this question via Google who might want to know if a string contains only a subset of all letters, I recommend using regexes:
import re
def only_letters(tested_string):
    match = re.match("^[ABCDEFGHJKLM]*$", tested_string)
    return match is not None
You can leverage regular expressions.
>>> import re
>>> pattern = re.compile("^[a-zA-Z]+$")
>>> pattern.match("hello")
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> pattern.match("hel7lo")
>>>
The match() method will return a Match object if a match is found. Otherwise it will return None.
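One thing to watch with the ^...$ form: $ also matches just before a trailing newline, so "hello\n" passes. A sketch using re.fullmatch (available since Python 3.4) avoids that; the names LETTERS and only_ascii_letters are mine.

```python
import re

LETTERS = re.compile(r"[a-zA-Z]+")

def only_ascii_letters(text):
    # fullmatch requires the entire string to match the pattern
    return LETTERS.fullmatch(text) is not None

print(only_ascii_letters("hello"))      # True
print(only_ascii_letters("hel7lo"))     # False
print(only_ascii_letters("hello\n"))    # False
# With ^...$ and match(), the trailing newline slips through:
print(bool(re.match("^[a-zA-Z]+$", "hello\n")))   # True
```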
An easier approach is to use the .isalpha() method
>>> "Hello".isalpha()
True
>>> "Hel7lo".isalpha()
False
isalpha() returns True if there is at least one character in the string and all the characters in the string are alphabetic.
Actually, we're now in the globalized world of the 21st century, and people no longer communicate using ASCII only, so when answering a question about "is it letters only" you need to take into account letters from non-ASCII alphabets as well. Python has a pretty cool unicodedata library which, among other things, allows categorization of Unicode characters:
import unicodedata
unicodedata.category('陳')
'Lo'
unicodedata.category('A')
'Lu'
unicodedata.category('1')
'Nd'
unicodedata.category('a')
'Ll'
The categories and their abbreviations are defined in the Unicode standard. From here you can quite easily come up with a function like this:
def only_letters(s):
    for c in s:
        cat = unicodedata.category(c)
        if cat not in ('Ll','Lu','Lo'):
            return False
    return True
And then:
only_letters('Bzdrężyło')
True
only_letters('He7lo')
False
As you can see the whitelisted categories can be quite easily controlled by the tuple inside the function. See this article for a more detailed discussion.
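If you want every kind of letter rather than a hand-picked tuple, one option is to accept any category whose code starts with "L", which also covers titlecase (Lt) and modifier (Lm) letters. This is a sketch of that variation, not the function above; the name only_letters_any_category is mine, and like any all()-based check it returns True for an empty string.

```python
import unicodedata

def only_letters_any_category(s):
    # Every Unicode letter category starts with "L": Lu, Ll, Lt, Lm, Lo
    return all(unicodedata.category(c).startswith("L") for c in s)

print(only_letters_any_category('Bzdrężyło'))   # True
print(only_letters_any_category('He7lo'))       # False
print(unicodedata.category('\u01c5'))           # Lt (a titlecase letter)
print(only_letters_any_category('\u01c5'))      # True
```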
The string.isalpha() function will work for you.
See http://www.tutorialspoint.com/python/string_isalpha.htm
Looks like people are saying to use str.isalpha.
This is the one line function to check if all characters are letters.
def only_letters(string):
    return all(letter.isalpha() for letter in string)
all accepts an iterable of booleans, and returns True iff all of the booleans are True.
More generally, all returns True if the objects in your iterable would be considered True. These would be considered False
0
None
Empty data structures (ie: len(list) == 0)
False. (duh)
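One consequence worth knowing: all() returns True for an empty iterable, so the one-liner above says True for an empty string, while str.isalpha() says False. A small sketch of a variant that mirrors isalpha() in that case (the name only_letters_nonempty is made up):

```python
def only_letters_nonempty(text):
    # bool(text) is False for "", mirroring str.isalpha()
    return bool(text) and all(letter.isalpha() for letter in text)

print(all(letter.isalpha() for letter in ""))   # True: nothing to falsify
print("".isalpha())                              # False
print(only_letters_nonempty(""))                 # False
print(only_letters_nonempty("hello"))            # True
```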
(1) Use str.isalpha() to check the string.
(2) Please check the program below for reference:
s = "this"  # No space or digit in this string
print(s.isalpha())  # prints True
s = "this is 2"
print(s.isalpha())  # prints False
Note:- I checked above example in Ubuntu.
A pretty simple solution I came up with: (Python 3)
def only_letters(tested_string):
    for letter in tested_string:
        if letter not in "abcdefghijklmnopqrstuvwxyz":
            return False
    return True
You can add a space in the string you are checking against if you want spaces to be allowed.
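The same idea can be written with string.ascii_letters, which also covers uppercase, plus a parameter for extra allowed characters. This is just a sketch; the function name and the extra_allowed parameter are my own invention, not part of the answer above.

```python
import string

def only_ascii_letters_and(tested_string, extra_allowed=" "):
    # Allowed characters: a-z, A-Z, plus anything in extra_allowed
    allowed = set(string.ascii_letters) | set(extra_allowed)
    return all(char in allowed for char in tested_string)

print(only_ascii_letters_and("hello world"))        # True
print(only_ascii_letters_and("Hello"))              # True
print(only_ascii_letters_and("he7lo"))              # False
print(only_ascii_letters_and("hello world", ""))    # False: space not allowed
```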
