Regular expressions in python specific string plus numeric [duplicate] - python

I'm having trouble finding the correct regular expression for the scenario below:
Let's say:
a = "this is a sample"
I want to match whole word - for example match "hi" should return False since "hi" is not a word and "is" should return True since there is no alpha character on the left and on the right side.

Try
re.search(r'\bis\b', your_string)
From the docs:
\b Matches the empty string, but only at the beginning or end of a word.
Note that the re module uses a naive definition of "word" as a "sequence of alphanumeric or underscore characters", where "alphanumeric" depends on locale or unicode options.
Also note that without the raw string prefix, \b is seen as "backspace" instead of regex word boundary.
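To illustrate the raw-string point, here is a minimal sketch (assuming Python 3; the variable names are only illustrative) that compares the same pattern written with and without the r prefix:

```python
import re

text = "this is a sample"

# Without the r prefix, "\b" is the backspace character (\x08),
# so the pattern looks for backspace + "is" + backspace.
plain = re.search("\bis\b", text)

# With the r prefix, \b reaches the regex engine as a word boundary.
raw = re.search(r"\bis\b", text)

print(plain)        # None
print(raw.group())  # is
print(raw.span())   # (5, 7)
```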

Try using the "word boundary" character class in the regex module, re:
x="this is a sample"
y="this isis a sample."
regex=re.compile(r"\bis\b") # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
regex.findall(y)
[]
regex.findall(x)
['is']
From the documentation of re.search().
\b matches the empty string, but only at the beginning or end of a word
...
For example r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'
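The documentation's examples can be checked directly. The following is a small sketch (assuming Python 3; samples and results are names made up for this illustration) that runs the pattern against each string:

```python
import re

pattern = re.compile(r"\bfoo\b")

samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]

# Map each sample to True/False depending on whether the whole word is found.
results = {s: bool(pattern.search(s)) for s in samples}

for sample, found in results.items():
    print(f"{sample!r}: {found}")
```

The first four strings should report True, and 'foobar' and 'foo3' should report False, because a letter or digit directly after "foo" means there is no word boundary there.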

I think that the behavior desired by the OP was not completely achieved using the answers given. Specifically, the desired output of a boolean was not accomplished. The answers given do help illustrate the concept, and I think they are excellent. Perhaps I can illustrate what I mean by stating that I think that the OP used the examples used because of the following.
The string given was,
a = "this is a sample"
The OP then stated,
I want to match whole word - for example match "hi" should return False since "hi" is not a word ...
As I understand, the reference is to the search token, "hi" as it is found in the word, "this". If someone were to search the string, a for the word "hi", they should receive False as the response.
The OP continues,
... and "is" should return True since there is no alpha character on the left and on the right side.
In this case, the reference is to the search token "is" as it is found in the word "is". I hope this helps clarify things as to why we use word boundaries. The other answers have the behavior of "don't return a word unless that word is found by itself -- not inside of other words." The "word boundary" shorthand character class does this job nicely.
Only the word "is" has been used in examples up to this point. I think that these answers are correct, but I think that there is more of the question's fundamental meaning that needs to be addressed. The behavior of other search strings should be noted to understand the concept. In other words, we need to generalize the (excellent) answer by #georg using re.search(r"\bis\b", your_string). The same r"\bis\b" concept is also used in the answer by #OmPrakash, who started the generalizing discussion by showing
>>> y="this isis a sample."
>>> regex=re.compile(r"\bis\b") # For ignore case: re.compile(r"\bis\b", re.IGNORECASE)
>>> regex.findall(y)
[]
Let's say the method which should exhibit the behavior I've discussed is named
find_only_whole_word(search_string, input_string)
The following behavior should then be expected.
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
Once again, this is how I understand the OP's question. We have a step towards that behavior with the answer from #georg, but it's a little hard to interpret/implement. To wit:
>>> import re
>>> a = "this is a sample"
>>> re.search(r"\bis\b", a)
<_sre.SRE_Match object; span=(5, 7), match='is'>
>>> re.search(r"\bhi\b", a)
>>>
There is no output from the second command. The useful answer from #OmPrakesh shows output, but not True or False.
Here's a more complete sampling of the behavior to be expected.
>>> find_only_whole_word("this", a)
True
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("a", a)
True
>>> find_only_whole_word("sample", a)
True
# Use "ample", part of the word, "sample": (s)ample
>>> find_only_whole_word("ample", a)
False
# (t)his
>>> find_only_whole_word("his", a)
False
# (sa)mpl(e)
>>> find_only_whole_word("mpl", a)
False
# Any random word
>>> find_only_whole_word("applesauce", a)
False
>>>
This can be accomplished by the following code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
##file find_only_whole_word.py
import re

def find_only_whole_word(search_string, input_string):
    # Build a regex with word boundaries around the user's search_string
    raw_search_string = r"\b" + search_string + r"\b"
    match_output = re.search(raw_search_string, input_string)
    ##As noted by #OmPrakesh, if you want to ignore case, uncomment
    ##the next two lines
    #match_output = re.search(raw_search_string, input_string,
    #                         flags=re.IGNORECASE)
    no_match_was_found = ( match_output is None )
    if no_match_was_found:
        return False
    else:
        return True
##endof: find_only_whole_word(search_string, input_string)
A simple demonstration follows. Run the Python interpreter from the same directory where you saved the file, find_only_whole_word.py.
>>> from find_only_whole_word import find_only_whole_word
>>> a = "this is a sample"
>>> find_only_whole_word("hi", a)
False
>>> find_only_whole_word("is", a)
True
>>> find_only_whole_word("cucumber", a)
False
# The excellent example from #OmPrakash
>>> find_only_whole_word("is", "this isis a sample")
False
>>>

The trouble with regex is that if the string you want to search for in another string contains regex metacharacters, it gets complicated. Any string with brackets will fail.
This code will find a word
word="is"
srchedStr="this is a sample"
if srchedStr.find(" "+word+" ") >=0 or \
   srchedStr.endswith(" "+word):
    <do stuff>
The first part of the conditional searches for the text with a space on each side, and the second part catches the end-of-string situation. Note that endswith returns a boolean whereas find returns an integer. This check does not catch the word at the very start of the string.
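If you would rather stay with regex, the metacharacter problem can be avoided by escaping the search word first. The following is only a sketch, assuming Python 3; contains_whole_word is a hypothetical helper name, not something from the answers above:

```python
import re

def contains_whole_word(word, text):
    """Return True if `word` occurs in `text` with no word character
    directly before or after it. (Hypothetical helper name.)"""
    # re.escape makes regex metacharacters in `word` match literally.
    # Lookarounds are used instead of \b so that words which begin or
    # end with punctuation, such as "(is)", are still handled.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text) is not None

print(contains_whole_word("is", "this is a sample"))       # True
print(contains_whole_word("hi", "this is a sample"))       # False
print(contains_whole_word("is", "is this it"))             # True
print(contains_whole_word("(is)", "this (is) a sample"))   # True
```

Unlike the space-based check, this also finds the word at the start of the string or next to punctuation.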

Related

simple regex pattern not matching [duplicate]

>>> import re
>>> s = 'this is a test'
>>> reg1 = re.compile('test$')
>>> match1 = reg1.match(s)
>>> print match1
None
In Kiki, that pattern matches the "test" at the end of s. What am I missing? (I tried re.compile(r'test$') as well.)
Use
match1 = reg1.search(s)
instead. The match function only matches at the start of the string ... see the documentation here:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
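A short sketch of the difference, assuming Python 3 (re.fullmatch is included for comparison and requires Python 3.4 or later):

```python
import re

s = "this is a test"
reg1 = re.compile(r"test$")

# match only tries at position 0, so it fails here.
print(reg1.match(s))            # None

# search scans the string and finds the trailing "test".
print(reg1.search(s).group())   # test
print(reg1.search(s).span())    # (10, 14)

# fullmatch succeeds only if the pattern covers the entire string.
print(re.fullmatch(r"this is a test", s) is not None)  # True
```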
Your regex does not match the full string. You can use search instead as Useless mentioned, or you can change your regex to match the full string:
'^this is a test$'
Or somewhat harder to read but somewhat less useless:
'^t[^t]*test$'
It depends on what you're trying to do.
That's because the match method returns None if it can't find the expected pattern; if it finds the pattern, it returns an object of type _sre.SRE_Match.
So, if you want a Boolean (True or False) result from match, you must check whether the result is None.
You could check whether the text matches like this:
import re
string_to_evaluate = "Your text that needs to be examined"
expected_pattern = "pattern"
if re.match(expected_pattern, string_to_evaluate) is not None:
    print("The text is as you expected!")
else:
    print("The text is not as you expected!")

python regular expression : How can I filter only special characters?

I want to check whether given words contain a special character or not.
Below is my Python code.
The literal 'a#bcd' has '#', so it is matched, which is OK.
But 'a1bcd' has no special character, and it was matched too!
import re
regexp = re.compile('[~`!##$%^&*()-_=+\[\]{}\\|;:\'\",.<>/?]+')
if regexp.search('a#bcd') :
    print 'matched!! nich catch!!'
if regexp.search('a1bcd') :
    print 'something is wrong here!!!'
result :
python ../special_char.py
matched!! nich catch!!
something is wrong here!!!
I have no idea why it works like above..someone help me..T_T;;;
thanks~
Move the dash in your regular expression to the start of the [] group, like this:
regexp = re.compile('[-~`!##$%^&*()_=+\[\]{}\\|;:\'\",.<>/?]+')
Where you had the dash, it was read with the surrounding characters as )-_ and since it is inside [] it is interpreted as asking to match a range from ) to _. If you move the dash to just after the [ it has no special meaning and instead matches itself.
Here's an interactive session showing the specific problem there was in your regular expression:
>>> import re
>>> print re.search('[)-_]', 'abcd')
None
>>> print re.search('[)-_]', 'a1b')
<_sre.SRE_Match object at 0x7f71082247e8>
>>> print re.search('[)-_]', 'a1b').group(0)
1
After fixing it:
>>> print re.search('[-)_]', 'a1b')
None
Unless there's some reason not visible in your question, I'd also say that the final + is not needed.
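To see why the accidental range catches digits, it can help to list the characters it covers. This is a rough sketch assuming Python 3; the variable names are made up for the illustration:

```python
import re

# Characters covered by the accidental range ")-_" (code points 41 to 95).
covered = [chr(c) for c in range(ord(")"), ord("_") + 1)]

print("1" in covered)   # True: digits fall inside the range
print("A" in covered)   # True: so do uppercase letters
print("a" in covered)   # False: lowercase letters come after "_"

# A hyphen placed first (or escaped) in the class is a literal hyphen.
literal_first = re.compile(r"[-)_]")
literal_escaped = re.compile(r"[)\-_]")
print(literal_first.search("a1b"))            # None
print(literal_escaped.search("a-b").group())  # -
```

Escaping the hyphen is an alternative to moving it, which may be easier to keep correct when the class is edited later.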
re will be relatively slow for this
I'd suggest trying
specialchars = '''-~`!##$%^&*()_=+[]{}\\|;:'",.<>/?'''
len(word) != len(word.translate(None, specialchars))
or
set(word) & set(specialchars)
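Note that word.translate(None, specialchars) is the Python 2 form. A possible Python 3 equivalent is sketched below; the specialchars string is an illustrative set that you would adjust to your own list, and has_special and strip_special are hypothetical helper names:

```python
# Illustrative set of "special" characters (an assumption; adjust as needed).
specialchars = "-~`!#$%^&*()_=+[]{}\\|;:'\",.<>/?"

def has_special(word):
    """Return True if `word` contains any character from specialchars."""
    return any(ch in specialchars for ch in word)

# In Python 3, str.translate takes a table built by str.maketrans;
# the third argument lists the characters to delete.
delete_table = str.maketrans("", "", specialchars)

def strip_special(word):
    """Return `word` with every character from specialchars removed."""
    return word.translate(delete_table)

print(has_special("a#bcd"))    # True
print(has_special("a1bcd"))    # False
print(strip_special("a#b-c"))  # abc
```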

Python regular expression; why do the search & match appear to find alpha chars in a number string?

I'm running the search below in IDLE, with Python 2.7, in a Windows Bus. 64-bit environment.
According to RegexBuddy, the search pattern ('patternalphaonly') should not produce a match against a string of digits.
I looked at "http://docs.python.org/howto/regex.html", but did not see anything there that would explain why the search and match appear to be successful in finding something matching the pattern.
Does anyone know what I'm doing wrong, or misunderstanding?
>>> import re
>>> numberstring = '3534543234543'
>>> patternalphaonly = re.compile('[a-zA-Z]*')
>>> result = patternalphaonly.search(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
>>> result = patternalphaonly.match(numberstring)
>>> print result
<_sre.SRE_Match object at 0x02CEAD40>
Thanks
The star operator (*) indicates zero or more repetitions. Your string has zero repetitions of an English alphabet letter because it is entirely numbers, which is perfectly valid when using the star (repeat zero times). Instead use the + operator, which signifies one or more repetitions. Example:
>>> n = "3534543234543"
>>> r1 = re.compile("[a-zA-Z]*")
>>> r1.match(n)
<_sre.SRE_Match object at 0x07D85720>
>>> r2 = re.compile("[a-zA-Z]+") #using the + operator to make sure we have at least one letter
>>> r2.match(n)
Helpful link on repetition operators.
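Inspecting the match object makes the zero-repetition case visible. A minimal sketch, assuming Python 3:

```python
import re

n = "3534543234543"

star = re.match(r"[a-zA-Z]*", n)
plus = re.match(r"[a-zA-Z]+", n)

# The star pattern succeeds, but what it matched is the empty string.
print(repr(star.group()))  # ''
print(star.span())         # (0, 0)

# The plus pattern needs at least one letter, so it fails.
print(plus)                # None
```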
Everything eldarerathis says is true. However, with a variable named: 'patternalphaonly' I would assume that the author wants to verify that a string is composed of alpha chars only. If this is true then I would add additional end-of-string anchors to the regex like so:
patternalphaonly = re.compile('^[a-zA-Z]+$')
result = patternalphaonly.search(numberstring)
Or, better yet, since this will only ever match at the beginning of the string, use the preferred match method:
patternalphaonly = re.compile('[a-zA-Z]+$')
result = patternalphaonly.match(numberstring)
(Which, as John Machin has pointed out, is evidently faster for some as-yet unexplained reason.)
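On Python 3.4 or later, re.fullmatch is another way to express "the whole string is letters" without writing anchors. This is only a sketch, and is_alpha_only is a hypothetical helper name:

```python
import re

def is_alpha_only(text):
    """Hypothetical helper: True if text is one or more ASCII letters."""
    return re.fullmatch(r"[a-zA-Z]+", text) is not None

print(is_alpha_only("3534543234543"))  # False
print(is_alpha_only("abcXYZ"))         # True
print(is_alpha_only("abc123"))         # False
print(is_alpha_only(""))               # False

# "$" also matches just before a trailing newline; fullmatch does not.
print(re.match(r"[a-zA-Z]+$", "abc\n") is not None)  # True
print(is_alpha_only("abc\n"))                         # False
```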

Why is this regular expression not working ({m, n})?

Trying to understand regular expressions and I am on the repetitions part: {m, n}.
I have this code:
>>> p = re.compile('a{1}b{1, 3}')
>>> p.match('ab')
>>> p.match('abbb')
As you can see both the strings are not matching the pattern. Why is this happening?
You shouldn't put a space after the comma, and the {1} is redundant.
Try
p = re.compile('a{1}b{1,3}')
...and mind the space.
Remove the extra whitespace in b.
Change:
p = re.compile('a{1}b{1, 3}')
to:
p = re.compile('a{1}b{1,3}')
                        ^ # no whitespace
and all should be well.
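With the space removed, the quantifier behaves as expected. A quick sketch, assuming Python 3 (the {1} is dropped here since it is redundant):

```python
import re

p = re.compile(r"ab{1,3}")   # one "a", then one to three "b"s

print(p.match("ab").group())     # ab
print(p.match("abbb").group())   # abbb
print(p.match("abbbb").group())  # abbb (the fourth "b" is left over)
print(p.fullmatch("abbbb"))      # None
print(p.match("a"))              # None
```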
You are seeing some re behaviour that is very "dark corner", nigh on a bug (or two).
# Python 2.7.1
>>> import re
>>> pat = r"b{1, 3}\Z"
>>> bool(re.match(pat, "bb"))
False
>>> bool(re.match(pat, "b{1, 3}"))
True
>>> bool(re.match(pat, "bb", re.VERBOSE))
False
>>> bool(re.match(pat, "b{1, 3}", re.VERBOSE))
False
>>> bool(re.match(pat, "b{1,3}", re.VERBOSE))
True
>>>
In other words, the pattern "b{1, 3}" matches the literal text "b{1, 3}" in normal mode, and the literal text "b{1,3}" in VERBOSE mode.
The "Law of Least Astonishment" would suggest either (1) the space in front of the 3 was ignored and it matched "b", "bb", or "bbb" as appropriate [preferable] or (2) an exception at compile time.
Looking at it another way: Two possibilities: (a) The person who writes "{1, 3}" is imbued with the spirit of PEP8 and believes it is prescriptive and applies everywhere (b) The person who writes that has tested re undocumented behaviour and actually wants to match the literal text "b{1, 3}" and perversely wants to use r"b{1, 3}" instead of explicitly escaping: r"b\{1, 3}". Seems to me that (a) is much more probable than (b), and re should act accordingly.
Yet another perspective: When the space is reached, it has already parsed {, a string of digits, and a comma i.e. well into the {m,n} "operator" ... to silently ignore an unexpected character and treat it as though it was literal text is mind-boggling, perlish, etc.
Update: Bug report lodged.
Do not insert spaces between { and }.
p = re.compile('a{1}b{1,3}')
You can compile the regex with the VERBOSE flag, which means most whitespace in the regex is ignored. I think this is a very good way to describe complex regular expressions in a more readable manner.
See here for details...
Hope this helps...
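As a rough illustration of the VERBOSE style, assuming Python 3: whitespace and comments between tokens are ignored, but the quantifier itself is still written without a space, since the session further up shows that VERBOSE did not turn {1, 3} into a quantifier.

```python
import re

p = re.compile(r"""
    a          # a single "a"
    b{1,3}     # followed by one to three "b"s
    """, re.VERBOSE)

print(p.match("ab").group())    # ab
print(p.match("abbb").group())  # abbb
print(p.match("ba"))            # None
```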

finding and returning a string with a specified prefix

I am close but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print p.group(1)
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s) # this is the regex you described
print p.group(1) # first thing matched inside of ( .. )
But as usual with regex, there are tons of examples that break this, for example if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
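To compare the candidate patterns side by side, here is a small sketch assuming Python 3; the patterns dictionary and the capture helper are names invented for this illustration:

```python
import re

patterns = {
    "non-space": r"#([^ ]+)",
    "word chars": r"#(\w+)",
    "ascii letters": r"#([a-zA-Z]+)",
}

def capture(name, text):
    """Return group(1) of the named pattern applied to text."""
    return re.search(patterns[name], text).group(1)

samples = ["Hi there #guy, what's with the comma?", "Hi there #guy22"]
for s in samples:
    for name in patterns:
        print(f"{name:14} -> {capture(name, s)}")
```

For the first sample, only the non-space pattern keeps the comma; for the second, only the letters-only pattern drops the digits.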
p.group(0) returns the whole matched text; with a pattern such as r'#(\w+)', p.group(1) returns guy. If you want to find out what attributes and methods an object has, you can use the built-in dir(p) function. This returns a list of the attributes and methods available for that object instance.
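Since re.search returns None when nothing matches, it is worth checking the result before calling group. A minimal sketch assuming Python 3; first_tag is a hypothetical helper name:

```python
import re

def first_tag(text):
    """Hypothetical helper: return the word after the first '#', or None."""
    m = re.search(r"#(\w+)", text)
    # re.search returns None when nothing matches, so check before .group()
    return m.group(1) if m else None

print(first_tag("Hi there #guy"))            # guy
print(first_tag("no tag here"))              # None

# findall returns every captured group when there may be several tags.
print(re.findall(r"#(\w+)", "#a and #b_2"))  # ['a', 'b_2']
```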
As is evident from the answers so far, regex is the most efficient solution for your problem. The answers differ slightly regarding what you allow to follow the #:
[^ ] anything but space
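\w in python-2.x is equivalent to [A-Za-z0-9_] by default; in py3k it matches Unicode word characters by default for str patterns (use the re.ASCII flag to restrict it to [A-Za-z0-9_])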
If you have better idea what characters might be included in the user name you might adjust your regex to reflect that, e.g., only lower case ascii letters, would be:
[a-z]
NB: I skipped quantifiers for simplicity.
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: "If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space." But this is incorrect: that pattern is a character class which will match ONE character in the set #/.* and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy" but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
