find a word in a sentence using regular expression - python

I am trying to find a word (a complete word) in a sentence. Let's say the sentence is
Str1 = "1. how are you doing"
and that I am interested in finding if
Str2 = "1."
is in it. If I do,
re.search(r"%s\b" % Str2, Str1, re.IGNORECASE)
it should report that a match was found, shouldn't it? But re.search fails for this query. Why?

There are two things wrong here:
\b matches a position between a word and a non-word character, that is, between a letter, digit or underscore and a character outside that set. You are trying to match the boundary between a . and a space; both are non-word characters, so the \b anchor can never match there.
You are handing re a 1., which means 'match a 1 followed by any one character (other than a newline)'. You'd need to escape the dot with re.escape() to match a literal . character.
The following works better:
re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
Now it'll match your input literally and look for a following whitespace character or the end of the string. The (?:...) creates a non-capturing group (a sensible default unless you specifically need to capture sections of the match); inside the group, the | pipe separates two alternatives: either match \s (whitespace) or match $ (end of the string, or just before a trailing newline). You can expand this as needed.
Demo:
>>> import re
>>> Str1 = "1. how are you doing"
>>> Str2 = "1."
>>> re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
<_sre.SRE_Match object at 0x10457eed0>
>>> _.group(0)
'1. '
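The two problems can also be looked at in isolation. The following is a minimal Python 3 sketch; the helper name contains_term is made up for illustration and is not part of the re module.

```python
import re

Str1 = "1. how are you doing"
Str2 = "1."

# \b needs a word character on exactly one side; "." and " " are both non-word.
no_boundary = re.search(r"%s\b" % re.escape(Str2), Str1)

# An unescaped "." matches any character, so the pattern "1." also matches "1x".
unescaped = re.search(Str2, "1x how are you doing")


def contains_term(term, text):
    """Hypothetical helper: literal term followed by whitespace or end of string."""
    return re.search(r"%s(?:\s|$)" % re.escape(term), text, re.IGNORECASE) is not None


print(no_boundary)                     # None
print(unescaped.group(0))              # 1x
print(contains_term(Str2, Str1))       # True
print(contains_term(Str2, "1.5 how"))  # False
```

Escaping with re.escape() keeps the helper safe for any search term, not just ones that happen to contain a dot.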

Related

Match a piece of text from the beginning up to the first occurrence of multicharacter substring

I want a regex search to end when it reaches ". ", but not when it reaches "."; I'm aware of using [^...] to exclude single characters, and have been using this to stop my search when it reaches a certain character. This does not work with strings though, as [^. ] stops when it reaches either character. Say I've got the code
import re
def main():
    my_string = "The value of the float is 2.5. The int's value is 2.\n"
    re.search("[^.]*", my_string)
main()
Which gives a match object with the string
"The value of the float is 2"
How can I change this so that it only stops after the string ". "?
Bonus question, is there any way to tell regex to stop whenever it reaches one of multiple strings? Using the above code as an example, if I wanted the search to end when it found the string ". " or the string ".\n", how would I go about it? Thanks!
To match from the start of a string up to the first . that is followed by whitespace, use
^(.*?)\.\s
If you only want to accept a space or a newline after the dot, use either of the following. The character class (the second pattern) is the better choice when every alternative is a single character; use alternation when some alternatives are multicharacter strings.
^(.*?)\.(?: |\n)
^(.*?)\.[ \n]
See the regex demo.
Details
^ - start of a string
(.*?) - Capturing group 1: any 0+ chars other than linebreak chars, as few as possible
\. - a literal . char
\s - a whitespace char
(?: |\n) / [ \n] - a non-capturing group with alternation (|) or a character class, each matching either a space or a newline.
Python demo:
import re
my_string = "The value of the float is 2.5. The int's value is 2.\n"
m = re.search(r"^(.*?)\.\s", my_string)  # Try to find a match
if m:  # If there is a match
    print(m.group(1))  # Show Group 1 value
NOTE: If there can be line breaks in the part you want to capture, pass the re.S (re.DOTALL) flag so that . also matches line breaks:
m = re.search(r"^(.*?)\.\s", my_string, re.DOTALL)
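For the bonus question, a short sketch of stopping at whichever of several multicharacter strings comes first, and of what re.DOTALL changes. The variable names and the second sample string are assumptions made for illustration.

```python
import re

my_string = "The value of the float is 2.5. The int's value is 2.\n"

# Stop at the first ". " or ".\n"; further multicharacter stop strings
# can be added as extra alternatives inside the group.
pattern = r"^(.*?)(?:\. |\.\n)"

first = re.search(pattern, my_string)
print(first.group(1))  # The value of the float is 2.5

# Assumed sample with a line break before the first stop string.
multi_line = "line one\nline two. rest"

# Without re.DOTALL the dot cannot cross the line break, so nothing matches.
no_dotall = re.search(pattern, multi_line)
print(no_dotall)  # None

# With re.DOTALL the lazy .*? may run across the line break.
with_dotall = re.search(pattern, multi_line, re.DOTALL)
print(repr(with_dotall.group(1)))  # 'line one\nline two'
```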
Besides the classic approach explained by Wiktor, splitting may also be an interesting solution in this case.
>>> my_string
"The value of the float is 2.5. The int's value is 2.\n"
>>> re.split(r'\. |\.\n', my_string)
['The value of the float is 2.5', "The int's value is 2", '']
If you want to include periods at the end of the sentence, you can do something like this:
['{}.'.format(sentence) for sentence in re.split(r'\. |\.\n', my_string) if sentence]
To handle multiple whitespace characters between the sentences:
>>> str2 = "The value of the float is 2.5. The int's value is 2.\n\n "
>>> ['{}.'.format(sentence)
...  for sentence in re.split(r'\. \s*|\.\n\s*', str2)
...  if sentence
... ]
['The value of the float is 2.5.', "The int's value is 2."]
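A possible variation, not taken from the answers above: splitting on the whitespace that follows a period, using a lookbehind, keeps each period attached to its sentence, so nothing has to be re-appended afterwards. This is a sketch under the assumption that sentences end in a period followed by whitespace.

```python
import re

my_string = "The value of the float is 2.5. The int's value is 2.\n\n "

# (?<=\.) only checks that a period precedes the whitespace; it consumes
# nothing, so the period stays with the sentence on the left.
sentences = re.split(r"(?<=\.)\s+", my_string.strip())

print(sentences)
```

The strip() call removes the trailing whitespace first, which avoids an empty string at the end of the result.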

Word boundary doesn't work at end of non word character

>>> import re
>>> re.findall(r'(?i)fizz\<buzz\>\b', 'fizz<buzz> - ANGLES', re.U)
[]
>>> re.findall(r'(?i)fizz\<buzz\>', 'fizz<buzz> - ANGLES', re.U)
['fizz<buzz>']
The pattern must also match strings like fizzbuzz, i.e. terms made up only of word characters, but not when they occur inside other words. How can I accomplish this if \b cannot match after a non-word character?
If you know that your pattern ends with a non-word character, you can use the non-word boundary \B. If you can't be sure, you can use the lookahead (?!\w) to ensure that what follows is not a word character.
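A minimal sketch of the lookahead approach, combined with a matching lookbehind so the term is not found inside other words either. The helper name find_whole is hypothetical, and the extra sample strings are assumptions.

```python
import re


def find_whole(term, text):
    """Hypothetical helper: find term where it is not glued to word characters."""
    # (?<!\w) and (?!\w) work whether the term starts/ends with a word
    # character or not, unlike \b.
    pattern = r"(?<!\w)%s(?!\w)" % re.escape(term)
    return re.findall(pattern, text, re.IGNORECASE)


print(find_whole("fizz<buzz>", "fizz<buzz> - ANGLES"))  # ['fizz<buzz>']
print(find_whole("fizzbuzz", "a fizzbuzz here"))        # ['fizzbuzz']
print(find_whole("fizzbuzz", "fizzbuzzer"))             # []
print(find_whole("fizz<buzz>", "fizz<buzz>er"))         # []
```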

How to capture the word with space around without capturing the space?

I've got a string like this: s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld ". The idea is to catch all the hashtags, but each hashtag needs to have a boundary (whitespace or a string edge) around it, e.g. nothing in "Hello this is helloworld#helloworld" should be captured.
I want to get the following result: ["#helloworld","#hiworld","#nihaoworld"]
I've got the following python code
import re
print(re.findall(r'(?:^|\s+)(#[a-z]{1,})(?:\s+|$)', s))
The result I got is ["#helloworld","#nihaoworld"], with the middle word missing.
If the string contains nothing but hashtags, I don't think you really need a regular expression for this; you can just use:
s.strip().split()
However, if you do want to use a regex, you could just use (?:^|\s)(#\w+):
>>> import re
>>> s = " #helloworld #hiworld #nihaoworld "
>>> re.findall(r'(?:^|\s)(#\w+)', s)
['#helloworld', '#hiworld', '#nihaoworld']
Explanation:
Non-capturing group (?:^|\s)
  1st alternative ^
    ^ asserts position at the start of the string
  2nd alternative \s
    \s matches any whitespace character (equal to [\r\n\t\f\v ])
1st capturing group (#\w+)
  # matches the character # literally (case sensitive)
  \w+ matches any word character (equal to [a-zA-Z0-9_])
    + quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
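The pattern in the question misses the middle hashtag because its trailing \s+ consumes the whitespace that the next match needs in front of it. One possible variation, sketched here as an assumption rather than as part of the answer above, checks both sides with lookarounds, which consume nothing:

```python
import re

# (?<!\S) means "not preceded by a non-whitespace character" (start of string
# or whitespace); (?!\S) is the mirror image for the right-hand side.
hashtag = re.compile(r"(?<!\S)#\w+(?!\S)")

s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld "
found = hashtag.findall(s)
glued = hashtag.findall("Hello this is helloworld#helloworld")
mixed = hashtag.findall("#ok #bad#tag")

print(found)  # ['#helloworld', '#hiworld', '#nihaoworld']
print(glued)  # []
print(mixed)  # ['#ok']
```

Because neither lookaround consumes the surrounding whitespace, adjacent hashtags separated by a single space are all found.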

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in them, but not at the beginning and not at the end of the word.
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want to do is a regex like this:
\w+-\w+
This means: find one or more word characters (letters, digits or underscores), as indicated by the '+', then a '-', followed by one or more word characters again.
str is a built-in name, so it is better not to use it as a variable name.
import re
st = 'word semi-column peace'
# \w+ is the part before the hyphen, - the hyphen itself, \w+ the part after it
print(re.findall(r"\b\w+-\w+\b", st))
['semi-column']
Quoting the question: "a '-' (minus sign) in it but not at the beginning and not at the end of the word"
Since "-" is not a word character, you can't use word boundaries (\b) to rule out words with hyphens at the beginning or end. A string like "-not-wanted-" still produces a match (not-wanted) with both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = re.findall(pattern, text)
print(result)  # ['semi-column', 'one-word', 'multi-hyphenated-word']
You can also use the following regex:
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+\-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, I match in both directions until a whitespace character or another hyphen is reached, so for a word surrounded by hyphens (e.g. -test-cats-) the outer hyphens are left out of the match. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# [^\s-]+ matches a run of characters that are neither whitespace nor a hyphen
m = re.search(r'([^\s-]+-[^\s-]+)', st)
if m:
    print(m.group(1))

Python regex boundary

Is there an error in the way python handles '.' or '\b'? I'm not sure why this produces differing results.
import re
regex1 = r'\.?\b'
print(bool(re.match(regex1, '.')))
regex2 = r'a?\b'
print(bool(re.match(regex2, 'a')))
Output:
False
True
\b, the word boundary, matches at a position that has a word character on one side and a non-word character (or the start or end of the string) on the other. As such, it will match between a word character like a and the end of the string, but not between a non-word character like . and the end of the string.
As geekosaur pointed out, \b is merely a short way of writing
(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
In your case you may want to use
(?!\w)
or
(?!\S)
instead of \b.
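A small sketch to check the expansion of \b against a handful of sample strings and to show the suggested lookahead on the inputs from the question. The helper name positions and the sample strings are assumptions for illustration.

```python
import re

expanded = r"(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))"


def positions(pattern, text):
    """Hypothetical helper: start offsets of every (zero-width) match."""
    return [m.start() for m in re.finditer(pattern, text)]


samples = [".", "a", "a.", ".a", "ab cd", ""]

# Strings for which \b and its expansion disagree; expected to stay empty.
mismatches = [t for t in samples if positions(r"\b", t) != positions(expanded, t)]
print(mismatches)  # []

# (?!\w) only requires that no word character follows, so it also
# succeeds after a non-word character such as "."
dot_ok = bool(re.match(r"\.?(?!\w)", "."))
a_ok = bool(re.match(r"a?(?!\w)", "a"))
ab_ok = bool(re.match(r"a?(?!\w)", "ab"))
print(dot_ok, a_ok, ab_ok)  # True True False
```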
