Regex: match all words starting with two given letters (Python)

I wish to find all words that start with "Am" and this is what I tried so far with python
import re
my_string = "America's mom, American"
re.findall(r'\b[Am][a-zA-Z]+\b', my_string)
but this is the output that I get
['America', 'mom', 'American']
Instead of what I want
['America', 'American']
I know that in regex [Am] means match either A or m, but is it possible to match A and m as well?

The [Am], a positive character class, matches either A or m. To match a sequence of chars, you need to use them one after another.
Remove the brackets:
import re
my_string = "America's mom, American"
print(re.findall(r'\bAm[a-zA-Z]+\b', my_string))
# => ['America', 'American']
Pattern details:
\b - a word boundary
Am - a string of chars matched as a sequence Am
[a-zA-Z]+ - 1 or more ASCII letters
\b - a word boundary.
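The same idea extends to more than one prefix: a non-capturing group with alternation keeps each prefix as a sequence. This is a minimal sketch; the `Br` prefix and the extra words in the sample text are made up for illustration and are not part of the question.

```python
import re

# Sample text extended with extra words for illustration (not from the question).
text = "America's mom, American Brazil bran"

# [Am] is a character class: ONE character, either 'A' or 'm'.
class_matches = re.findall(r'\b[Am][a-zA-Z]+\b', text)

# (?:Am|Br) is an alternation of two-character sequences.
seq_matches = re.findall(r'\b(?:Am|Br)[a-zA-Z]+\b', text)

print(class_matches)  # ['America', 'mom', 'American']
print(seq_matches)    # ['America', 'American', 'Brazil']
```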

Don't use a character class:
import re
my_string = "America's mom, American"
print(re.findall(r'\bAm[a-zA-Z]+\b', my_string))

re.findall(r'(Am\w+)', my_string, re.I)  # re.I makes the match case-insensitive

Related

Find all symbols and spaces together between two alphanumeric characters in Python3

In this problem I'm trying to find symbols and spaces between two alphanumeric characters. I am using regular expressions, but I cannot get the result I want. Any useful tricks for this code are appreciated (regex solutions only):
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w(.[#_!#$%^&*()<>?/\|}{~:\s]*)\w' # needs to be fixed
re.findall(regex_pattern, s)
Output is:
['h', '$#', '% ', 't']
Expected output is:
['$#', '% ']
You can try this simple pattern:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'(?<=\w)[^\w]+?(?=\w)'
print(re.findall(regex_pattern, s))
Output:
['$#', '% ']
Basically, the pattern (?<=\w)[^\w]+?(?=\w) searches for runs of one or more non-word characters (anything other than letters, digits and underscore) that sit between two word characters.
Using a re.findall approach:
import re
s = "This$#is% Matrix# %!"
matches = re.findall(r'(?<=\w)[#_!#$%^&*()<>?/\|}{~:\s]+(?=\w)', s)
print(matches) # ['$#', '% ']
This approach is similar to yours, except that it simply searches for a sequence of symbols or whitespace characters which are surrounded on both sides by word characters.
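One practical difference between asserting the surrounding word characters with lookarounds and consuming them with \w shows up when separators are only one word character apart. The sketch below uses a made-up input string to illustrate this; it is not taken from the question.

```python
import re

# Hypothetical input where the separators are one word character apart.
s = "a$b%c"

# Lookarounds assert the neighbours without consuming them,
# so 'b' can close the first match and open the second.
with_lookarounds = re.findall(r'(?<=\w)\W+(?=\w)', s)

# Here \w consumes 'b' as part of the first match, so '%' has no
# word character left in front of it.
with_consuming = re.findall(r'\w(\W+)\w', s)

print(with_lookarounds)  # ['$', '%']
print(with_consuming)    # ['$']
```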
Your regex starts the group with . (any character) and then applies the quantifier * (0 or more) to the class, so the group can match a single letter with no symbols at all, which is where 'h' and 't' come from. Drop the . and use + to require one or more non-alphanumeric chars:
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\w([#_!#$%^&*()<>?/\|}{~:\s]+)\w' # + instead of *, no leading .
print(re.findall(regex_pattern, s))
Output:
['$#', '% ']
My 'trick' is to use a tool such as regex101.com to make sure the regex works before putting it in code, and to build up the regex one step at a time, so that when it stops matching you know the most recent step caused the problem.
The shortest solution is
import re
s = "This$#is% Matrix# %!"
regex_pattern = r'\b\W+\b'
print(re.findall(regex_pattern, s)) # => ['$#', '% ']
Why it works
\b - a word boundary; because it is followed by \W, it matches the position right after a word char (i.e. a letter, digit or _)
\W+ - one or more non-word chars, i.e. chars other than letters, digits and underscores
\b - a word boundary; because it comes right after \W, it matches the position that is immediately followed by a word char.
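One consequence of \W excluding the underscore: \b\W+\b never reports _ as a separator, whereas an explicit character class can. A small sketch with a made-up input string (an assumption for illustration, not from the question):

```python
import re

s = "snake_case$var"  # hypothetical input

# '_' counts as a word character, so \W never matches it.
non_word_runs = re.findall(r'\b\W+\b', s)

# An explicit class can treat '_' as a separator as well.
explicit_runs = re.findall(r'(?<=\w)[_$]+(?=\w)', s)

print(non_word_runs)   # ['$']
print(explicit_runs)   # ['_', '$']
```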

Word boundary doesn't work at end of non word character

>>> import re
>>> re.findall(r'(?i)fizz\<buzz\>\b', 'fizz<buzz> - ANGLES', re.U)
[]
>>> re.findall(r'(?i)fizz\<buzz\>', 'fizz<buzz> - ANGLES', re.U)
['fizz<buzz>']
The pattern must also match strings like fizzbuzz, i.e. terms made up only of word characters, as whole words and not inside other words. How can I accomplish this if \b does not match after a non-word character?
If you know that your pattern ends with a non-word character, you can use the non-word boundary \B. If you can't be sure, you can use the lookahead (?!\w) to make sure that what follows is not a word character.
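The lookahead variant can be wrapped so that it works for both kinds of term. This is a sketch under assumptions: the helper name `whole_term_pattern` is hypothetical, and the leading (?<!\w) lookbehind is an addition that mirrors the suggested lookahead so the start of the term is checked the same way.

```python
import re

def whole_term_pattern(term):
    """Hypothetical helper: match `term` only when it is not glued to a word character."""
    # (?<!\w) and (?!\w) check the neighbouring characters without consuming them,
    # so they work whether the term starts or ends with a word character or not.
    return re.compile(r'(?<!\w)' + re.escape(term) + r'(?!\w)', re.IGNORECASE)

angles = whole_term_pattern('fizz<buzz>')
plain = whole_term_pattern('fizzbuzz')

print(angles.findall('fizz<buzz> - ANGLES'))      # ['fizz<buzz>']
print(plain.findall('FizzBuzz and fizzbuzzer'))   # ['FizzBuzz']
```

re.escape keeps characters such as < and > literal, so the same helper can be fed arbitrary search terms.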

Replace strings in a string by a substring of those strings

Let's say I have a string like this:
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
and I want to turn it into
'(xy09 and foobar or (abc123 and something))'
then - in this particular case - I could simply do
s.replace('X_', "")
which gives the desired output.
However, in my actual data there might be not only X_ but also other prefixes, so the above replace statement does not work.
What I would need instead is a replacement of
a capital letter followed by an underscore and an arbitrary sequence of letters and numbers
by
everything after the first underscore.
So, to extract the desired elements I could use:
import re
print(re.findall('[A-Z]{1}_[a-zA-Z0-9]+', s))
which prints
['X_xy09', 'X_foobar', 'X_abc123', 'X_something']
How can I now replace those elements so that I obtain the following?
'(xy09 and foobar or (abc123 and something))'
If you need to remove an uppercase ASCII letter with an underscore after it, only when it is not preceded by a word char and is followed by an alphanumeric char, you may use
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
print(re.sub(r'\b[A-Z]_([a-zA-Z0-9])', r'\1', s))
Pattern details
\b - a leading word boundary
[A-Z]_ - an ASCII uppercase letter and _
([a-zA-Z0-9]) - Group 1 (referenced as \1 in the replacement pattern): 1 alphanumeric char.
If you just need to replace a capital letter followed by an underscore, you can use the regular expression r'[A-Z]_'.
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
print(re.sub(r'[A-Z]_', '', s))  # (xy09 and foobar or (abc123 and something))
You may need to add to it if you have other criteria not mentioned. (For example, some of your target values follow a word boundary and some follow parentheses.) The above might give you the wrong output if you have input like XY_something. It depends on what you expect the output to be.
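To make the XY_something caveat concrete, the sketch below compares the bare pattern with a variant that adds a leading word boundary and a lookahead. The input string is made up for illustration, and which output counts as "right" depends on your data.

```python
import re

s = 'XY_something and A_foo'  # hypothetical input

# Bare pattern: removes any capital letter + underscore, even mid-word.
loose = re.sub(r'[A-Z]_', '', s)

# \b requires the capital letter to start a word, so 'XY_' is left alone.
strict = re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)

print(loose)   # Xsomething and foo
print(strict)  # XY_something and foo
```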
Another re.sub() approach:
import re
s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
result = re.sub(r'[A-Z]_(?=[a-zA-Z0-9]+)', '', s)
print(result)
The output:
(xy09 and foobar or (abc123 and something))
[A-Z]_(?=[a-zA-Z0-9]+) - (?=...) positive lookahead assertion, ensures that substituted [A-Z]_ substring is followed by alphanumeric sequence [a-zA-Z0-9]+
You could use re.sub() with a lookahead assertion:
>>> import re
>>> s = '(X_xy09 and X_foobar or (X_abc123 and X_something))'
>>> re.sub(r'\b[A-Z]_(?=[a-zA-Z0-9])', '', s)
'(xy09 and foobar or (abc123 and something))'
From the docs:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
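The documentation's example can be run directly to see that the lookahead text is checked but not included in the match. A minimal sketch; the sample sentence is an assumption for illustration.

```python
import re

text = 'Isaac Asimov and Isaac Newton'  # sample sentence for illustration

# Only the 'Isaac ' that is followed by 'Asimov' matches,
# and 'Asimov' itself is not part of the match.
found = re.findall(r'Isaac (?=Asimov)', text)
replaced = re.sub(r'Isaac (?=Asimov)', 'I. ', text)

print(found)     # ['Isaac ']
print(replaced)  # I. Asimov and Isaac Newton
```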

regex - how to select a word that has a '-' in it?

I am learning Regular Expressions, so apologies for a simple question.
I want to select the words that have a '-' (minus sign) in it but not at the beginning and not at the end of the word
I tried (using findall):
r'\b-\b'
for
str = 'word semi-column peace'
but, of course got only:
['-']
Thank you!
What you actually want is a regex like this:
\w+-\w+
This means: match one or more word characters (letters, digits or underscore), as indicated by the '+', then a '-', then again one or more word characters.
str is a built-in name, so it is better not to use it as a variable name:
import re
st = 'word semi-column peace'
# \w+ word characters, then a literal -, then \w+ word characters
print(re.findall(r"\b\w+-\w+\b", st))
['semi-column']
a '-' (minus sign) in it but not at the beginning and not at the end of the word
Since "-" is not a word character, you can't use word boundaries (\b) to prevent a match in words with hyphens at the beginning or end. A string like "-not-wanted-" will match both \b\w+-\w+\b and \w+-\w+.
We need to add an extra condition before and after the word:
Before: (?<![-\w]) not preceded by a hyphen or a word character.
After: (?![-\w]) not followed by a hyphen or a word character.
Also, a word may have more than 1 hyphen in it, and we need to allow it. What we can do here is repeat the last part of the word ("hyphen and word characters") once or more:
\w+(?:-\w+)+ matches:
\w+ one or more word characters
(?:-\w+)+ a hyphen and one or more word characters, and also allows this last part to repeat.
Regex:
(?<![-\w])\w+(?:-\w+)+(?![-\w])
Code:
import re
pattern = re.compile(r'(?<![-\w])\w+(?:-\w+)+(?![-\w])')
text = "-abc word semi-column peace -not-wanted- one-word dont-match- multi-hyphenated-word"
result = pattern.findall(text)
print(result)  # ['semi-column', 'one-word', 'multi-hyphenated-word']
You can also use the following regex:
>>> import re
>>> st = "word semi-column peace"
>>> print(re.findall(r"\S+-\S+", st))
['semi-column']
You can try something like this. Centering on the hyphen, the pattern matches in both directions until it reaches whitespace or another hyphen, so hyphens surrounding the word (e.g. -test-cats-) are left out of the match. The regular expression should also work with findall.
import re
st = 'word semi-column peace'
# [^\s-]+ is one or more characters that are neither whitespace nor '-'
m = re.search(r'([^\s-]+-[^\s-]+)', st)
if m:
    print(m.group(1))

Python regex boundary

Is there an error in the way python handles '.' or '\b'? I'm not sure why this produces differing results.
import re
regex1 = r'\.?\b'
print(bool(re.match(regex1, '.')))
regex2 = r'a?\b'
print(bool(re.match(regex2, 'a')))
Output:
False
True
\b, word boundary, matches between word characters and non-word elements. As such, it will match between a word character like a and the end of the string, but not between a non-word character like . and end of string.
As geekosaur pointed out \b is merely a short way of writing
(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
In your case you may want to use
(?!\w)
or
(?!\S)
instead of \b.
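A quick check of these claims, as a sketch: the expanded form of \b behaves like \b on the '.' input, while the (?!\w) lookahead accepts both inputs. The constant name `WORD_BOUNDARY` is made up for this example.

```python
import re

# Expanded form of \b quoted above.
WORD_BOUNDARY = r'(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))'

dot_with_b = bool(re.match(r'\.?\b', '.'))
dot_with_expanded = bool(re.match(r'\.?' + WORD_BOUNDARY, '.'))
dot_with_lookahead = bool(re.match(r'\.?(?!\w)', '.'))
a_with_lookahead = bool(re.match(r'a?(?!\w)', 'a'))

print(dot_with_b)          # False
print(dot_with_expanded)   # False
print(dot_with_lookahead)  # True
print(a_with_lookahead)    # True
```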
