Python regex, searching for prefixes inside a target string - python

I need to find a list of prefixes of words inside a target string (I would like to have the list of matching indexes in the target string handled as an array).
I think using regex should be the cleanest way.
Given that I am looking for the pattern "foo", I would like to retrieve in the target string words like "foo", "Foo", "fooing", "Fooing"
Given that I am looking for the pattern "foo bar", I would like to retrieve in the target string patterns like "foo bar", "Foo bar", "foo Bar", "foo baring" (they are still all handled as prefixes, am I right?)
At the moment, after running it in different scenarios, my Python code still does not work.
I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix).
I am assuming I have to use something like ^[fF] to be case insensitive with the first letter of my prefix.
I am assuming I should use something like ".*" to let the regexp behave like a prefix.
I am assuming I should use something like prefix1|prefix2|prefix3 to combine many different prefixes with a logical OR in the pattern to search.
The following source code does not work because I am wrongly setting the txt_pattern.
import re
# ' ' ' ' ' ' '
txt_str = "edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo";
txt_pattern = ''#???
out_obj = re.match(txt_pattern,txt_str)
if out_obj:
    print("match!")
else:
    print("No match!")
What am I missing?
How should I set the txt_pattern?
Can you please suggest a good tutorial with minimal working examples? At the moment the standard tutorials from the first page of a Google search are very long and detailed, and not so simple to understand.
Thanks!

Regex is the wrong approach. First parse your string into a list of strings with one word per item. Then use a list comprehension with a filter. The split method on strings is a good way to get the list of words, then you can simply do [item for item in wordlist if item.startswith("foo")]
People spend ages hacking up inefficient code using convoluted regexes when all they need is a few string methods like split, partition, startswith and some pythonic list comprehensions or generators.
Regexes have their uses but simple string parsing is not one of them.
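As a rough sketch of the split-and-filter approach described above: the code below assumes single-word prefixes and uses lower() for case-insensitivity, which is an addition not in the answer. The names wordlist and matches are illustrative. The indexes it produces are positions in the word list rather than character offsets, and a multi-word prefix such as "foo bar" would need extra handling.

```python
txt_str = "edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo"

# One word per item; split() with no arguments splits on any whitespace.
wordlist = txt_str.split()

# Keep (word index, word) for every word that starts with the prefix.
# lower() makes the comparison case-insensitive (an assumption about the desired behaviour).
matches = [(i, word) for i, word in enumerate(wordlist) if word.lower().startswith("foo")]

for index, word in matches:
    print(index, word)
```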

I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix).
No, the ^ is an anchor that only matches the start of the string. You can use \b instead, meaning a word boundary (but remember to escape the backslash inside a string literal, or use a raw string literal).
You will also have to use re.search instead of re.match because the latter only checks the start of the string, whereas the former searches for matches anywhere in the string.
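A minimal sketch of the two points above, using a shortened version of the target string. The pattern \b[fF]oo\w* is only one possible choice; re.finditer is used here (rather than re.search alone) because the question asks for the indexes of all matches.

```python
import re

txt_str = "edb foooooo jkds Fooooooo kj"
pattern = r"\b[fF]oo\w*"  # raw string, so \b reaches the regex engine as a word boundary

# re.match only tries position 0, where "edb" is, so it finds nothing.
print(re.match(pattern, txt_str))

# re.search scans forward and stops at the first word starting with foo/Foo.
first = re.search(pattern, txt_str)
print(first.start(), first.group(0))

# re.finditer yields every match; start() gives the index in the target string.
starts = [m.start() for m in re.finditer(pattern, txt_str)]
print(starts)
```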

>>> import re
>>> s = 'Foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo'
>>> regex = r'(?i)((foo)(\w+)?)'
>>> compiled = re.compile(regex)
>>> re.findall(compiled, s)
[('Foooooo', 'Foo', 'oooo'), ('Fooooooo', 'Foo', 'ooooo'), ('fooing', 'foo', 'ing'), ('Fooing', 'Foo', 'ing'), ('foo', 'foo', ''), ('foo', 'foo', ''), ('Foo', 'Foo', '')]
(?i) -> case insensitive (recent Python versions require this flag at the very start of the pattern)
((foo)(\w+)?) -> group 1 is the whole match
(foo) -> group 2 matches foo
(\w+)? -> group 3 optionally matches the remaining word characters
>>> print([i[0] for i in re.findall(compiled, s)])
['Foooooo', 'Fooooooo', 'fooing', 'Fooing', 'foo', 'foo', 'Foo']

Try using this tool to test some stuff: http://www.pythonregex.com/
Use this reference: docs.python.org/howto/regex.html

I would use something like this for your regex:
\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*
Inside of the non-capturing group you should separate each prefix with a |. I also placed each prefix inside of its own capturing group so you can tell which prefix a given string matched, for example:
for match in re.finditer(r'\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*', txt_str):
    # Find the first capturing group that took part in the match.
    n = 1
    while not match.group(n):
        n += 1
    print("Prefix %d matched '%s'" % (n, match.group(0)))
Output:
Prefix 2 matched 'foooooo'
Prefix 2 matched 'Fooooooo'
Prefix 2 matched 'fooing'
Prefix 2 matched 'Fooing'
Prefix 1 matched 'foo baring'
Prefix 1 matched 'foo Bar'
Prefix 2 matched 'Foo'
Make sure you put longer prefixes first. If you were to put the foo prefix before the foo bar prefix, you would only match 'foo' in 'foo bar'.
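One way to avoid ordering mistakes is to build the alternation from a list and sort it by length. This is only a sketch: build_prefix_pattern is a hypothetical helper name, and it uses re.IGNORECASE, which ignores case for every letter rather than only the first letter of each word as [Ff]oo [Bb]ar does.

```python
import re

def build_prefix_pattern(prefixes):
    """Compile a pattern matching any word (or phrase) that starts with a prefix."""
    # Longest prefixes first, so "foo bar" is tried before "foo".
    ordered = sorted(prefixes, key=len, reverse=True)
    # re.escape protects any regex metacharacters inside a prefix.
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternation + r")\w*", re.IGNORECASE)

pattern = build_prefix_pattern(["foo", "foo bar"])
found = [(m.start(), m.group(0))
         for m in pattern.finditer("sxk foo baring sh foo Bar djw Foo")]
print(found)
```

Each tuple holds the character index where the match starts and the matched text, which gives the list of indexes the question asks for.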

Related

Using \b in a regex, trying not to match words that start with $

I'm having trouble getting the desired output using negative lookahead.
import re
text = "$FOO FOO $BAR BAR"
# Expected. Return words without 'F'.
re.findall(r"\b(?!F)\w+", text)
> ['BAR', 'BAR']
# Expected. Return words without 'B'.
re.findall(r"\b(?!B)\w+", text)
> ['FOO', 'FOO']
# Unexpected. Return words without '$'.
re.findall(r"\b(?!\$)\w+", text)
> ['FOO', 'FOO', 'BAR', 'BAR']
The first two work as expected. I expect the last one to return the list ['FOO', 'BAR'] matching words without the "$" character. Because it's a special character, I've tried various ways to escape it but haven't found the right solution.
You actually need to fix the pattern in the following way:
\b(?<!\$)\w+
See the Python demo.
The reason is that \b(?!\$)\w+ is equivalent to \b\w+: since \w can never match $, the (?!\$) negative lookahead placed before \w adds no restriction. You need to restrict the character that comes immediately before the first character matched with \w, and that is done with a negative lookbehind, here (?<!\$).
import re
text = "$FOO FOO $BAR BAR"
print(re.findall(r"\b(?<!\$)\w+", text))
# > ['FOO', 'BAR']
Now, as you say (?<=^)(?!\$)\w+|(?<=\s)(?!\$)\w+ works for you, you can now see that you may safely remove the lookaheads from the regex as they do not do anything meaningful, and the regex becomes (?<=^)\w+|(?<=\s)\w+. This expression can be shrunk further into a slim (?<!\S)\w+ pattern that matches any one or more word chars that are immediately preceded with start of string or a whitespace.
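To make the difference between the two suggested patterns concrete, here is a small comparison. The second test string, other, is made up for illustration: it shows that \b(?<!\$) only rejects a $ directly before the word, while (?<!\S) rejects any non-whitespace character before it.

```python
import re

text = "$FOO FOO $BAR BAR"

# On the question's input both patterns agree.
print(re.findall(r"\b(?<!\$)\w+", text))
print(re.findall(r"(?<!\S)\w+", text))

# They differ when some other non-space character precedes a word.
other = "#FOO x-BAR BAZ"
with_boundary = re.findall(r"\b(?<!\$)\w+", other)   # only '$' is excluded
with_whitespace = re.findall(r"(?<!\S)\w+", other)   # needs start of string or whitespace
print(with_boundary)
print(with_whitespace)
```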
Thanks to Charles for putting me on the right track. I had an incorrect understanding of how boundary characters function.
import re
text = "FOO $FOO FOO $BAR BAR"
re.findall(r'(?<=^)(?!\$)\w+|(?<=\s)(?!\$)\w+', text)
> ['FOO', 'FOO', 'BAR']
Replacing \b with a negative look-behind that matched a space or the beginning of a string gives the desired output.

Python Regex exclude certain prefix

Given the following string
s = '"foo" "bar2baz_foo" foo( bar2baz_foo( p_foo p_foo.'
I need a regex such that
re.findall(regex, s)
gives
['foo', 'bar2baz_foo', 'foo', 'bar2baz_foo']
So it matches the first four "words" excluding the quotes and parentheses but not the last two.
I have tried a couple different things but nothing I can come up with actually works.
Hope someone here can help.
Edit: I should add that I want to replace the results with something else and not just find it, i.e. I wanna use re.sub and not re.findall. And also the string is the content of a text file in reality and therefore much longer. I just extracted the relevant bits.
If you're not hell-bent on a pure regex solution, you could use The Greatest Regex Trick Ever.
>>> s = '"foo" "bar2baz_foo" foo( bar2baz_foo( p_foo p_foo.'
>>> import re
>>> list(filter(None, re.findall(r'p_\w*|(\w+)', s)))
['foo', 'bar2baz_foo', 'foo', 'bar2baz_foo']
Small demo for usage in re.sub:
>>> re.sub(r'p_\w*|(\w+)', lambda m: 'WORD' if m.group(1) else m.group(), s)
'"WORD" "WORD" WORD( WORD( p_foo p_foo.'
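The trick used above, as I understand it, is to put what you want to skip on the left of the alternation without a capturing group, and what you want to keep on the right inside one. Unwanted text is still matched (and so consumed), but it leaves group 1 unset. A sketch with re.finditer, where the variable names are illustrative:

```python
import re

s = '"foo" "bar2baz_foo" foo( bar2baz_foo( p_foo p_foo.'

# Left branch: words starting with p_ are matched but not captured.
# Right branch: every other word is captured in group 1.
pattern = re.compile(r'p_\w*|(\w+)')

# With finditer, a group that did not take part in the match is None.
kept = [m.group(1) for m in pattern.finditer(s) if m.group(1) is not None]
print(kept)
```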

Python regular expression to replace everything but specific words

I am trying to do the following with a regular expression:
import re
x = re.compile('[^(going)|^(you)]') # words to replace
s = 'I am going home now, thank you.' # string to modify
print(re.sub(x, '_', s))
The result I get is:
'_____going__o___no______n__you_'
The result I want is:
'_____going_________________you_'
Since ^ only negates inside brackets [], where it applies to single characters rather than whole words, this result makes sense, but I'm not sure how else to go about it.
I even tried '([^g][^o][^i][^n][^g])|([^y][^o][^u])' but it yields '_g_h___y_'.
Not quite as easy as it first appears, since the only simple "not" in REs is ^ inside [ ], which only matches one character (as you found). Here is my solution:
import re
def subit(m):
    # stuff is the text to mask; word is "going", "you" or '' at the end of the string
    stuff, word = m.groups()
    return ("_" * len(stuff)) + word

s = 'I am going home now, thank you.' # string to modify
print(re.sub(r'(.+?)(going|you|$)', subit, s))
Gives:
_____going_________________you_
To explain: the RE itself (I always use raw strings) matches one or more of any character (.+) but is non-greedy (?). This is captured in the first parenthesized group. That is followed by either "going" or "you" or the end of the line ($).
subit is a function (you can call it anything within reason) which is called for each substitution. A match object is passed, from which we can retrieve the captured groups. The first group we just need the length of, since we are replacing each character with an underscore. The returned string is substituted for that matching the pattern.
Here is a one regex approach:
>>> re.sub(r'(?!going|you)\b([\S\s]+?)(\b|$)', lambda x: (x.end() - x.start())*'_', s)
'_____going_________________you_'
The idea is that when you are dealing with words and want to exclude some of them, you need to remember that most regex engines (most of them use a traditional NFA) analyze strings character by character. Since you want to exclude two words with a negative lookahead, you need to define the allowed strings as words (using word boundaries). And because sub replaces each matched pattern with its replacement string, you can't just pass _ as the replacement: a part like 'I am' is matched as three pieces ('I', ' ', 'am') and would be replaced with only 3 underscores. Instead, pass a function as the second argument of sub and multiply _ by the length of the matched string.
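For comparison, here is a sketch of a different technique that neither answer uses: re.split with a capturing group. It relies on the fact that a capturing group in the split pattern keeps the delimiters in the result list. Like the first answer, it matches "going" and "you" anywhere, even inside longer words; wrapping the alternation in \b would be needed to restrict it to whole words.

```python
import re

s = 'I am going home now, thank you.'

# The kept words land at odd indexes, everything else at even indexes.
parts = re.split(r'(going|you)', s)

# Mask the even-indexed parts character by character, keep the odd ones.
masked = ''.join(
    part if i % 2 else '_' * len(part)
    for i, part in enumerate(parts)
)
print(masked)
```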

Regular expression to return all characters between two special characters

How would I go about using regex to return all characters between two brackets?
Here is an example:
foobar['infoNeededHere']ddd
needs to return infoNeededHere
I found a regex to do it between curly brackets but all attempts at making it work with square brackets have failed. Here is that regex: (?<={)[^}]*(?=}) and here is my attempt to hack it
(?<=[)[^}]*(?=])
Final Solution:
import re
text = "foobar['InfoNeeded'],"
match = re.match(r"^.*\['(.*)'\].*$", text)
print(match.group(1))
If you're new to REG(ular) EX(pressions), you can learn about them in the Python docs. Or, if you want a gentler introduction, you can check out the HOWTO. They use Perl-style syntax.
Regex
The expression that you need is .*?\[(.*)\].*. The group that you want will be \1.
- .*?: . matches any character but a newline. * is a meta-character and means Repeat this 0 or more times. ? makes the * non-greedy, i.e., . will match up as few chars as possible before hitting a '['.
- \[: \ escapes special meta-characters, which in this case is [. Without the backslash, [ would start a character class instead of matching a literal bracket.
- (.*): Parentheses group whatever is inside them, and you can later retrieve the groups by their numeric IDs or names (if they're given one).
- \].*: \] matches the literal closing bracket, and .* matches whatever follows it.
Implementation
First, import the re module -- it's not a built-in -- to where-ever you want to use the expression.
Then, use re.search(regex_pattern, string_to_be_tested) to search for the pattern in the string to be tested. This will return a match object, which you can store in a temporary variable. You should then call its group() method and pass 1 as an argument (to see the 'Group 1' we captured using parentheses earlier). It should now look like:
>>> import re
>>> pat = r'.*?\[(.*)].*' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd"
>>> match = re.search(pat, s)
>>> match.group(1)
"'infoNeededHere'"
An Alternative
You can also use findall() to find all the non-overlapping matches by modifying the regex to (?<=\[).+?(?=\]).
- (?<=\[): (?<=) is called a look-behind assertion and checks for an expression preceding the actual match.
- .+?: + is just like * except that it matches one or more repetitions. It is made non-greedy by ?.
- (?=\]): (?=) is a look-ahead assertion and checks for an expression following the match without consuming it.
Your code should now look like:
>>> import re
>>> pat = r'(?<=\[).+?(?=\])' #See Note at the bottom of the answer
>>> s = "foobar['infoNeededHere']ddd[andHere] [andOverHereToo[]"
>>> re.findall(pat, s)
["'infoNeededHere'", 'andHere', 'andOverHereToo[']
Note: Always use raw Python strings by adding an 'r' before the string (E.g.: r'blah blah blah').
Thanks for reading! I wrote this answer when there were no accepted ones yet, but by the time I finished it, 2 more came up and one got accepted. :(
^.*\['(.*)'\].*$ will match a line and capture what you want in a group.
You have to escape the [ and ] with \
The documentation at the rubular.com proof link will explain how the expression is formed.
If there's only one of these [.....] tokens per line, then you don't need to use regular expressions at all:
In [7]: mystring = "Bacon, [eggs], and spam"
In [8]: mystring[ mystring.find("[")+1 : mystring.find("]") ]
Out[8]: 'eggs'
If there's more than one of these per line, then you'll need to modify Jarrod's regex ^.*\['(.*)'\].*$ to match multiple times per line, and to be non greedy. (Use the .*? quantifier instead of the .* quantifier.)
In [15]: mystring = "[Bacon], [eggs], and [spam]."
In [16]: re.findall(r"\[(.*?)\]",mystring)
Out[16]: ['Bacon', 'eggs', 'spam']
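To see why the non-greedy quantifier matters here, the sketch below runs three variants on the same string. The negated character class [^\]]* is an alternative not mentioned in the answers; it gives the same result as .*? on this input because it can never run past a closing bracket.

```python
import re

mystring = "[Bacon], [eggs], and [spam]."

# Greedy: .* runs to the last ']' on the line, giving one long match.
greedy = re.findall(r"\[(.*)\]", mystring)

# Non-greedy: .*? stops at the first ']' after each '['.
lazy = re.findall(r"\[(.*?)\]", mystring)

# Negated class: match anything except ']' between the brackets.
negated = re.findall(r"\[([^\]]*)\]", mystring)

print(greedy)
print(lazy)
print(negated)
```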

Python regular expression to match either a quoted or unquoted string

I am trying to write a regular expression in Python that will match either a quoted string with spaces or an unquoted string without spaces. For example given the string term:foo the result would be foo and given the string term:"foo bar" the result would be foo bar. So far I've come up with the following regular expression:
r = re.compile(r'''term:([^ "]+)|term:"([^"]+)"''')
The problem is that the match can come in either group(1) or group(2) so I have to do something like this:
m = r.match(search_string)
term = m.group(1) or m.group(2)
Is there a way I can do this all in one step?
Avoid grouping, and instead use lookahead/lookbehind assertions to eliminate the parts that are not needed:
s = 'term:foo term:"foo bar" term:bar foo term:"foo term:'
re.findall(r'(?<=term:)[^" ]+|(?<=term:")[^"]+(?=")', s)
Gives:
['foo', 'foo bar', 'bar']
It doesn't seem that you really want re.match here. Your regex is almost right, but you're grouping too much. How about this?
>>> s
('xyz term:abc 123 foo', 'foo term:"abc 123 "foo')
>>> re.findall(r'term:([^ "]+|"[^"]+")', '\n'.join(s))
['abc', '"abc 123 "']
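If the two capturing groups are kept, one possible way to read the result in a single step is the match object's lastindex attribute, which holds the index of the last group that took part in the match. This is a sketch rather than either answerer's method: extract_term is a hypothetical helper, and it uses search instead of match so the term can appear anywhere in the string.

```python
import re

# Quoted alternative first, then the unquoted one; exactly one group participates.
term_re = re.compile(r'term:(?:"([^"]+)"|([^ "]+))')

def extract_term(search_string):
    """Return the term without quotes, or None if no term is present."""
    m = term_re.search(search_string)
    if m is None:
        return None
    # lastindex is 1 for the quoted branch and 2 for the unquoted branch.
    return m.group(m.lastindex)

print(extract_term('term:foo'))
print(extract_term('term:"foo bar"'))
```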
