Putting words in parenthesis with regex python - python

How can I put the words that follow another word in brackets / parentheses in Python?
For 2 words it looks like:
>>> p=re.compile(r"foo\s(\w+)\s(\w+)")
>>> p.sub( r"[\1] [\2]", "foo bar baz")
'[bar] [baz]'
I want this for an arbitrary number of words. I came up with this, but it doesn't seem to work:
>>> p=re.compile(r"foo(\s(\w+))*")
>>> p.sub( r"[\2] [\2] [\2]", "foo bar baz bax")
'[bax] [bax] [bax]'
The desired result in this case would be
'[bar] [baz] [bax]'

You may use a solution like
import re
p = re.compile(r"(foo\s+)([\w\s]+)")
r = re.compile(r"\w+")
s = "foo bar baz"
print( p.sub( lambda x: "{}{}".format(x.group(1), r.sub(r"[\g<0>]", x.group(2))), s) )
The first pattern, (foo\s+)([\w\s]+), matches foo followed by one or more whitespace characters and captures that into Group 1, then captures one or more word and whitespace characters into Group 2.
Then, inside re.sub, the replacement argument is a lambda expression that wraps each run of word characters in square brackets using the second, simpler \w+ regex (this preserves the original whitespace between the words; otherwise it could be done without a regex).
Note that [\g<0>] replacement pattern inserts [, the whole match value (\g<0>) and then ].
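As a rough sketch of why the attempt in the question fails and how the nested substitution can be adapted: a repeated capturing group only keeps its last iteration, so the words have to be processed in a second step. The helper name bracket_after is made up for illustration, and this variant drops the keyword itself to match the output the question asks for.

```python
import re

# A repeated group only remembers its final iteration,
# which is why [\2] [\2] [\2] printed 'bax' three times.
m = re.match(r"foo(\s(\w+))*", "foo bar baz bax")
print(m.group(2))  # bax

def bracket_after(keyword, text):
    """Wrap every word following `keyword` in square brackets (hypothetical helper)."""
    outer = re.compile(r"\b" + re.escape(keyword) + r"\s+([\w\s]+)")
    # The inner re.sub wraps each word; \g<0> is the whole inner match.
    return outer.sub(lambda mo: re.sub(r"\w+", r"[\g<0>]", mo.group(1)), text)

print(bracket_after("foo", "foo bar baz bax"))  # [bar] [baz] [bax]
```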

I suggest the following simple solution:
import re
s = "foo bar baz bu bi porte"
p = re.compile(r"foo\s([\w\s]+)")
p = p.match(s)
# Here: p.group(1) is "bar baz bu bi porte"
# p.group(1).split() is ['bar', 'baz', 'bu', 'bi', 'porte']
print(' '.join([f'[{i}]' for i in p.group(1).split()])) # for Python 3.6+ (due to f-strings)
# [bar] [baz] [bu] [bi] [porte]
print(' '.join(['[' + i + ']' for i in p.group(1).split()])) # for other Python versions
# [bar] [baz] [bu] [bi] [porte]

Related

How to get the text that is separated by a comma after a keyword

I have a string that has multiple patterns like
s1 = "foo, bar"
s1 = "x, y"
s2 = "hello, hi"
s3 = "bar, foo."
I'm wondering how I can get the strings that are separated by a comma (insert random text here).
So from this example, I want to get strings ["foo","bar"] and ["x","y"] when I'm looking for "s1", and "hello" & "hi" when I look for s2, etc.
Thanks!
EDIT:
Let's assume using .split(',') is impractical due to a large number of commas outside this specific pattern I listed
The question was edited, but for the original string:
"s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
You could use a pattern to match the specific part and then use re.split to split on a comma and optional space.
\bs1: ?(\w+(?:, ?\w+)*)
Explanation
\bs1: ? Match s1: and optional space
( Capture group 1
\w+(?:, ?\w+)* Match 1+ word chars, optionally repeat comma, optional space and 1+ word chars
) Close group 1
Example code (python 3)
import re
s = "s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
def findByPrefix(prefix, s):
    pattern = rf"\b{re.escape(prefix)}: ?(\w+(?:, ?\w+)*)"
    res = []
    for m in re.findall(pattern, s):
        res.append(re.split(", ?", m))
    return res
print(findByPrefix("s1", s))
Output
[['foo', 'bar'], ['x', 'y']]
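If you need the lists for every prefix and not just one, the same pattern can be generalized by capturing the prefix as well. This is only a sketch: group_by_prefix is a made-up name, and it assumes every prefix is a run of word characters followed by a colon.

```python
import re
from collections import defaultdict

def group_by_prefix(s):
    """Collect the comma-separated items found after each 'prefix:' (hypothetical helper)."""
    # Group 1 is the prefix, group 2 the comma-separated words after it.
    pattern = re.compile(r"\b(\w+): ?(\w+(?:, ?\w+)*)")
    result = defaultdict(list)
    for prefix, items in pattern.findall(s):
        result[prefix].append(re.split(r", ?", items))
    return dict(result)

s = "s1: foo, bar s1:x,y s2:hello, hi s3: bar, foo."
print(group_by_prefix(s))
```

Looking up group_by_prefix(s)["s1"] then gives the same lists as findByPrefix("s1", s).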
You can use:
my_string.split(',')
It should return a list of every element.
Here is how you can use the re module to split a string by a given delimiter:
import re
re.split(", ", my_string)

What is word boundary while using regex in python

What is a word boundary in a Python regex? Can someone please explain this on these examples:
Example 1
>>> x = '456one two three123'
>>> y=re.search(r"\btwo\b",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>
Example 2
>>> y=re.search(r"two",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>
Example 3
>>> ip="192.168.254.1234"
>>> if re.search(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",ip):
...     print ip
...
Example 4
>>> ip="192.168.254.1234"
>>> if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",ip):
...     print ip
192.168.254.1234
"word boundary" means exactly what it says: the boundary of a word, i.e. either the beginning or the end.
It does not match any actual character in the input, but it will only match if the current match position is at the beginning or end of the word.
This is important because, unlike if you just matched whitespace, it will also match at the beginning or end of the entire input.
So '\bfoo' will match 'foobar' and 'foo bar' and 'bar foo', but not 'barfoo'.
'foo\b' will match 'foo bar' and 'bar foo' and 'barfoo', but not 'foobar'.
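The four cases above can be checked directly. This is just a small sketch; the results dictionary is a name chosen here for illustration.

```python
import re

samples = ["foobar", "foo bar", "bar foo", "barfoo"]

# For each sample: (does r"\bfoo" match?, does r"foo\b" match?)
results = {
    text: (bool(re.search(r"\bfoo", text)), bool(re.search(r"foo\b", text)))
    for text in samples
}

for text, (starts_word, ends_word) in results.items():
    print(f"{text!r}: \\bfoo={starts_word}, foo\\b={ends_word}")
```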
Try this:
ip="192.168.254.1234"
res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",ip)
print(res)
Notice how I correctly escaped the dots.
The ip is found because the regex doesn't care what comes after the last 1-3 digits.
Now:
ip="192.168.254.1234"
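res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",ip)
print(res)
This finds nothing, because the last 1-3 digits do NOT end at a word boundary: a fourth digit follows them. (Note the raw string: in a normal string literal, \b is a backspace character, not a word boundary.)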
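A short sketch to make both points concrete: the effect of \b on each end of the pattern, and why the pattern should be a raw string. The second sample string is made up for illustration.

```python
import re

# Raw string, so \b reaches the regex engine as a word boundary.
pattern = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

too_long = pattern.findall("192.168.254.1234")
valid = pattern.findall("host 192.168.254.123 is up")
print(too_long)  # []
print(valid)     # ['192.168.254.123']

# In a normal string literal "\b" is a single backspace character;
# in a raw string it stays as backslash + 'b'.
print(len("\b"), len(r"\b"))  # 1 2
```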

Python regex, searching for prefixes inside a target string

I need to find a list of prefixes of words inside a target string (I would like to have the list of matching indexes in the target string handled as an array).
I think using regex should be the cleanest way.
Given that I am looking for the pattern "foo", I would like to retrieve in the target string words like "foo", "Foo", "fooing", "Fooing"
Given that I am looking for the pattern "foo bar", I would like to retrieve in the target string patterns like "foo bar", "Foo bar", "foo Bar", "foo baring" (they are still all handled as prefixes, am I right?)
At the moment, after running it in different scenarios, my Python code still does not work.
I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix).
I am assuming I have to use something like ^[fF] to be case insensitive with the first letter of my prefix.
I am assuming I should use something like ".*" to let the regexp behave like a prefix.
I am assuming I should use something like prefix1|prefix2|prefix3 to combine many different prefixes with a logical OR in the search pattern.
The following source code does not work because I am wrongly setting the txt_pattern.
import re
# ' ' ' ' ' ' '
txt_str = "edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo";
txt_pattern = ''#???
out_obj = re.match(txt_pattern,txt_str)
if out_obj:
    print("match!")
else:
    print("No match!")
What am I missing?
How should I set the txt_pattern?
Can you please suggest a good tutorial with minimal working examples? At the moment the standard tutorials from the first page of a Google search are very long and detailed, and not so simple to understand.
Thanks!
Regex is the wrong approach. First parse your string into a list of strings with one word per item. Then use a list comprehension with a filter. The split method on strings is a good way to get the list of words, then you can simply do [item for item in wordlist if item.startswith("foo")]
People spend ages hacking up inefficient code using convoluted regexes when all they need is a few string methods like split, partition, startswith and some pythonic list comprehensions or generators.
Regexes have their uses but simple string parsing is not one of them.
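A minimal sketch of the split-and-filter approach, assuming a case-insensitive match is wanted and that the index of the word in the list is good enough. It only handles single-word prefixes such as "foo", not "foo bar".

```python
txt_str = ("edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk "
           "foo baring sh foo Bar djw Foo")

words = txt_str.split()

# Keep (position in the word list, word) for every word starting with the prefix.
matches = [(i, w) for i, w in enumerate(words) if w.lower().startswith("foo")]
print(matches)
```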
I am assuming I have to use ^ to match the beginning of a word in a target string (i.e. a prefix).
No, the ^ is an anchor that only matches the start of the string. You can use \b instead, meaning a word boundary (but remember to escape the backslash inside a string literal, or use a raw string literal).
You will also have to use re.search instead of re.match because the latter only checks the start of the string, whereas the former searches for matches anywhere in the string.
>>> s = 'Foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk foo baring sh foo Bar djw Foo'
>>> regex = r'(?i)((foo)(\w+)?)'
>>> compiled = re.compile(regex)
>>> re.findall(compiled, s)
[('Foooooo', 'Foo', 'oooo'), ('Fooooooo', 'Foo', 'ooooo'), ('fooing', 'foo', 'ing'), ('Fooing', 'Foo', 'ing'), ('foo', 'foo', ''), ('foo', 'foo', ''), ('Foo', 'Foo', '')]
(?i) -> case insensitive (must come first in the pattern; Python 3.11+ rejects it anywhere else)
(...) -> group 1, the outer parentheses, matches the whole word
(foo) -> group 2 matches foo
(\w+)? -> group 3 matches any remaining word characters, if there are any
>>> print([i[0] for i in re.findall(compiled, s)])
['Foooooo', 'Fooooooo', 'fooing', 'Fooing', 'foo', 'foo', 'Foo']
Try using this tool to test some stuff: http://www.pythonregex.com/
Use this reference: docs.python.org/howto/regex.html
I would use something like this for your regex:
\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*
Inside the non-capturing group, separate each prefix with a |. I also placed each prefix inside its own capturing group so you can tell which prefix a given string matched, for example:
for match in re.finditer(r'\b(?:([Ff]oo [Bb]ar)|([Ff]oo))\w*', txt_str):
    n = 1
    while not match.group(n):
        n += 1
    print("Prefix %d matched '%s'" % (n, match.group(0)))
Output:
Prefix 2 matched 'foooooo'
Prefix 2 matched 'Fooooooo'
Prefix 2 matched 'fooing'
Prefix 2 matched 'Fooing'
Prefix 1 matched 'foo baring'
Prefix 1 matched 'foo Bar'
Prefix 2 matched 'Foo'
Make sure you put longer prefixes first. If you were to put the foo prefix before the foo bar prefix, you would only match 'foo' in 'foo bar'.
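Since the question also asks for the indexes of the matches, here is a hedged sketch that builds the alternation from a list of prefixes, sorts the longer ones first, and returns the start offset of each match. The function name find_prefix_matches is made up, and re.IGNORECASE is used here in place of the [Ff] character classes, so it ignores case on every letter and not only the first.

```python
import re

def find_prefix_matches(prefixes, text):
    """Return (start index, matched text) for words starting with any prefix."""
    # Longer prefixes first, so "foo bar" is tried before "foo".
    ordered = sorted(prefixes, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    pattern = re.compile(r"\b(?:%s)\w*" % alternation, re.IGNORECASE)
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]

txt_str = ("edb foooooo jkds Fooooooo kj fooing jdcnj Fooing ujndn ggng sxk "
           "foo baring sh foo Bar djw Foo")

found = find_prefix_matches(["foo", "foo bar"], txt_str)
for start, matched in found:
    print(start, matched)
```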

Python regular expression to match either a quoted or unquoted string

I am trying to write a regular expression in Python that will match either a quoted string with spaces or an unquoted string without spaces. For example given the string term:foo the result would be foo and given the string term:"foo bar" the result would be foo bar. So far I've come up with the following regular expression:
r = re.compile(r'''term:([^ "]+)|term:"([^"]+)"''')
The problem is that the match can come in either group(1) or group(2) so I have to do something like this:
m = r.match(search_string)
term = m.group(1) or m.group(2)
Is there a way I can do this all in one step?
Avoid grouping, and instead use lookahead/lookbehind assertions to eliminate the parts that are not needed:
s = 'term:foo term:"foo bar" term:bar foo term:"foo term:'
re.findall(r'(?<=term:)[^" ]+|(?<=term:")[^"]+(?=")', s)
Gives:
['foo', 'foo bar', 'bar']
It doesn't seem that you really want re.match here. Your regex is almost right, but you're grouping too much. How about this?
>>> s
('xyz term:abc 123 foo', 'foo term:"abc 123 "foo')
>>> re.findall(r'term:([^ "]+|"[^"]+")', '\n'.join(s))
['abc', '"abc 123 "']
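Another option, if you would rather keep two groups but still read the value in one step, is to ask the match object which group actually took part. This is a sketch that relies on Match.lastindex; extract_term is a name made up for illustration.

```python
import re

# Either the quoted alternative (group 1) or the unquoted one (group 2) matches.
TERM = re.compile(r'term:(?:"([^"]+)"|([^ "]+))')

def extract_term(search_string):
    """Return the term without quotes, or None if there is no term."""
    m = TERM.search(search_string)
    # lastindex is the number of the last group that participated in the match.
    return m.group(m.lastindex) if m else None

print(extract_term('term:foo'))        # foo
print(extract_term('term:"foo bar"'))  # foo bar
```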

Is there a single Python regex that can change all "foo" to "bar" on lines starting with "#"?

Is it possible to write a single Python regular expression that can be applied to a multi-line string and change all occurrences of "foo" to "bar", but only on lines beginning with "#"?
I was able to get this working in Perl, using Perl's \G regular expression sigil, which matches the end of the previous match. However, Python doesn't appear to support this.
Here's the Perl solution, in case it helps:
my $x =<<EOF;
# foo
foo
# foo foo
EOF
$x =~ s{
( # begin capture
(?:\G|^\#) # last match or start of string plus hash
.*? # followed by anything, non-greedily
) # end capture
foo
}
{$1bar}xmg;
print $x;
The proper output, of course, is:
# bar
foo
# bar bar
Can this be done in Python?
Edit: Yes, I know that it's possible to split the string into individual lines and test each line and then decide whether to apply the transformation, but please take my word that doing so would be non-trivial in this case. I really do need to do it with a single regular expression.
lines = mystring.split('\n')
for i, line in enumerate(lines):
    if line.startswith('#'):
        lines[i] = line.replace('foo', 'bar')  # write back; rebinding line alone changes nothing
mystring = '\n'.join(lines)
No need for a regex.
It looked pretty easy to do with a regular expression:
>>> import re
>>> text = """line 1
... line 2
... Barney Rubble Cutherbert Dribble and foo
... line 4
... # Flobalob, bing, bong, foo and brian
... line 6"""
>>> regexp = re.compile('^(#.+)foo', re.MULTILINE)
>>> print re.sub(regexp, '\g<1>bar', text)
line 1
line 2
Barney Rubble Cutherbert Dribble and foo
line 4
# Flobalob, bing, bong, bar and brian
line 6
But then trying your example text is not so good:
>>> text = """# foo
... foo
... # foo foo"""
>>> regexp = re.compile('^(#.+)foo', re.MULTILINE)
>>> print re.sub(regexp, '\g<1>bar', text)
# bar
foo
# foo bar
So, try this:
>>> regexp = re.compile('(^#|\g.+)foo', re.MULTILINE)
>>> print re.sub(regexp, '\g<1>bar', text)
# foo
foo
# foo foo
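That did not work: the text comes back unchanged, and I can't find \g as a pattern escape in the documentation! (On Python 3.6+, an unknown escape such as \g inside a pattern raises re.error.)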
Moral: don't try to code after a couple of beers.
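One way to stay within a single re.sub call without \G is to match each whole comment line and let a replacement function do the inner substitution. This is a sketch of that idea, not a pure-pattern solution; whether a callback counts as "a single regular expression" depends on the constraints in the question.

```python
import re

text = "# foo\nfoo\n# foo foo"

# ^#.*$ with re.MULTILINE matches every line that starts with '#'.
# The lambda receives each such line and replaces all its 'foo's.
result = re.sub(
    r"^#.*$",
    lambda m: m.group(0).replace("foo", "bar"),
    text,
    flags=re.MULTILINE,
)
print(result)
```

Lines that do not start with # are never matched, so they pass through untouched.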
\g is in the Python docs, but only as backreference syntax for replacement strings (as in \g<1>bar above). It is not the equivalent of Perl's \G anchor, which the re module does not support.
"In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE."
