Normally we would write the following to replace every match with the same text:
namesRegex = re.compile(r'(is)|(life)', re.I)
replaced = namesRegex.sub(r"butter", "There is no life in the void.")
print(replaced)
output:
There butter no butter in the void.
What I want is to replace each group with a specific text, probably using back references. Namely, I want to replace the first group (is) with "are" and the second group (life) with "butterflies".
Maybe something like the following, but this is not working code:
namesRegex = re.compile(r'(is)|(life)', re.I)
replaced = namesRegex.sub(r"(are) (butterflies)", r"\1 \2", "There is no life in the void.")
print(replaced)
Is there a way to replace multiple groups in one statement in python?
You can use a lambda as the replacement, mapping each keyword to its substitute:
>>> re.sub(r'(is)|(life)', lambda x: {'is': 'are', 'life': 'butterflies'}[x.group(0)], "There is no life in the void.")
'There are no butterflies in the void.'
You can define a mapping of keys to replacements first and then use a lambda function as the replacement:
>>> repl = {'is': 'are', 'life': 'butterflies'}
>>> print(re.sub(r'is|life', lambda m: repl[m.group()], "There is no life in the void."))
There are no butterflies in the void.
I would also suggest adding word boundaries around your keys to safeguard the search patterns:
>>> print(re.sub(r'\b(?:is|life)\b', lambda m: repl[m.group()], "There is no life in the void."))
There are no butterflies in the void.
You may use a dictionary with search-replacement values and use a simple \w+ regex to match words:
import re
dt = {'is' : 'are', 'life' : 'butterflies'}
namesRegex = re.compile(r'\w+')
replaced = namesRegex.sub(lambda m: dt[m.group()] if m.group() in dt else m.group(), "There is no life in the void.")
print(replaced)
With this approach, you do not have to worry about building an overly large alternation-based pattern. You may adjust the pattern to include word boundaries, or to match only letters (e.g. [^\W\d_]+), etc., as the requirements dictate. The main point is that the pattern should match all the search terms that are keys in the dictionary.
The if m.group() in dt else m.group() part checks whether the found match is present as a key in the dictionary; if it is not, the match itself is returned unchanged, otherwise the value from the dictionary is returned.
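The same conditional can be written more compactly with dict.get, which falls back to the match itself when the word is not a key (a small sketch of the same approach):

```python
import re

dt = {'is': 'are', 'life': 'butterflies'}
namesRegex = re.compile(r'\w+')

# dict.get(key, default) returns the match unchanged for unknown words
replaced = namesRegex.sub(lambda m: dt.get(m.group(), m.group()),
                          "There is no life in the void.")
print(replaced)  # There are no butterflies in the void.
```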
If you just want to replace specific words, look no further than str.replace():
s = "There is no life in the void."
s.replace('is', 'are').replace('life', 'butterflies') # => 'There are no butterflies in the void.'
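One caveat worth keeping in mind: str.replace knows nothing about word boundaries, so it also rewrites substrings inside longer words, which is what the \b-based regex answers above guard against:

```python
s = "This island is nice."
# 'is' inside 'This' and 'island' is replaced as well
print(s.replace('is', 'are'))  # Thare areland are nice.
```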
Related
I'm learning about regular expressions. I don't know how to combine different regular expressions into a single generic regular expression.
I want to write a single regular expression that works for multiple cases. I know this can be done with the naive approach of using the or "|" operator.
I don't like this approach. Can anybody suggest a better one?
You can join your patterns with alternation and compile the result once, outside the loop. Check this example:
import re

re1 = r'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*'
re2 = r'\d*[/]\d*[A-Z]*\d*\s[A-Z]*\d*[A-Z]*'
re3 = r'[A-Z]*\d+[/]\d+[A-Z]\d+'
re4 = r'\d+[/]\d+[A-Z]*\d+\s\d+[A-Z]\s[A-Z]*'

generic_re = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4))

sentences = [string1, string2, string3, string4]  # your input strings
for sentence in sentences:
    matches = generic_re.findall(sentence)
To findall with an arbitrary series of REs, all you have to do is concatenate the lists of matches which each returns:
re_list = [
    r'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*',  # re1 in question
    ...
    r'\d+[/]\d+[A-Z]*\d+\s\d+[A-Z]\s[A-Z]*',  # re4 in question
]
matches = []
for r in re_list:
    matches += re.findall(r, string)
For efficiency it would be better to use a list of compiled REs.
Alternatively you could join the element RE strings using
generic_re = re.compile( '|'.join( re_list) )
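A minimal sketch of the compiled-list variant; the two patterns below are shortened stand-ins for illustration, not the exact patterns from the question:

```python
import re

# hypothetical stand-ins for re1..re4 from the question
re_list = [r'\d+\.\d+L-\d+', r'\d+/\d+[A-Z]+\d+']
compiled = [re.compile(r) for r in re_list]  # compile once, reuse for every sentence

def find_all(text):
    matches = []
    for rx in compiled:
        matches += rx.findall(text)
    return matches

print(find_all("code 12.5L-300 and 10/20AB30"))  # ['12.5L-300', '10/20AB30']
```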
I see lots of people using pipes, but alternation only captures whichever alternative matches at each position. If you want to capture several named pieces in one match, regardless of their order, try using lookaheads.
Example:
>>> fruit_string = "10a11p"
>>> fruit_regex = r'(?=.*?(?P<pears>\d+)p)(?=.*?(?P<apples>\d+)a)'
>>> re.match(fruit_regex, fruit_string).groupdict()
{'apples': '10', 'pears': '11'}
>>> re.match(fruit_regex, fruit_string).group(0)
''
>>> re.match(fruit_regex, fruit_string).group(1)
'11'
Note that group(0) is the empty string: the lookaheads consume no characters, so the overall match is empty.
(?= ...) is a look ahead:
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
.*?(?P<pears>\d+)p
find a number followed a p anywhere in the string and name the number "pears"
You might not need to compile both regex patterns separately; you can simply concatenate the pattern strings. Let's see if this works for you:
>>> import re
>>> text = 'aaabaaaabbb'
>>> A = 'aaa'
>>> B = 'bbb'
>>> re.findall(A+B, text)
['aaabbb']
If you need to squash multiple regex patterns together, the result can be annoying to parse, unless you use (?P<name>...) groups and .groupdict(), but doing that can be pretty verbose and hacky. If you only need a couple of matches, then something like the following is mostly safe:
bucket_name, blob_path = tuple(item for item in matches.groups() if item is not None)
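For comparison, a sketch of the named-group route mentioned above; the pattern and group names here are made up for illustration:

```python
import re

# hypothetical pattern with two named groups
m = re.search(r'(?P<bucket>gs://[\w-]+)/(?P<blob>.+)',
              'gs://my-bucket/path/to/file.txt')
print(m.groupdict())  # {'bucket': 'gs://my-bucket', 'blob': 'path/to/file.txt'}
```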
I want to find out if a string contains the word "randomize". This word may exist inside and outside of brackets in the string, but I am only interested in whether the word exists INSIDE the brackets.
mystring = "You said {single order='randomize'} that P.E is...Why?"
I understand that I have to use regex for this, but my attempts have failed thus far.
Essentially I want to say:
look for "randomize" and check if its in brackets.
Thanks
You could use some negated classes:
>>> import re
>>> mystring = "You said {single order='randomize'} that P.E is...Why?"
>>> if mystring.find("randomize") != -1:
... if re.search(r'{[^{}]*randomize[^{}]*}', mystring):
... print("'randomize' present within braces")
... else:
... print("'randomize' present but not within braces")
... else:
... print("'randomize' absent")
# => 'randomize' present within braces
This is the kind of thing that's very difficult for regex to do. You see, if you try something like re.search(r"{.*?randomize.*?}", text), you can match something like "Hello there, I'm going to {break} your randomize regex {foobar}" and it will return "{break} your randomize regex {foobar}". You can probably pull this off with lookahead and lookbehind assertions, but not without telling us whether the brackets can be nested, since this will then fail on "I'm going to break you {now with randomize {nested} brackets}"
As per your update that the brackets will never be nested, this regex should match:
re.search("{[^}]*?randomize.*?}", mystring)
And you can access the matched text using .group(0). Put it all together to do something like:
for mystring in group_of_strings_to_test:
    match = re.search("{[^}]*?randomize.*?}", mystring)
    if match:
        pass  # it has "randomize" in a bracket; the text is match.group(0)
    else:
        pass  # it doesn't
To assure you're not inside nested {}'s it could be
{[^{}]*randomize[^{}]*}
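A quick check of that pattern against both cases (sketch):

```python
import re

pat = re.compile(r'{[^{}]*randomize[^{}]*}')
# matches when 'randomize' sits inside a single pair of braces
print(bool(pat.search("You said {single order='randomize'} that")))  # True
# does not falsely match across two separate brace pairs
print(bool(pat.search("{FOO} randomize {BAR}")))  # False
```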
The naive simple method:
>>> import re
>>> mystring = "You said {single order='randomize'} that P.E is...Why?"
>>> print(re.search('{.*randomize.*}', mystring).group(0))
{single order='randomize'}
Once we have this, we can improve it bit by bit. For instance, this pattern is greedy, which means:
>>> print(re.search('{.*randomize.*}', "{FOO {randomize} BAR}").group(0))
{FOO {randomize} BAR}
You'll probably want it to be non-greedy, so you should use '.*?' instead:
>>> print(re.search('{.*?randomize.*?}', mystring).group(0))
{single order='randomize'}
Also, it will still match across two separate pairs of brackets:
>>> print(re.search('{.*?randomize.*?}', "{FOO} randomize {BAR}").group(0))
{FOO} randomize {BAR}
If you want to handle that, you may want to match all characters except other brackets:
>>> print(re.search('{[^}]*randomize[^{]*}', mystring).group(0))
{single order='randomize'}
The goal is to prefix and suffix all occurrences of a substring (case-insensitive) in a source string. I basically need to figure out how to get from source_str to target_str.
source_str = 'You ARe probably familiaR with wildcard'
target_str = 'You [b]AR[/b]e probably famili[b]aR[/b] with wildc[b]ar[/b]d'
In this example, I am finding all occurrences of 'ar' (case insensitive) and replacing each occurrence with itself (i.e. AR, aR and ar respectively), with a prefix ([b]) and suffix ([/b]).
>>> import re
>>> source_str = 'You ARe probably familiaR with wildcard'
>>> re.sub(r"(ar)", r"[b]\1[/b]", source_str, flags=re.IGNORECASE)
'You [b]AR[/b]e probably famili[b]aR[/b] with wildc[b]ar[/b]d'
Something like
import re
ar_re = re.compile("(ar)", re.I)
print(ar_re.sub(r"[b]\1[/b]", "You ARe probably familiaR with wildcard"))
perhaps?
I am close, but I am not sure what to do with the resulting match object. If I do
p = re.search('[/#.* /]', str)
I'll get any words that start with # and end up with a space. This is what I want. However, this returns a Match object that I don't know what to do with. What's the most computationally efficient way of finding and returning a string which is prefixed with a #?
For example,
"Hi there #guy"
After doing the proper calculations, I would be returned
guy
The following regular expression does what you need:
import re
s = "Hi there #guy"
p = re.search(r'#(\w+)', s)
print(p.group(1))  # guy
It will also work for the following string formats:
s = "Hi there #guy " # notice the trailing space
s = "Hi there #guy," # notice the trailing comma
s = "Hi there #guy and" # notice the next word
s = "Hi there #guy22" # notice the trailing numbers
s = "Hi there #22guy" # notice the leading numbers
That regex does not do what you think it does.
s = "Hi there #guy"
p = re.search(r'#([^ ]+)', s)  # this is the regex you described
print(p.group(1))  # first thing matched inside of ( .. )
But as usual with regexes, there are tons of examples that break this; for example, if the text is s = "Hi there #guy, what's with the comma?" the result would be guy,.
So you really need to think about every possible thing you want and don't want to match. r'#([a-zA-Z]+)' might be a good starting point, it literally only matches letters (a .. z, no unicode etc).
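For instance, with that letters-only pattern, the trailing comma from the problem case above is excluded:

```python
import re

s = "Hi there #guy, what's with the comma?"
m = re.search(r'#([a-zA-Z]+)', s)
print(m.group(1))  # guy
```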
p.group(0) returns the whole match (including the #), while p.group(1) returns just guy. If you want to find out what methods an object has, you can use dir(p) to find out. This will return a list of attributes and methods that are available for that object instance.
As it's evident from the answers so far regex is the most efficient solution for your problem. Answers differ slightly regarding what you allow to be followed by the #:
[^ ] anything but space
\w in Python 2 is equivalent to [A-Za-z0-9_]; in Python 3 it matches Unicode word characters by default
If you have a better idea of what characters might be included in the user name, you might adjust your regex to reflect that; e.g., only lower case ASCII letters would be:
[a-z]
NB: I skipped quantifiers for simplicity.
(?<=#)\w+
will match a word if it's preceded by a # (without adding it to the match, a so-called positive lookbehind). This will match "words" that are composed of letters, numbers, and/or underscore; if you don't want those, use (?<=#)[^\W\d_]+
In Python:
>>> strg = "Hi there #guy!"
>>> p = re.search(r'(?<=#)\w+', strg)
>>> p.group()
'guy'
You say: """If I do p = re.search('[/#.* /]', str) I'll get any words that start with # and end up with a space.""" But this is incorrect: that pattern is a character class which will match ONE character from the set # / . * and space. Note: there's a redundant second / in the pattern.
For example:
>>> re.findall('[/#.* /]', 'xxx#foo x/x.x*x xxxx')
['#', ' ', '/', '.', '*', ' ']
>>>
You say that you want "guy" returned from "Hi there #guy", but that conflicts with "and end up with a space".
Please edit your question to include what you really want/need to match.
How can I extract the longest of groups which start the same way
For example, from a given string, I want to extract the longest match to either CS or CSI.
I tried "(CS|CSI).*" and it will return CS rather than CSI even if CSI is available.
If I do "(CSI|CS).*" then I do get CSI if it's a match, so I guess the solution is to always place the shorter of the overlapping groups after the longer one.
Is there a clearer way to express this with REs? Somehow it feels confusing that the result depends on the order in which you list the groups.
No, that's just how it works, at least in Perl-derived regex flavors like Python, JavaScript, .NET, etc.
http://www.regular-expressions.info/alternation.html
As Alan says, the patterns will be matched in the order you specified them.
If you want to match on the longest of overlapping literal strings, you need the longest one to appear first. But you can organize your strings longest-to-shortest automatically, if you like:
>>> '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
'miami|vice|csi|cs'
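Putting that together, a small sketch (re.escape guards against metacharacters in the literals):

```python
import re

words = 'cs csi miami vice'.split()
# longest alternatives first, so 'csi' wins over 'cs'
pattern = '|'.join(sorted(map(re.escape, words), key=len, reverse=True))
print(re.search(pattern, 'watching csi tonight').group(0))  # csi
```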
Intrigued to know the right way of doing this; if it helps any, you can always build up your regex like:
import re
string_to_look_in = "AUHDASOHDCSIAAOSLINDASOI"
string_to_match = "CSIABC"
# try the longest prefix first, then progressively shorter ones
re_to_use = "(" + "|".join(string_to_match[0:i] for i in range(len(string_to_match), 0, -1)) + ")"
re_result = re.search(re_to_use, string_to_look_in)
print(string_to_look_in[re_result.start():re_result.end()])
Similar functionality is present in the vim editor ("sequence of optionally matched atoms"), where e.g. col\%[umn] matches col in color, colum in columbus and the full column.
I am not aware of equivalent functionality in Python's re module, but
you can use nested non-capturing groups, each followed by the ? quantifier, for that:
>>> import re
>>> words = ['color', 'columbus', 'column']
>>> rex = re.compile(r'col(?:u(?:m(?:n)?)?)?')
>>> for w in words: print(rex.findall(w))
['col']
['colum']
['column']
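The nested-optional pattern can also be generated from a required prefix and an optional tail. This helper is a sketch, not a standard-library function:

```python
import re

def optional_suffix(prefix, tail):
    # builds e.g. col(?:u(?:m(?:n)?)?)? from ('col', 'umn')
    pat = ''
    for ch in reversed(tail):
        pat = '(?:%s%s)?' % (re.escape(ch), pat)
    return re.escape(prefix) + pat

rex = re.compile(optional_suffix('col', 'umn'))
for w in ['color', 'columbus', 'column']:
    print(rex.findall(w))  # ['col'], then ['colum'], then ['column']
```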