I am trying to replace some text with something else that depends on the original text in Python 3. For example, say I have "[[procedural programming|procedural programming languages]]"; I need to replace that with the latter text, so just procedural programming languages.
In general, I need a function which takes a string and a function and applies the function to the string and then replaces it. For example, reversing a string could be done like so:
text = "123456 123456 84708467 11235"
new_text = special_replace(text, lambda x: x[::-1])
>>> 654321 654321 84708467 11235
Or the previous example:
text = "[[procedural programming|procedural programming languages]] [meow|woof]"
new_text = special_replace(text, lambda x: x.replace("[[", "").replace("]]","").split("|")[1])
>>> procedural programming languages [meow|woof]
You can create a regular expression and use re.sub with a group reference to replace the matches:
>>> text = "[[procedural programming|procedural programming languages]] [meow|woof]"
>>> p = r"\[\[.*?\|(.*?)\]\]"
>>> re.findall(p, text)
['procedural programming languages']
>>> re.sub(p, r"\1", text)
'procedural programming languages [meow|woof]'
Note that [, |, and ] all have to be escaped. Here (.*?) is a capturing group for the second term, and \1 references that group in the replacement string.
For more complex stuff, like also reversing the group, you can use a callback function:
>>> re.sub(p, lambda m: m.group(1)[::-1], text)
'segaugnal gnimmargorp larudecorp [meow|woof]'
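Tying that back to the special_replace idea from the question, here is a minimal sketch (the function name and the default pattern argument are my assumptions, not a standard API): it simply hands every match to the supplied callable via re.sub.
import re

def special_replace(text, func, pattern=r"\[\[.*?\]\]"):
    # Apply func to every substring matching pattern and splice the
    # result back into the text.
    return re.sub(pattern, lambda m: func(m.group(0)), text)

text = "[[procedural programming|procedural programming languages]] [meow|woof]"
print(special_replace(text, lambda x: x.replace("[[", "").replace("]]", "").split("|")[1]))
# procedural programming languages [meow|woof]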
Here's the function I'm trying to create,
def compile_italics(line):
It's a Markdown to HTML compiler.
If the input line is
compile_italics('*italic* not italic.')
I need to find a way to replace each pair of * with <i> and </i> respectively. The problem is, I don't know how to distinguish between the first and second *.
Ideally, the output should be:
'<i>italic</i> not italic.'
Other examples:
compile_italics('*italic*')
>>> '<i>italic</i>'
compile_italics('this is *italic*!')
>>> 'this is <i>italic</i>!'
compile_italics('*italic*, and *italic*!')
>>> '<i>italic</i>, and <i>italic</i>!'
compile_italics('not *italic')
>>> 'not *italic'
For this scenario, a regular expression is a viable option. That kind of replacement can be achieved by passing a function as the repl argument of re.sub:
import re
def replace(match):
    match1 = match.group(1)
    return f"<i>{match1}</i>"
your_string = "*example* hello *example2*"
result = re.sub(r"\*([^\*]+)\*", replace, your_string)
print(result)
# <i>example</i> hello <i>example2</i>
result = re.sub(r"\*([^\*]+)\*", replace, "*italic*, and not *italic")
print(result)
# <i>italic</i>, and not *italic
The parts of the regular expression:
\* - a star (the opening one)
([^\*]+) - one or more characters that are not a star, captured as a group
\* - another star (the one that closes the pattern)
If you want to handle double stars to support bold as well, handle the r"\*\*([^\*]+)\*\*" pattern first, and only then the r"\*([^\*]+)\*" pattern, as in the sketch below.
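To make the ordering concrete, here is a hedged sketch (compile_bold_and_italics is a made-up helper name, not part of the question's code):
import re

def compile_bold_and_italics(line):
    # Handle **bold** first so the single-star pattern does not
    # consume the double stars.
    line = re.sub(r"\*\*([^\*]+)\*\*", r"<b>\1</b>", line)
    line = re.sub(r"\*([^\*]+)\*", r"<i>\1</i>", line)
    return line

print(compile_bold_and_italics("**bold** and *italic*"))
# <b>bold</b> and <i>italic</i>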
I'm learning about regular expressions. I don't know how to combine different regular expressions to make a single generic regular expression.
I want to write a single regular expression which works for multiple cases. I know this can be done with the naive approach of using the or ("|") operator.
I don't like this approach. Can anybody tell me a better approach?
You can combine all your regex patterns into one compiled pattern. Check this example:
import re
re1 = r'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*'
re2 = r'\d*[/]\d*[A-Z]*\d*\s[A-Z]*\d*[A-Z]*'
re3 = r'[A-Z]*\d+[/]\d+[A-Z]\d+'
re4 = r'\d+[/]\d+[A-Z]*\d+\s\d+[A-Z]\s[A-Z]*'
generic_re = re.compile("(%s|%s|%s|%s)" % (re1, re2, re3, re4))  # compile once, outside the loop
sentences = [string1, string2, string3, string4]
for sentence in sentences:
    matches = generic_re.findall(sentence)
To run findall with an arbitrary series of REs, all you have to do is concatenate the lists of matches that each one returns:
re_list = [
    r'\d+\.\d*[L][-]\d*\s[A-Z]*[/]\d*',  # re1 in the question
    ...
    r'\d+[/]\d+[A-Z]*\d+\s\d+[A-Z]\s[A-Z]*',  # re4 in the question
]
matches = []
for r in re_list:
    matches += re.findall(r, string)
For efficiency it would be better to use a list of compiled REs.
Alternatively you could join the element RE strings using
generic_re = re.compile( '|'.join( re_list) )
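A small self-contained illustration of both approaches, using toy patterns rather than the ones from the question:
import re

re_list = [r"\d+", r"[A-Z]+"]  # placeholder patterns
string = "ABC 123 def"

# per-pattern findall, results concatenated
matches = []
for r in re_list:
    matches += re.findall(r, string)
print(matches)                       # ['123', 'ABC']

# single combined pattern
generic_re = re.compile("|".join(re_list))
print(generic_re.findall(string))    # ['ABC', '123']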
I see lots of people using pipes, but alternation only matches one alternative at a time. If you want every pattern to match against the same string, try using lookaheads.
Example:
>>> fruit_string = "10a11p"
>>> fruit_regex = r'(?=.*?(?P<pears>\d+)p)(?=.*?(?P<apples>\d+)a)'
>>> re.match(fruit_regex, fruit_string).groupdict()
{'apples': '10', 'pears': '11'}
>>> re.match(fruit_regex, fruit_string).group(0)
''
(the overall match is empty because lookaheads don't consume any characters)
>>> re.match(fruit_regex, fruit_string).group(1)
'11'
(?= ...) is a look ahead:
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
.*?(?P<pears>\d+)p
find a number followed by a p anywhere in the string and name that number "pears"
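A quick demonstration of the lookahead behaviour from the docs quote above (the sample string here is made up):
>>> import re
>>> re.findall(r"Isaac (?=Asimov)", "Isaac Asimov and Isaac Newton")
['Isaac ']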
You might not need to compile both regex patterns. Here is a way, let's see if it works for you.
>>> import re
>>> text = 'aaabaaaabbb'
>>> A = 'aaa'
>>> B = 'bbb'
>>> re.findall(A+B, text)
['aaabbb']
>>>
For further reading, see the re module documentation.
If you need to squash multiple regex patterns together, the result can be annoying to parse, unless you use named groups ((?P<name>...)) and .groupdict(), but doing that can be pretty verbose and hacky. If you only need a couple of matches then doing something like the following could be mostly safe:
bucket_name, blob_path = tuple(item for item in matches.groups() if item is not None)
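As a tiny illustration of that trick (the pattern and the input string here are made up), squash two alternatives together and keep whichever groups actually matched:
import re

pattern = r"gs://(?P<bucket>[^/]+)/(?P<blob>.+)|file:///(?P<path>.+)"
matches = re.match(pattern, "gs://my-bucket/some/dir/file.txt")
bucket_name, blob_path = tuple(item for item in matches.groups() if item is not None)
print(bucket_name, blob_path)  # my-bucket some/dir/file.txt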
I'm translating a program from Perl to Python (3.3). I'm fairly new to Python. In Perl, I can do crafty regex substitutions, such as:
$string =~ s/<(\w+)>/$params->{$1}/g;
This will search through $string, and for each group of word characters enclosed in <>, a substitution from the $params hashref will occur, using the regex match as the hash key.
What is the best (Pythonic) way to concisely replicate this behavior? I've come up with something along these lines:
string = re.sub(r'<(\w+)>', (what here?), string)
It might be nice if I could pass a function that maps regex matches to a dict. Is that possible?
Thanks for the help.
You can pass a callable to re.sub to tell it what to do with the match object.
s = re.sub(r'<(\w+)>', lambda m: replacement_dict.get(m.group()), s)
use of dict.get allows you to provide a "fallback" if said word isn't in the replacement dict, i.e.
lambda m: replacement_dict.get(m.group(), m.group())
# fallback to just leaving the word there if we don't have a replacement
I'll note that when using re.sub (and family, e.g. re.split), if you're specifying stuff that exists around your wanted substitution, it's often cleaner to use lookaround expressions so that the text around your match doesn't get subbed out. So in this case I'd write your regex like
r'(?<=<)(\w+)(?=>)'
Otherwise you have to do some splicing out/back in of the brackets in your lambda. To be clear what I'm talking about, an example:
s = "<sometag>this is stuff<othertag>this is other stuff<closetag>"
d = {'othertag': 'blah'}
#this doesn't work because `group` returns the whole match, including non-groups
re.sub(r'<(\w+)>', lambda m: d.get(m.group(), m.group()), s)
Out[23]: '<sometag>this is stuff<othertag>this is other stuff<closetag>'
#this output isn't exactly ideal...
re.sub(r'<(\w+)>', lambda m: d.get(m.group(1), m.group(1)), s)
Out[24]: 'sometagthis is stuffblahthis is other stuffclosetag'
#this works, but is ugly and hard to maintain
re.sub(r'<(\w+)>', lambda m: '<{}>'.format(d.get(m.group(1), m.group(1))), s)
Out[26]: '<sometag>this is stuff<blah>this is other stuff<closetag>'
#lookbehind/lookahead makes this nicer.
re.sub(r'(?<=<)(\w+)(?=>)', lambda m: d.get(m.group(), m.group()), s)
Out[27]: '<sometag>this is stuff<blah>this is other stuff<closetag>'
I have some string X and I wish to remove semicolons, periods, commas, colons, etc, all in one go. Is there a way to do this that doesn't require a big chain of .replace(somechar,"") calls?
You can use the translate method with a first argument of None:
string2 = string1.translate(None, ";.,:")
Alternatively, you can use the filter function:
string2 = filter(lambda x: x not in ";,.:", string1)
Note that both of these options only work for non-Unicode strings and only in Python 2.
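For Python 3, a small sketch of the equivalent using str.maketrans, whose third argument lists the characters to delete (the sample string is just an illustration):
string1 = "Hello; world, again: done."
string2 = string1.translate(str.maketrans("", "", ";.,:"))
print(string2)  # Hello world again done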
You can use re.sub to pattern match and replace. The following replaces only the characters h and i with empty strings:
In [1]: s = 'byehibyehbyei'
In [2]: re.sub('[hi]', '', s)
Out[2]: 'byebyebye'
Don't forget to import re.
>>> import re
>>> foo = "asdf;:,*_-"
>>> re.sub('[;:,*_-]', '', foo)
'asdf'
[;:,*_-] - List of characters to be matched
'' - Replace match with nothing
Using the string foo.
For more information take a look at the re.sub(pattern, repl, string, count=0, flags=0) documentation.
Don't know about the speed, but here's another example without using re.
commas_and_stuff = ",+;:"
words = "words; and stuff!!!!"
cleaned_words = "".join(c for c in words if c not in commas_and_stuff)
Gives you:
'words and stuff!!!!'
I have a dynamic regexp and I don't know in advance how many groups it has.
I would like to replace all matches with XML tags.
For example:
re.sub("(this).*(string)", '<markup>\anygroup</markup>', "this is my string")
>> "<markup>this</markup> is my <markup>string</markup>"
Is that even possible in a single line?
For a constant regexp like in your example, do
re.sub("(this)(.*)(string)",
r'<markup>\1</markup>\2<markup>\3</markup>',
text)
Note that you need to enclose .* in parentheses as well if you don't want to lose it.
Now if you don't know what the regexp looks like, it's more difficult, but should be doable.
pattern = "(this)(.*)(string)"
re.sub(pattern,
lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 0
else s for n, s in enumerate(m.groups())),
text)
If the first thing matched by your pattern doesn't necessarily have to be marked up, use this instead, with the first group optionally matching some prefix text that should be left alone:
pattern = "()(this)(.*)(string)"
re.sub(pattern,
lambda m: ''.join('<markup>%s</markup>' % s if n % 2 == 1
else s for n, s in enumerate(m.groups())),
text)
You get the idea.
If your regexps are complicated and you're not sure you can make everything part of a group, where only every second group needs to be marked up, you might do something smarter with a more complicated function:
pattern = "(this).*(string)"
def replacement(m):
s = m.group()
n_groups = len(m.groups())
# assume groups do not overlap and are listed left-to-right
for i in range(n_groups, 0, -1):
lo, hi = m.span(i)
s = s[:lo] + '<markup>' + s[lo:hi] + '</markup>' + s[hi:]
return s
re.sub(pattern, replacement, text)
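As a quick check (the sample text is made up), translating the span indices to be relative to the match keeps the markup aligned even when the match does not start at the beginning of the string:
>>> text = "note: this is my string"
>>> re.sub(pattern, replacement, text)
'note: <markup>this</markup> is my <markup>string</markup>'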
If you need to handle overlapping groups, you're on your own, but it should be doable.
re.sub() will replace everything it can. If you pass it a function for repl then you can do even more.
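For instance, a one-line sketch of that idea on the question's sample string:
>>> re.sub(r"(this|string)", lambda m: "<markup>%s</markup>" % m.group(1), "this is my string")
'<markup>this</markup> is my <markup>string</markup>'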
Yes, this can be done in a single line.
>>> re.sub(r"\b(this|string)\b", r"<markup>\1</markup>", "this is my string")
'<markup>this</markup> is my <markup>string</markup>'
\b ensures that only complete words are matched.
So if you have a list of words that you need to mark up, you could do the following:
>>> mywords = ["this", "string", "words"]
>>> myre = r"\b(" + "|".join(mywords) + r")\b"
>>> re.sub(myre, r"<markup>\1</markup>", "this is my string with many words!")
'<markup>this</markup> is my <markup>string</markup> with many <markup>words</markup>!'