python how to replace string by regex group?

python how to replace string by regex group? - python

Give an string like '/apps/platform/app/app_name/etc', I can use
p = re.compile('/apps/(?P<p1>.*)/app/(?P<p2>.*)/')
to get two matched groups of platform and app_name, but how can I use re.sub function (or maybe better way) to replace those two groups with other string like windows and facebook? So the final string would like /apps/windows/app/facebook/etc.

Separate group replacement wouldn't be possible through regex. So i suggest you to do like this.
(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/
DEMO
Then replace the matched characters with windows\2facebook/ . And also i suggest you to define your regex as raw string. Lookbehind is used inorder to avoid extra capturing group.
>>> s = '/apps/platform/app/app_name/etc'
>>> re.sub(r'(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/', r'windows\2facebook/', s)
'/apps/windows/app/facebook/etc'

Related

Regular expression match / split

I am having some trouble trying to figure out how to use regular expressions in python. Ultimately I am trying to do what sscanf does for me in C.
I am trying to match given strings that look like so:
12345_arbitrarystring_2020_05_20_10_10_10.dat
I (seem) to be able to validate this format by calling match on the following regular expression
regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')
(Note that I do allow for a few other separators then just '_')
I would like to split the given string on these separators so I do:
regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string)
This is all fine .. the problem is that I would like my 'arbitrarystring' part to include '-' and '_' and the last split currently, well, splits them.
Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?

You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.
You could for example use a character class to match 1+ word characters or a hyphen using [\w-]+
If you still want to use split, you could add capturing groups for the first and the second part, and split only those groups.
^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
^^^^^^^^
Regex demo

It seems to be possible to cut down your regex to validate the whole pattern to:
^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$
Refer to group 1 for your arbitrary string.
Online demo
Quick reminder: You didn't seem to have used raw strings, but instead escaping with a double backslash. Python has raw strings which makes you don't have to escape backslashes nomore.

Modifying a regular expression by another by adding a something to it

I am trying to modifying my regex expression using replace. What ultimately want to do is to add 01/ in the front of my existing pattern.It is litterally replacing a pattern by another.
Here is what I am doing with replace:
df['found_d'].str.replace(pattern2, '1/'+pattern2)
#must be str, not _sre.SRE_Pattern
I would like to use sub it takes 3 arguments and I am not too sure of how to use it at this point.
Here is an expected input:
df['found_d']= 01/07/91 or 01/07/1991
I need to add a missing date to my pattern.

No need for callables, re provides dedicated means to access the matched text during replacement.
In order to append a literal 01/ to a pattern match, use a \g<0> unambiguous backreference to the whole pattern in the replacement pattern rather than using the regex pattern:
df['found_d'] = df['found_d'].str.replace(pattern2, r'01/\g<0>')
^^^^^^^^^^^

Starting from version 0.20, pandas str.replace can accept a callable that will receive a match object. For example if a column has a pattern of 2 uppercase letters followed with 2 decimal digits and you would want to reverse them with a colon between, you could use:
df['col'] = df['col'].str.replace(r'([A-Z]{2})([0-9]{2})',
lamdba m: "{}:{}".format(m.group(2), m.group(1)))
It gives you the full power of the re module inside pandas, changing here 'AB12' with '12:AB'

Regular expressions: distinguish strings including/excluding a given word

I'm working in Python and try to handle StatsModel's GLM output. I'm relatively new to regular expressions.
I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
pattern = re.compile('C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern. But we want to get State in both cases. Basically we want to get read of everything after comma (including it) inside first brackets.
How can we distinguish the string_2 pattern from string_1's and extract only State without , Treatment?

You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile('C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups the contents together without capturing it. The trailing ? makes the group optional.

You may use this regex using negative character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
RegEx Demo

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
with open('ex06-11.html') as f:
a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
print re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read())
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... //1 == [abc] ... [abc]. Am I incorrect?

From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
a = [m.group(2) for m in regex.finditer(f.read())
A couple of side notes, you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within you string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.

The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if you have a capture group in your regex pattern, then findall method returns a list of tuple, containing all the captured groups for a particular match, plus the group(0).
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside the loop, so that is is not re-compiled on every iteration. Use re.compile for that.

When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, what Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of a backreferences (and the fact that they reference back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return matches from all groups and put them in a list of tuples:
print p.findall(s)
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print p.findall(s)[0][1]
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print p.search(s).group(2)
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.

Accessing matched substrings when substituting using regular expressions in Python

I want to match two regular expressions A and B where A and B appear as 'AB'. I want to then insert a space between A and B so that it becomes 'A B'.
For example, if A = [0-9] and B = !+, I want to do something like the following.
match = re.sub('[0-9]!+', '[0-9] !+', input_string)
But, this obviously does not work as this will replace any matches with a string '[0-9] !+'.
How do I do this in regular expressions (preferably in one line)? Or does this require several tedious steps?

Use the groups!
match = re.sub('([0-9])(!+)', r'\1 \2', input_string);
\1 and \2 indicate the first and second parenthesised fragment. The prefix r is used to keep the \ character intact.

Suppose the input string is "I have 5G network" but you want whitespace between 5 and G i.e. whenever there are expressions like G20 or AK47, you want to separate the digit and the alphabets (I have 5 G network). In this case, you need to replace a regex expression with another regular expression. Something like this:
re.sub(r'\w\d',r'\w \d',input_string)
But this won't work as the substituting string will not retain the string caught by the first regular expression.
Solution:
It can be easily solved by accessing the groups in the regex substitution. This method will work well if you want to add something to the spotted groups.
re.sub(r"(\..*$)",r"_BACK\1","my_file.jpg") and re.sub(r'(\d+)',r'<num>\1</num>',"I have 25 cents")
You can use this method to solve your question as well by capturing two groups instead of one.
re.sub(r"([A-Z])(\d)",r"\1 \2",input_string)
Another way to do it, is by using lambda functions:
re.sub(r"(\w\d)",lambda d: d.group(0)[0]+' '+d.group(0)[1],input_string)
And another way of doing it is by using look-aheads:
re.sub(r"(?<=[A-Z])(?=\d)",r" ",input_string)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python how to replace string by regex group? - python

Related

Regular expression match / split

Modifying a regular expression by another by adding a something to it

Regular expressions: distinguish strings including/excluding a given word

Backreferencing in Python: findall() method output for HTML string

Accessing matched substrings when substituting using regular expressions in Python

Categories

Resources