Accessing matched substrings when substituting using regular expressions in Python

I want to match two regular expressions A and B where A and B appear as 'AB'. I want to then insert a space between A and B so that it becomes 'A B'.
For example, if A = [0-9] and B = !+, I want to do something like the following.
match = re.sub('[0-9]!+', '[0-9] !+', input_string)
But, this obviously does not work as this will replace any matches with a string '[0-9] !+'.
How do I do this in regular expressions (preferably in one line)? Or does this require several tedious steps?

Use the groups!
match = re.sub('([0-9])(!+)', r'\1 \2', input_string)
\1 and \2 refer to the first and second parenthesised groups. The r prefix makes the replacement a raw string, so the backslashes are passed through to re.sub intact.
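For example, with a made-up input string:
>>> import re
>>> re.sub(r'([0-9])(!+)', r'\1 \2', 'code 5!!! found')
'code 5 !!! found'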

Suppose the input string is "I have 5G network" but you want whitespace between the 5 and the G, i.e. whenever there are expressions like G20 or AK47, you want to separate the digit and the letters ("I have 5 G network"). In that case you might try to replace one regular expression with another regular expression, something like this:
re.sub(r'\w\d',r'\w \d',input_string)
But this won't work, because the replacement string is taken literally and does not retain the text caught by the first regular expression.
Solution:
This is easily solved by accessing the captured groups in the substitution. The approach works well whenever you want to add something around the captured text.
re.sub(r"(\..*$)",r"_BACK\1","my_file.jpg") and re.sub(r'(\d+)',r'<num>\1</num>',"I have 25 cents")
You can use this method to solve your question as well by capturing two groups instead of one.
re.sub(r"([A-Z])(\d)",r"\1 \2",input_string)
Another way to do it is to use a lambda function as the replacement:
re.sub(r"(\w\d)",lambda d: d.group(0)[0]+' '+d.group(0)[1],input_string)
And another way of doing it is with look-arounds (a look-behind plus a look-ahead), inserting a space at the boundary:
re.sub(r"(?<=[A-Z])(?=\d)",r" ",input_string)

Related

How to replace '..' and '?.' with single periods and question marks in pandas? df['column'].str.replace not working

This is a follow-up to this SO post, which gives a solution for replacing text in a string column:
How to replace text in a column of a Pandas dataframe?
df['range'] = df['range'].str.replace(',','-')
However, this doesn't seem to work with double periods or a question mark followed by a period
testList = ['this is a.. test stence', 'for which is ?. was a time']
testDf = pd.DataFrame(testList, columns=['strings'])
testDf['strings'].str.replace('..', '.').head()
results in
0 ...........e
1 .............
Name: strings, dtype: object
and
testDf['strings'].str.replace('?.', '?').head()
results in
error: nothing to repeat at position 0
Add the regex=False parameter because, as you can see in the docs, regex is True by default:
regex : bool, default True
Determines if the passed-in pattern is a regular expression:
If True, assumes the passed-in pattern is a regular expression.
And ? and . are special characters in regular expressions.
So one way to do it without regex is this double replace:
testDf['strings'].str.replace('..', '.', regex=False).str.replace('?.', '?', regex=False)
Output:
strings
0 this is a. test stence
1 for which is ? was a time
Replace using a regular expression. In this case, replace the special character '.' when it is immediately followed by whitespace. This is a bit convoluted; I advise you to go with @Mark Reed's answer.
testDf.replace(regex=r'([.](?=\s))', value=r'')
strings
0 this is a. test stence
1 for which is ? was a time
str.replace() works with a regex, where . is a special character that denotes "any" character. If you want a literal dot, you need to escape it: "\.". The same goes for other special regex characters like ?.
First, be aware that the Pandas replace method is different from the standard Python one, which operates only on fixed strings. The Pandas one can behave as either the regular string.replace or re.sub (the regular-expression substitute method), depending on the value of a flag, and the default is to act like re.sub. So you need to treat your first argument as a regular expression. That means you do have to change the string, but it also has the benefit of allowing you to do both substitutions in a single call.
A regular expression isn't a string to be searched for literally, but a pattern that acts as instructions telling Python what to look for. Most characters just ask Python to match themselves, but some are special, and both . and ? happen to be in the special category.
The easiest thing to do is to use a character class to match either . or ? followed by a period, and remember which one it was so that it can be included in the replacement, just without the following period. That looks like this:
testDf.replace(regex=r'([.?])\.', value=r'\1')
The [.?] means "match either a period or a question mark"; since they're inside the [...], those normally-special characters don't need to be escaped. The parentheses around the square brackets tell Python to remember which of those two characters is the one it actually found. The next thing that has to be there in order to match is the period you're trying to get rid of, which has to be escaped with a backslash because this one's not inside [...].
In the replacement, the special sequence \1 means "whatever you found that matched the pattern between the first set of parentheses", so that's either the period or question mark. Since that's the entire replacement, the following period is removed.
Now, you'll notice I used raw strings (r'...') for both; that keeps Python from doing its own interpretation of the backslashes before replace can. If the replacement were just '\1' without the r it would replace them with character code 1 (control-A) instead of the first matched group.
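A quick plain re.sub illustration of that last point (the sample string here is made up for the demo):
>>> import re
>>> re.sub(r'([.?])\.', r'\1', 'this is a.. test? ok?.')
'this is a. test? ok?'
>>> re.sub(r'([.?])\.', '\1', 'this is a.. test? ok?.')
'this is a\x01 test? ok\x01'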
To replace both the ?. and .. cases at the same time, you can separate the patterns with | (the regex OR operator):
testDf['strings'].str.replace(r'\?.|\..', '.')
Prefix the . with a \, because . is a special regex character and needs to be escaped:
testDf['strings'].str.replace(r'\..', '.')
You can do the same with the ?, which is another special regex character:
testDf['strings'].str.replace(r'\?.', '.')
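A fully escaped variant of the same idea (escaping both characters so nothing is treated as a wildcard), chained to handle both cases on the question's frame; regex=True is written out explicitly in case your pandas version defaults to regex=False:
testDf['strings'].str.replace(r'\.\.', '.', regex=True).str.replace(r'\?\.', '?', regex=True)
Output:
0 this is a. test stence
1 for which is ? was a time
Name: strings, dtype: object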

Regular expression match / split

I am having some trouble trying to figure out how to use regular expressions in python. Ultimately I am trying to do what sscanf does for me in C.
I am trying to match given strings that look like so:
12345_arbitrarystring_2020_05_20_10_10_10.dat
I seem to be able to validate this format by calling match on the following regular expression:
regex = re.compile('[0-9]{5}_.+_[0-9]{4}([-_])[0-9]{2}([-_])[0-9]{2}([-_])[0-9]{2}([:_])[0-9]{2}([:_])[0-9]{2}\\.dat')
(Note that I do allow for a few other separators than just '_'.)
I would like to split the given string on these separators so I do:
regex = re.compile('[_\\-:.]+')
parts = regex.split(given_string)
This is all fine... the problem is that I would like my 'arbitrarystring' part to be able to include '-' and '_', and the last split currently, well, splits on them.
Other than manually cutting the timestamp and the first 5 digits off that given string, what can I do to get that arbitrarystring part?
You could use a capturing group to get the arbitrarystring part and omit the other capturing groups.
You could, for example, use a character class to match one or more word characters or hyphens: [\w-]+.
If you still want to use split, you could add capturing groups for the first and the second part, and split only those groups.
^[0-9]{5}_([\w-]+)_[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}[-_][0-9]{2}[:_][0-9]{2}[:_][0-9]{2}\.dat$
(the ([\w-]+) group is the part that captures the arbitrary string)
It seems your regex for validating the whole pattern can be cut down to:
^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$
Refer to group 1 for your arbitrary string.
Quick reminder: you didn't use raw strings, but escaped with double backslashes instead. Python's raw strings mean you don't have to escape the backslashes at all.
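For example, using the shorter pattern above as a raw string (the filename below is a variant of the one in the question, with a hyphen added to show that it is preserved):
>>> regex = re.compile(r'^\d{5}_(.+?)_\d{4}[-_](?:\d{2}[-_]){2}(?:\d{2}[:_]){2}\d{2}\.dat$')
>>> regex.match('12345_arbitrary-string_2020_05_20_10_10_10.dat').group(1)
'arbitrary-string'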

Regular Expression: Include Text After (...) Group

I am learning about regular expressions. I need to match things in a parenthesis group followed by some pattern that I define. When I try this with regular expressions (in Python), it only returns the part in parentheses that it matched, but not the pattern which follows it. An example should clarify:
import re
s = "texttoignore_ABCABC12345_moretexttoignore"
re.findall("(ABC)+\d+", s)
When I speak of the parenthesis group, in the example above this is the "(ABC)+" part. What I intend is for it to look for one or more repetitions of the pattern in parentheses (in this case "ABC"), then the pattern after.
The problem is this: it does not return the pattern after. (In this example, it would return 'ABC', but I would want 'ABCABC12345' or 'ABC12345' or better yet '12345')
How can you include the part after the parentheses in the return value? Is this something about regular expressions or is it specific to this Python method?
The "problem" here is that rather specific behavior of re.findall
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
You have a few options here. Either make your group non-capturing:
>>> re.findall("(?:ABC)+\d+", s)
['ABCABC12345']
or use re.finditer:
>>> [m.group(0) for m in re.finditer(r"(ABC)+\d+", s)]
['ABCABC12345']
If you only want to find the pattern once, then @Jkdc's approach from the comments works fine.
>>> re.search("(ABC)+\d+", s).group()
'ABCABC12345'

python how to replace string by regex group?

Given a string like '/apps/platform/app/app_name/etc', I can use
p = re.compile('/apps/(?P<p1>.*)/app/(?P<p2>.*)/')
to get the two matched groups platform and app_name, but how can I use the re.sub function (or maybe a better way) to replace those two groups with other strings like windows and facebook? So the final string would look like /apps/windows/app/facebook/etc.
Replacing the groups separately isn't possible through regex alone, so I suggest you do it like this:
(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/
Then replace the matched characters with windows\2facebook/. I also suggest you define your regex as a raw string. The lookbehind is used in order to avoid an extra capturing group.
>>> s = '/apps/platform/app/app_name/etc'
>>> re.sub(r'(?<=/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/', r'windows\2facebook/', s)
'/apps/windows/app/facebook/etc'
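A variant of the same idea that skips the lookbehind by capturing /apps/ as well and putting it back with a backreference (just a sketch of the alternative, not the answer above):
>>> re.sub(r'(/apps/)(?P<p1>.*)(/app/)(?P<p2>.*)/', r'\1windows\3facebook/', s)
'/apps/windows/app/facebook/etc'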

Python regular expression to match *#* followed by 0-7 followed by ##

I would like to intercept strings starting with *#*
followed by a number between 0 and 7
and ending with: ##
so something like *#*0##
but I could not find a regex for this.
Assuming you want to allow only one # before and two after, I'd do it like this:
r'^(\#{1}([0-7])\#{2})'
It's important to note that Alex's regex will also match things like
###7######
########1###
which may or may not matter.
My regex above matches a string starting with #[0-7]## and ignores the end of the string. You could tack a $ onto the end if you wanted it to match only if that's the entire line.
The first capture group gives you the entire #<number>## string and the second gives you just the number between the #s.
None of the above examples take the leading *#* into account:
^\*#\*[0-7]##$
Pass : *#*7##
Fail : *#*22324324##
Fail : *#3232#
The ^ character will match the start of the string, \* will match a single asterisk, the # characters do not need to be escaped in this example, and finally the [0-7] will only match a single character between 0 and 7.
r'\#[0-7]\#\#'
The regular expression should be like ^#[0-7]##$
As I understand the question, the simplest regular expression you need is:
rex= re.compile(r'^\*#\*([0-7])##$')
The {1} constructs are redundant.
After doing rex.match (or rex.search, but it's not necessary here), .group(1) of the match object contains the digit given.
EDIT: The whole matched string is always available as match.group(0). If all you need is the complete string, drop any parentheses in the regular expression:
rex= re.compile(r'^\*#\*[0-7]##$')
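A quick check of the first (grouped) form against the pass/fail examples given earlier:
>>> rex = re.compile(r'^\*#\*([0-7])##$')
>>> rex.match('*#*7##').group(1)
'7'
>>> rex.match('*#*22324324##') is None
True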
