Find substring inside parentheses and single quotes - python

I would like to get the string that is within the parentheses and within the ''.
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
result = re.findall(r'\((.*?) *\)', a_string)[0][1:-1]
print(result)
This block works. But isn't there a more elegant way?

Using re.findall() to retrieve only the first match is inefficient; re.search() is a better fit because it returns as soon as the first match is found. You can also simplify your regular expression to \('(.+?)'\).
Code:
import re
...
result = re.search(r"\('(.+?)'\)", a_string).group(1)
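One caveat: re.search() returns None when nothing matches, so calling .group(1) directly on the result would raise AttributeError. A minimal sketch of a guarded version follows; the helper name extract_quoted_arg is hypothetical and not part of the answer above.

```python
import re

def extract_quoted_arg(text):
    """Return the text between (' and '), or None if there is no match."""
    match = re.search(r"\('(.+?)'\)", text)
    # re.search returns None on failure, so check before calling .group()
    return match.group(1) if match else None

a_string = "_TableStyle('Table Grid') id: 219287880"
result = extract_quoted_arg(a_string)
print(result)
```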

If you have only one pair of parentheses in your line, you can do it without regex:
a_string = "_TableStyle('Table Grid') id: 219287880"
openParenthesis = a_string.index("('") + 2
closeParenthesis = a_string.index("')", openParenthesis)
# 'Table Grid'
a_string[openParenthesis:closeParenthesis]
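Note that str.index raises ValueError when a marker is missing. The same slicing idea can be wrapped in a small function, sketched below; the name between_markers and its default arguments are assumptions made for illustration.

```python
def between_markers(text, start="('", end="')"):
    """Return the text between the first start marker and the next end marker."""
    try:
        begin = text.index(start) + len(start)
        stop = text.index(end, begin)
    except ValueError:
        # One of the markers is missing
        return None
    return text[begin:stop]

a_string = "_TableStyle('Table Grid') id: 219287880"
print(between_markers(a_string))
```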

Yes, there is another way to achieve this:
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
result = re.findall(r'\(\'(.*?) *\'\)', a_string)[0]
print(result)
This pattern also matches the surrounding single quotes, so it extracts only the text between (' and ').
Edit:
We can also write the code like this,
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
result = re.search(r'\(\'(.*?) *\'\)', a_string).group(1)
print(result)

Use double quotes around the literal if you do not want to escape single quotes. Triple quotes are even nicer. Getting part of a regex match is simple with capturing groups:
import re
a_string = "_TableStyle('Table Grid') id: 219287880"
match = re.search(r"""\('([^']*)'\)""", a_string)
if match:
    print(match.group(1))
EXPLANATION
--------------------------------------------------------------------------------
  \(                       '('
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
  \)                       ')'


Getting pattern matched from match object

I'm working with Python regex and I'm trying to get the pattern that matched from a match object, not the matched text itself.
I have some patterns to replace and I'm doing this:
import re
patterns = {
r'^[ |\n]+': '',
r'[ |\n]+$': '',
r'[ |\n]+': ' '
}
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join(patterns.keys()),
lambda match: patterns[ match.group(0) ],
text)
But this is wrong, because match.group(0) returns the matched text, so it will never be equal to any key of the patterns dict.
I tried match.pattern but got an exception, and I tried match.re, but that gives the whole compiled re object, whose pattern for this problem is '^[ |\n]+|[ |\n]+$|[ |\n]+'.
EDIT: based on Barmar's solution I got this:
import re
patterns = [
(r'^[ |\n]+', ''),
(r'[ |\n]+$', ''),
(r'[ |\n]+', ' ')
]
def getreplacement(match):
    for i, group in enumerate(match.groups()):
        if group:
            return patterns[ i ][ 1 ]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[ 0 ] + ')' for p in patterns), getreplacement, text)
print(text)
But this is still not a way to always get the pattern from a match group.
I don't think there's a way to find out directly which alternative matched.
Use a list instead of a dictionary, and put each pattern in a capture group. Then you can see which capture group matched, and use that as the index to get the corresponding replacement.
Note that this won't work if there are any capture groups in the patterns. If groups are needed, make sure they're non-capturing.
import re
patterns = [
(r'^[ |\n]+', ''),
(r'[ |\n]+$', ''),
(r'[ |\n]+', ' ')
]
def getreplacement(match):
    # Group numbers start at 1; find the group that took part in the match
    for i in range(1, len(match.groups()) + 1):
        if match.group(i) is not None:
            return patterns[i-1][1]
text = ' Hello there, I\n need your help here plase :) '
text = re.sub('|'.join('(' + p[0] + ')' for p in patterns), getreplacement, text)
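As a possible shortcut, the match object's lastindex attribute holds the number of the last capture group that matched. Assuming each alternative is wrapped in exactly one group and contains no capture groups of its own, that number identifies the alternative. The sketch below uses [ \n]+ without the pipe, which is a guess at the intended character class; the names combined, replacement_for and normalize are illustrative only.

```python
import re

patterns = [
    (r"^[ \n]+", ""),   # leading whitespace
    (r"[ \n]+$", ""),   # trailing whitespace
    (r"[ \n]+", " "),   # any other run of whitespace
]

# One capture group per alternative, in the same order as the list
combined = re.compile("|".join(f"({p})" for p, _ in patterns))

def replacement_for(match):
    # lastindex is 1-based, the list is 0-based
    return patterns[match.lastindex - 1][1]

def normalize(text):
    return combined.sub(replacement_for, text)

print(repr(normalize("  Hello there, I\n need   help  ")))
```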
If I got it right, you want to strip leading and trailing spaces and reduce the ones in the middle to just one.
First, your code likely has a bug: [ |\n] will match a space ( ), a pipe (|), or a newline. You probably don't want to match a pipe, but you might want to match all whitespace characters, such as tabs (\t).
Second, styling: keep your lines under 80 characters and don't put spaces around indices in brackets.
Third, removing the leading and trailing spaces is simply done with str.strip. The only thing left to replace is sequences of two or more whitespace characters, which are easily matched with \s{2,} (\s = "whitespace", {2,} = "two or more").
Here is a modification of your code:
import re
patterns = [
(r"^[ |\n]+", ""),
(r"[ |\n]+$", ""),
(r"[ |\n]+", " "),
]
def get_replacement(m: re.Match) -> str:
    return next(
        patterns[i][1]
        for i, group in enumerate(m.groups())
        if group is not None
    )
text = (
"\n"
" \t Hello there, I\n need your help here plase :) \t \n"
" \t Hello there, I\n need your help here plase :) \t "
"\n"
)
result1 = re.sub(
"|".join(f"({p})" for p, _ in patterns),
get_replacement,
text,
)
result2 = re.sub(r"[ \n]{2,}", " ", text.strip())
result3 = re.sub(r"\s{2,}", " ", text.strip())
print(repr(result1))
print(repr(result2))
print(repr(result3))
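If every run of whitespace, including a single newline or tab, should become one space, a regex-free sketch is possible with str.split. Called without arguments, it splits on any whitespace run and discards leading and trailing whitespace. Note that this is a different technique from \s{2,}, which leaves a lone newline or tab untouched. The function name collapse_whitespace is hypothetical.

```python
def collapse_whitespace(text):
    """Strip the ends and replace every whitespace run with one space."""
    # split() with no arguments splits on runs of any whitespace
    return " ".join(text.split())

sample = "\n \t Hello there, I\n need your help here plase :) \t \n"
print(repr(collapse_whitespace(sample)))
```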

How can I remove a specific character from multi line string using regex in python

I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is to skip the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
Simply put, if a line starts with a ':' I'm trying to ignore that ':'.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I can't see my mistake. Could anyone please tell me where I am going wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
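If the goal is simply "drop a colon that starts a line", a line-anchored pattern may be easier to reason about. The sketch below is an alternative to the regex above, not a restatement of it: with re.MULTILINE, ^ matches at the start of every line, and the optional indentation is captured so it can be put back. The sample string's indentation is an assumption, since the original whitespace is not recoverable here.

```python
import re

# Sample input; the exact indentation is assumed for illustration
st = "emp:firstinfo\n\n    :secondinfo\n\n  thirdinfo\n"

# ^ matches at each line start because of re.MULTILINE.
# Group 1 keeps any leading spaces or tabs; the colon after them is dropped.
cleaned = re.sub(r"^([ \t]*):", r"\1", st, flags=re.MULTILINE)
print(cleaned)
```

The colon in emp:firstinfo is left alone because it is not the first non-blank character on its line.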
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# count=0 (the default) replaces every match; set count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
# Import the regular expression module
import re
# Remove every lowercase ASCII letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)

How to use join and regex?

I'm trying to add \n after a quotation mark (") that is followed by a space.
The closest I could find is re.sub; however, it removes certain characters.
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
q = re.sub(r'[\d\w]" ', '\n', line)
print(q)
Output:
Type: "SecurityInciden\nRowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F2\n
I'm looking for a solution that does not remove any characters.
Your attempted regex [\d\w]" is almost fine but has a few shortcomings. You don't need \d alongside \w in a character set, because \w already includes the digits. Since \w alone matches a letter, digit, or underscore, the character set [] is unnecessary as well, so your regex simplifies to \w".
But if you match this regex and substitute \n for it, it matches the literal letter t, then " and a space, and replaces all three with \n, which is why you are getting this output:
SecurityInciden\nRowID
You need to capture the matched text in group 1 and use it in the replacement so that it is not lost. The replacement should therefore be \1\n instead of just \n.
Try this updated regex,
(\w" )
And replace it by \1\n
If you look closely, the first line now ends with an extra space. If you don't want that space, move it outside the capturing parentheses and use this regex:
(\w") 
     ^ space here
Here is a sample python code,
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
q = re.sub(r'(\w") ', r'\1\n', line)
print(q)
Output,
Type: "SecurityIncident"
RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"
Try this:
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
pattern = re.compile(r'(\w+): (".+?"\s?)', re.IGNORECASE)
q = re.sub(pattern, r'\g<1>: \g<2>\n', line)
print(repr(q))
It should give you the following result:
'Type: "SecurityIncident" \nRowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"\n'
In your regex you are removing the t from incident because you are matching it and not using it in the replacement.
Another option to get your result might be to split on a double quote followed by a whitespace when preceded with a word character using a positive lookbehind.
Then join the result back together using a newline.
(?<=\w)"
For example:
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
print("\n".join(re.split(r'(?<=\w)" ', line)))
Result (the split consumes the double quote it matches, so the first line loses its closing quote):
Type: "SecurityIncident
RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"
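To keep the closing quote, the double quote can be moved inside the lookbehind so that only the space is consumed by the split. This is a sketch of a variation on the pattern above, not the answerer's own code; the variable names are illustrative.

```python
import re

line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'

# The lookbehind (?<=\w") asserts a word character and a double quote
# before the space without consuming them, so only the space is removed.
parts = re.split(r'(?<=\w") ', line)
result = "\n".join(parts)
print(result)
```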

regex and python

I have a string:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
I want to replace the "T" character in the timestamp with a space, i.e. change it to:
"123ABC,'2009-12-23 23:45:58.544-04:00'"
I am trying this:
newString = re.sub('(?:\-\d{2})T(?:\d{2}\:)', ' ', myString)
BUT, the returned string is:
"123ABC,'2009-12 45:58.544-04:00'"
The "non capturing groups" don't appear to be "non capturing", and it's removing everything. What am I doing wrong?
You can use lookarounds (positive lookbehind and -ahead):
(?<=\d)T(?=\d)
See a demo on regex101.com.
In Python this would be:
import re
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
rx = r'(?<=\d)T(?=\d)'
# match a T surrounded by digits
new_string = re.sub(rx, ' ', myString)
print(new_string)
# 123ABC,'2009-12-23 23:45:58.544-04:00'
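As for why the original attempt removed too much: a non-capturing group only skips creating a numbered group. The text it matches is still part of the overall match, and re.sub replaces the whole match. Lookarounds, by contrast, assert their context without consuming it. A small sketch to make the difference visible (the variable names are illustrative):

```python
import re

my_string = "123ABC,'2009-12-23T23:45:58.544-04:00'"

# Non-capturing groups still consume text: the whole match gets replaced
m = re.search(r"(?:-\d{2})T(?:\d{2}:)", my_string)
consumed = m.group(0)

# Lookarounds match only the T itself
fixed = re.sub(r"(?<=\d)T(?=\d)", " ", my_string)

print(repr(consumed))
print(fixed)
```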
See a demo on ideone.com.
Regex seems like overkill here, as long as no other T can appear in the string:
myString.replace("T", " ")
I'd use capturing groups, since unanchored lookbehinds are costly in terms of regex performance:
(\d)T(\d)
Then replace with the r'\1 \2' replacement pattern, whose backreferences put back the digits before and after T.
Python demo:
import re
s = "123ABC,'2009-12-23T23:45:58.544-04:00'"
reg = re.compile(r'(\d)T(\d)')
s = reg.sub(r'\1 \2', s)
print(s)
That T sits between digits and is always the rightmost T in the string, so you could use rsplit and join:
myString = "123ABC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ABC,'2009-12-23 23:45:58.544-04:00'"
Trying this on a string with another T earlier on:
myString = "123ATC,'2009-12-23T23:45:58.544-04:00'"
s = ' '.join(myString.rsplit('T', maxsplit=1))
print(s)
# "123ATC,'2009-12-23 23:45:58.544-04:00'"

Regex replace with negative look ahead in Python

I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use look-head as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print [fixRepeatedQuotes(str) for str in alist]
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!
This should work:
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
explanation
"""
(?:        # Start a non-capturing group (the ?: makes it non-capturing)
  '        # Match the character “'” literally
)
(          # Match the regular expression below and capture its match into backreference number 1
  [^'"]    # Match a single character NOT present in the list “'"” (any character except a single or double quote)
  +        # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?:        # Start a non-capturing group
  '        # Match the character “'” literally
)
"""
re.sub accepts a function as the replacement. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(1), "'ABC'")
yields
"ABC"

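For the quote-stripping question above, one way to leave the escaped entries untouched is to anchor the pattern to the whole string and forbid quotes and backslashes inside it. This is a sketch under the assumption that "regular text" means a string wrapped in exactly one pair of single quotes with no backslash or quote inside; the function name strip_plain_quotes is hypothetical.

```python
import re

alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]',
         "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]

def strip_plain_quotes(text):
    # ^ and $ anchor the pattern to the whole string; the character class
    # rejects any inner quote or backslash, so escaped entries do not match.
    return re.sub(r"^'([^'\\]*)'$", r"\1", text)

cleaned = [strip_plain_quotes(s) for s in alist]
print(cleaned)
```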