split string based on regular expression value in python - python

Suppose I have a string:
exp = '"security_datafilter"."PRODUCT_CATEGORIES"."CATEGORY_NAME" IN ("CPU","Storage")'
I want to split the string on the word IN,
so my expected result is:
['"security_datafilter"."PRODUCT_CATEGORIES"."CATEGORY_NAME"','IN','("CPU","Storage")']
but in my case it doesn't work.
This is what I have tried:
import re
exp_split = re.split(r'( in )',exp,re.I)

re documentation:
re.split(pattern, string, maxsplit=0, flags=0)
The split() function expects that the third positional argument is the maxsplit argument. Your code gives re.I to maxsplit and no flags. You should give flags as a keyword argument like so:
exp_split = re.split(r'( in )',exp, flags=re.I)
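A minimal sketch of the difference, reusing exp from the question. Passing re.I positionally has the same effect as maxsplit=2, because re.I has the integer value 2; the names as_maxsplit and as_flag are only illustrative.

```python
import re

exp = '"security_datafilter"."PRODUCT_CATEGORIES"."CATEGORY_NAME" IN ("CPU","Storage")'

# What the question's call effectively does: re.I (integer value 2) ends up
# in maxsplit, so matching stays case-sensitive and ' in ' never matches ' IN '.
as_maxsplit = re.split(r'( in )', exp, maxsplit=int(re.I))

# Passing the flag by keyword makes the match case-insensitive.
as_flag = re.split(r'( in )', exp, flags=re.I)

print(as_maxsplit)  # one-element list holding the unsplit string
print(as_flag)      # three pieces; the middle one is ' IN ' including its spaces
```

Because the spaces sit inside the capturing group, they are kept in the middle element of the second result.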

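Your pattern has to match the delimiter's case (IN, not in) unless re.I is passed through the flags keyword; as a third positional argument it is taken as maxsplit. If you don't want the spaces in your result, keep them outside your capturing group:
exp_split = re.split(r'\s(IN)\s', exp, flags=re.I)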
exp_split
Output
['"security_datafilter"."PRODUCT_CATEGORIES"."CATEGORY_NAME"', 'IN', '("CPU","Storage")']


Ignore an optional word if present in a string - regular expression in python

I'm trying to match a string with regular expression using Python, but ignore an optional word if it's present.
For example, I have the following lines:
First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]
I'm looking to capture everything before [Ignore This Part]. Notice I also want to exclude the whitespace before [Ignore This Part]. Therefore my results should look like this:
First string
Second string
Third string (1)
I have tried the following regular expression with no luck, because it still captures [Ignore This Part]:
.+(?:\s\[.+\])?
Any assistance would be appreciated.
I'm using Python 3.8 on Windows 10.
Edit: The examples are meant to be processed one line at a time.
Use [^[] instead of . so it doesn't match anything with square brackets and doesn't match across newlines.
^[^[\n]+(?:\s\[.+\])?
Perhaps you can remove the part that you don't want to match:
[^\S\n]*\[[^][\n]*]$
Explanation
[^\S\n]* Match optional spaces
\[[^][\n]*] Match from [....]
$ End of string
Example
import re
pattern = r"[^\S\n]*\[[^][\n]*]$"
s = ("First string\n"
"Second string [Ignore This Part]\n"
"Third string (1) [Ignore This Part]")
result = re.sub(pattern, "", s, count=0, flags=re.M)
if result:
    print(result)
Output
First string
Second string
Third string (1)
If you don't want to be left with an empty string, you can assert a non whitespace char to the left:
(?<=\S)[^\S\n]*\[[^][\n]*]$
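A small sketch comparing the two removal patterns above, assuming an input where one line consists only of the bracketed part. The variable names are illustrative.

```python
import re

# Pattern from the answer, and the variant guarded by a lookbehind.
plain = re.compile(r"[^\S\n]*\[[^][\n]*]$", flags=re.M)
guarded = re.compile(r"(?<=\S)[^\S\n]*\[[^][\n]*]$", flags=re.M)

lines = "Second string [Ignore This Part]\n[Ignore This Part]"

# Without the lookbehind, the second line is emptied completely.
stripped_plain = plain.sub("", lines)

# With the lookbehind, a bracketed part is removed only when a
# non-whitespace character precedes it, so the second line is left alone.
stripped_guarded = guarded.sub("", lines)

print(repr(stripped_plain))
print(repr(stripped_guarded))
```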
With your shown samples, please try following code, written and tested in Python3.
import re
var="""First string
Second string [Ignore This Part]
Third string (1) [Ignore This Part]"""
[x for x in list(map(lambda x:x.strip(),re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var))) if x]
Output will be as follows, in form of list which could be accessed as per requirement.
['First string', 'Second string', 'Third string (1)']
Here is the complete detailed explanation for above Python3 code:
First, re.split is called with the regex (.*?)(?:$|\s\[[^]]*\]) and the multiline flag enabled. The complete call is: re.split(r'(?m)(.*?)(?:$|\s\[[^]]*\])',var)
Its output is then passed through a lambda that calls strip() on each element to remove the surrounding newlines.
map applies that lambda to every element, and list() builds a list from the result.
Finally, the empty strings are filtered out of the list, leaving only the parts the OP asked for.
You may use this regex:
^.+?(?=$|\s*\[[^]]*]$)
If you want better performing regex then I suggest:
^\S+(?:\s+\S+)*?(?=$|\s*\[[^]]*]$)
RegEx Details:
^: Start
.+?: Match 1+ of any characters (lazy match)
(?=: Start lookahead
$: End
|: OR
\s*: Match 0 or more whitespaces
\[[^]]*]: Match [...] text
$: End
): Close lookahead
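As a rough sketch, the first lookahead pattern above can be applied one line at a time, as the question asks. The list name kept and the loop are illustrative choices.

```python
import re

# Lazy match up to either the end of the line or an optional
# run of whitespace followed by a trailing [...] block.
pattern = re.compile(r"^.+?(?=$|\s*\[[^]]*]$)")

lines = [
    "First string",
    "Second string [Ignore This Part]",
    "Third string (1) [Ignore This Part]",
]

kept = []
for line in lines:
    m = pattern.match(line)
    if m:  # skip lines that do not match (for example, empty lines)
        kept.append(m.group())

print(kept)
```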

Replace substring surrounding main string

I have a lot of strings like the following:
\frac{l_{2}\,\mathrm{phi2dd}\,\sin\left(\varphi _{2}\right)}{2}
I want to replace the \frac{***}{2} to \frac{1}{2} ***
The desired string would then become:
\frac{1}{2} l_{2}\,\mathrm{phi2dd}\,\sin\left(\varphi _{2}\right)
I thought I could use a regular expression to do so, but I can't quite figure out how to extract the 'main string' from the substring.
Update: I simplified the problem a bit too much. The strings I have to replace actually contain multiple 'fracs', like so:
I_{2}\,\mathrm{phi2dd}-\frac{l_{2}\,\mathrm{lm}_{4}\,\cos\left(\varphi _{2}\right)}{2}+\frac{l_{2}\,\mathrm{lm}_{3}\,\sin\left(\varphi _{2}\right)}{2}=0
I don't know the number of occurrences in the string; it varies.
Match using \\frac\{(.*?)\}\{2} and substitute using \\frac{1}{2} \1
Updated code:
import re
regex = r"\\frac\{(.*?)\}\{2}"
test_str = "I_{2}\\,\\mathrm{phi2dd}-\\frac{l_{2}\\,\\mathrm{lm}_{4}\\,\\cos\\left(\\varphi _{2}\\right)}{2}+\\frac{l_{2}\\,\\mathrm{lm}_{3}\\,\\sin\\left(\\varphi _{2}\\right)}{2}=0"
subst = "\\\\frac{1}{2} \\1"
# The 4th argument (count) limits how many occurrences to replace; 0 means all
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)

How to print regex match results in python 3?

I was in IDLE, and decided to use regex to sort out a string. But when I typed in what the online tutorial told me to, all it would do was print:
<_sre.SRE_Match object at 0x00000000031D7E68>
Full program:
import re
reg = re.compile("[a-z]+8?")
str = "ccc8"
print(reg.match(str))
result:
<_sre.SRE_Match object at 0x00000000031D7ED0>
Could anybody tell me how to actually print the result?
You need to call .group() on the result of the match function to print the matched string; otherwise you only see the match object itself, which tells you whether a match happened. To print the characters captured by a capturing group, pass the corresponding group index to .group().
>>> import re
>>> reg = re.compile("[a-z]+8?")
>>> str = "ccc8"
>>> print(reg.match(str).group())
ccc8
Regex with capturing group.
>>> reg = re.compile("([a-z]+)8?")
>>> print(reg.match(str).group(1))
ccc
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
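A short sketch of that last note, using a made-up two-line string: with re.MULTILINE, match() still only tries the very beginning of the string, while search() can find a match at the start of a later line.

```python
import re

text = "first line\nccc8"  # illustrative input, not from the question
pattern = re.compile(r"^[a-z]+8", re.MULTILINE)

# match() is anchored at position 0, where "first" is not followed by "8".
m_match = pattern.match(text)

# search() scans forward, and ^ can match at the start of the second line.
m_search = pattern.search(text)

print(m_match)
print(m_search.group())
```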
If you need to get the whole match value, you should use
m = re.match(r"[a-z]+8?", text)
if m:  # Always check if a match occurred to avoid NoneType issues
    print(m.group())  # Print the match string
If you need to extract a part of the regex match, you need to use capturing groups in your regular expression. Enclose those patterns with a pair of unescaped parentheses.
To only print captured group results, use Match.groups:
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
So, to get ccc and 8 and display only those, you may use
import re
reg = re.compile("([a-z]+)(8?)")
s = "ccc8"
m = reg.match(s)
if m:
    print(m.groups()) # => ('ccc', '8')

Search and replace string with reverse

I am trying to find words in a string and replace them with themselves in reverse-form.
So, when I have This 17, I want to put out sihT 17.
But I don't know how to reverse the string itself in re.sub()
import re
pat_word = re.compile("[a-zA-Z]+")
input = raw_input ("Input: ")
match = pat_word.findall(input)
if match:
    s = re.sub(pat_word, "reverse", input)
    print s
You can use a function inside re.sub:
s = re.sub(pat_word, lambda m:m.group(0)[::-1], input)
Or simply:
s = pat_word.sub(lambda m:m.group(0)[::-1], input)
From help(re.sub):
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
Note that input is a built-in function in Python, so don't use it as a variable name.
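Putting the pieces together, here is a minimal Python 3 sketch of the callable-replacement approach. The function name reverse_words and the parameter name text are illustrative; text is used instead of input to avoid shadowing the built-in.

```python
import re

pat_word = re.compile(r"[a-zA-Z]+")

def reverse_words(text):
    # The lambda receives each match object and returns the matched
    # word reversed; everything that is not a word is left untouched.
    return pat_word.sub(lambda m: m.group(0)[::-1], text)

reversed_text = reverse_words("This 17")
print(reversed_text)
```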

Regular expression to replace with XML node

I'm using Python to write a regular expression for replacing parts of the string with a XML node.
The source string looks like:
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
And the result string should be like:
Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>
Can anyone help me?
What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.
Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
def mksub(m):
    return '<replace name="%s">%s</replace>' % m.groups()
s_output = re.sub(pat, mksub, s_input)
The only tricky part is the regular expression pattern. Let's look at it in detail.
^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.
\s* matches optional whitespace.
REPLACE matches the literal string "REPLACE".
\( matches the literal string "(".
( begins a "match group".
[^)] means "match any character except )".
+ means "match one or more of the preceding pattern".
) closes a "match group".
\) matches the literal string ")"
(.*) is another match group containing ".*".
$ matches the end of a string. With re.MULTILINE, this also matches the end of each line within a multiline string; in other words, it matches just before a newline character in the string, without consuming it.
. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.
So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.
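A small check of how the ^ and $ anchors behave under re.MULTILINE, using a made-up two-line string. Both anchors are zero-width: $ matches at the position just before a newline and does not consume it.

```python
import re

text = "one\ntwo"  # illustrative input

# Record each matched line together with its start and end offsets.
matches = [
    (m.group(), m.start(), m.end())
    for m in re.finditer(r"^\w+$", text, flags=re.MULTILINE)
]

print(matches)
# The first match ends at offset 3, which is where the newline sits,
# so the newline itself is not part of the match.
print(repr(text[3]))
```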
EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds to the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1". But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:
import re
s_pat = r"^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_repl = r'<replace name="\1">\2</replace>'
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
s_output = re.sub(pat, s_repl, s_input)
Here is an excellent tutorial on how to write regular expressions in Python.
Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.
src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""
from pyparsing import Suppress, Word, alphas, alphanums, restOfLine
LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
    lambda toks : '<replace name="%(name)s">%(body)s </replace>' % toks
)
print(replExpr.transformString(src))
In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in @steveha's solution.
In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:
s_pat = r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:
class CallableDict(object):
    def __init__(self,fn):
        self.fn = fn
    def __getitem__(self,name):
        return self.fn(name)
def mksub(m):
    return '<replace name="%(name)s">%(body)s</replace>' % CallableDict(m.group)
s_output = re.sub(pat, mksub, s_input)
Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.
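As an alternative sketch that is not part of the answer above, the standard re module also exposes named groups directly: Match.groupdict() returns them as a real dict, and a replacement string can refer to them with \g<name>. The variable names via_function and via_template are illustrative.

```python
import re

pat = re.compile(r"^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$", re.MULTILINE)

s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""

def mksub(m):
    # groupdict() maps each named group to its matched text,
    # so ordinary %-style named interpolation works on it.
    return '<replace name="%(name)s">%(body)s</replace>' % m.groupdict()

via_function = pat.sub(mksub, s_input)

# The same substitution with a template string and \g<name> references.
via_template = pat.sub(r'<replace name="\g<name>">\g<body></replace>', s_input)

print(via_function)
```

Both calls produce the same text; which form is preferable depends on whether the replacement needs any logic beyond plain interpolation.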
Maybe like this ?
import re
mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')
for line in mystr.split("\n"):
    print(prog.sub(r'< replace name="\1" > \2',line))
Something like this should work:
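import re, sys
# Read the file named on the command line and rewrite REPLACE(...) lines
with open(sys.argv[1], 'r') as f:
    for line in f:
        g = re.match(r'REPLACE\((.*)\)(.*)', line)
        if g is None:
            # Pass other lines through unchanged (they keep their own newline)
            print(line, end='')
        else:
            print('<replace name="%s">%s</replace>' % (g.group(1), g.group(2)))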
import re
a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)
b=re.sub(regex, r'< replace name="\1" > \2 < /replace >', a)
print(b)
This does the replacement in one line.
