How to create regular expression to substitute strings surrounded by parentheses? - python

I'm trying to remove every pair of parentheses together with what's inside them, but there is a problem: the output is left with whitespace at the start and end of lines.
Code:
import re
regex = r"\(.+?\)"
test_str = ("(a) method in/to one's madness\n"
"(all) by one's lonesome\n"
"(as) tough as (old boot's)\n"
" (at) any moment (now) \n"
"factors (in or into or out) \n"
" right-to-life\n"
"all mouth (and no trousers/action)\n"
"(it's a) small world\n"
" throw (someone) a bone ")
subst = ""
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
Result:
method in/to one's madness
by one's lonesome
tough as
any moment
factors
right-to-life
all mouth
small world
throw a bone
I tried other patterns to remove the \s from the start, but then, whenever a line ends with a space, the following line gets joined to the preceding one.

You can use
regex = r"[^\S\r\n]*\([^()]*\)"
result = "\n".join([x.strip() for x in re.sub(regex, "", test_str).splitlines()])
The [^\S\r\n]*\([^()]*\) regex will remove all instances of
[^\S\r\n]* - zero or more horizontal whitespaces and then
\([^()]*\) - (, any zero or more chars other than ( and ) and then )
The "\n".join([x.strip() for x in re.sub(regex, "", test_str).splitlines()]) part splits the text into lines, strips leading/trailing whitespace from each line and joins the lines back together with a line feed.
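As a rough sketch, the two steps above can be wrapped in a small helper. The function name strip_parenthesized and the shortened sample text are assumptions made for illustration; the pattern is the one given above.

```python
import re

# Pattern from the answer above: optional horizontal whitespace, then a
# parenthesized group that contains no nested parentheses.
PAREN_RE = re.compile(r"[^\S\r\n]*\([^()]*\)")


def strip_parenthesized(text):
    """Remove "(...)" groups, then trim each line (hypothetical helper)."""
    without_groups = PAREN_RE.sub("", text)
    return "\n".join(line.strip() for line in without_groups.splitlines())


sample = (
    "(a) method in/to one's madness\n"
    " (at) any moment (now) \n"
    "factors (in or into or out) \n"
    "(as) tough as (old boot's)"
)
print(strip_parenthesized(sample))
```

Because the whitespace class excludes \r and \n, the substitution itself never joins two lines; the per-line strip only cleans up what is left at the edges.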

You can go with /\(([^)]+)\)/g
So basically:
\(: matches the opening parenthesis
([^)]+): captures one or more characters other than )
\): matches the closing parenthesis
/g: all matches
Use re.sub(...) to replace all the regex matches.
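The /.../g notation is JavaScript-style; Python's re module has no g flag, and re.sub already replaces every match unless a count is given. A minimal sketch of this pattern in Python, using two sample lines from the question, also shows that it leaves the neighbouring space behind:

```python
import re

# Same pattern as above, written as a Python raw string (no /.../g delimiters).
pattern = r"\(([^)]+)\)"

# re.sub replaces every match by default.
print(repr(re.sub(pattern, "", "all mouth (and no trousers/action)")))
print(repr(re.sub(pattern, "", "(it's a) small world")))

# [^)]+ needs at least one character, so an empty "()" is left untouched.
print(repr(re.sub(pattern, "", "a () b")))
```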

Related

regex python catch selective content inside curly braces, including curly sublevels and \n chars

The best explanation is a minimal representative example (as those who know LaTeX will see, it is modeled on a .bib file). Here is the representative raw input text:
text = """
#book{book1,
title={tit1},
author={aut1}
}
#article{art2,
title={tit2},
author={aut2}
}
#article{art3,
title={tit3},
author={aut3}
}
"""
and here is my (failed) attempt to extract the content inside the curly braces for the #article fields only. Note that there are \n line breaks inside that I also want to capture.
regexpresion = r'\#article\{[.*\n]+\}'
result = re.findall(regexpresion, text)
and this is actually what I wanted to obtain,
>>> result
['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']
Many thanks for your experience
You might use a 2 step approach, first matching the parts that start with #article, and then in the second step remove the parts that you don't want in the result.
The pattern to match all the parts:
^#article{.*(?:\n(?!#\w+{).*)+(?=\n}$)
Explanation
^ Start of the line (the pattern is used with re.M)
#article{.* Match #article{ and the rest of the line
(?: Non capture group
\n(?!#\w+{).* Match a newline and the rest of the line if it does not start with # 1+ word chars and {
)+ Close the non capture group and repeat it to match all lines
(?=\n}$) Positive lookahead to assert a newline and } at the end of a line
The pattern used in the replacement step matches either #article{ or (using the pipe char |) one or more spaces after a newline.
#article{|(?<=\n)[^\S\n]+
Example
import re
pattern = r"^#article{.*(?:\n(?!#\w+{).*)+(?=\n}$)"
s = ("#book{book1,\n"
" title={tit1},\n"
" author={aut1}\n"
"}\n"
"#article{art2,\n"
" title={tit2},\n"
" author={aut2}\n"
"}\n"
"#article{art3,\n"
" title={tit3},\n"
" author={aut3}\n"
"}")
res = [re.sub(r"#article{|(?<=\n)[^\S\n]+", "", m) for m in re.findall(pattern, s, re.M)]
print(res)
Output
['art2,\ntitle={tit2},\nauthor={aut2}', 'art3,\ntitle={tit3},\nauthor={aut3}']
Try this :
results = re.findall(r'{(.*?)}', text)
the output is following :
['tit1', 'aut1', 'tit2', 'aut2', 'tit3', 'aut3']
Here is my own solution. It's basic and not very elegant.
regexpression = r'\#article\{\w+,\n\s+\w+\=\{.*?\},\n\s+\w+\=\{.*?\}'
Explanatory breakdown of regexpression:
r'\#article\{\w+,\n # catches the article field, 1st line
\s+\w+\=\{.*?\},\n # title sub-field, comma, new line,
\s+\w+\=\{.*?\} # author sub-field
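The commented breakdown above is close to what re.VERBOSE allows directly. The following is a sketch under two assumptions: the sub-fields are indented (the pattern requires \s+ after each newline), and a capture group, which is not part of the original pattern, is added so that findall returns only the text after #article{. The final clean-up of the indentation is likewise an addition for illustration.

```python
import re

text = (
    "#book{book1,\n"
    "  title={tit1},\n"
    "  author={aut1}\n"
    "}\n"
    "#article{art2,\n"
    "  title={tit2},\n"
    "  author={aut2}\n"
    "}\n"
    "#article{art3,\n"
    "  title={tit3},\n"
    "  author={aut3}\n"
    "}"
)

# In verbose mode, whitespace in the pattern is ignored and "#" starts a
# comment, which is why the literal "#" of "#article" must stay escaped.
pattern = re.compile(
    r"""
    \#article\{(\w+,\n      # article key, first line; group 1 opens after the brace
    \s+\w+=\{.*?\},\n       # title sub-field, comma, newline
    \s+\w+=\{.*?\})         # author sub-field; group 1 closes here
    """,
    re.VERBOSE,
)

matches = pattern.findall(text)
# Drop the indentation that follows each newline inside a match.
cleaned = [re.sub(r"\n[^\S\n]+", "\n", m) for m in matches]
print(cleaned)
```

This variant only handles entries with exactly two sub-fields, the same limitation as the pattern it is based on.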

How can I remove a specific character from multi line string using regex in python

I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is remove the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
Simply put, if a line starts with a ':', I'm trying to ignore that character.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
I can't see my mistake. Could anyone please tell me where I am going wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
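If the goal is simply to drop a colon that starts a line, an alternative sketch (not the pattern explained above) anchors on the line start with re.MULTILINE. It assumes that any line beginning with optional spaces or tabs followed by a colon should lose that prefix; the sample string is a simplified stand-in for st.

```python
import re

st = "emp:firstinfo\n   :secondinfo\n   thirdinfo\n"

# ^ matches at every line start under re.MULTILINE; [^\S\n]* consumes
# leading spaces/tabs but never a newline, so blank lines are preserved.
result = re.sub(r"^[^\S\n]*:", "", st, flags=re.MULTILINE)
print(result)
```

The colon in emp:firstinfo is untouched because it is not at the start of its line.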
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
print(result)
# import the regex library
import re
# remove every lowercase letter by replacing it with an empty string
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)

regex capture info in text file after multiple blank lines

I open a complex text file in python, match everything else I need with regex but am stuck with one search.
I want to capture the numbers after the 'start after here' line. The blank line between the two rows is important, and I plan to split on it later.
start after here: test
5.7,-9.0,6.2
1.6,3.79,3.3
Code:
text = open(r"file.txt","r")
for line in text:
    find = re.findall(r"start after here:[\s]\D+.+", line)
I tried this here https://regexr.com/ and it seems to work but it is for Java.
It doesn't find anything. I assume this is because I need to handle multiple lines, but I'm unsure how to read the file differently or how to incorporate that. I have tried many adjustments to the regex without success.
import re
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
m = re.search(r'start after here:([^\n])+\n+(.*)', test_str)
new_str = m[2]
m = re.search(r'(-?\d*\.\d*,?\s*)+', new_str)
print(m[0])
The pattern start after here:[\s]\D+.+ matches the literal words and then a whitespace char using [\s] (you can omit the brackets).
Then one or more non-digit characters are matched, which runs up to just before 5.7. Then one or more characters other than a newline are matched, which gives 5.7,-9.0,6.2. The following empty line and the next line are not matched.
One option could be to match your string and then, in a capturing group, match all the following lines that start with a decimal.
\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*
The values including the empty line are in the first capturing group.
For example
import re
regex = r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
test_str = ("start after here: test\n\n\n"
"5.7,-9.0,6.2\n\n"
"1.6,3.79,3.3\n")
matches = re.findall(regex, test_str)
print(matches)
Result
['5.7,-9.0,6.2\n\n1.6,3.79,3.3']
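The loop in the question fails because it applies the regex to one line at a time, so a match that spans several lines is never seen. Below is a sketch of the same pattern applied to the whole text at once; the helper name block_after_marker is hypothetical, and for a real file the text would come from reading the file in one go rather than iterating over its lines.

```python
import re

# Pattern from the answer above, unchanged.
BLOCK_RE = re.compile(
    r"\bstart after here:.*[\r\n]+(\d+\.\d+.*(?:[\r\n]+[ \t]*\d+\.\d+.*)*).*"
)


def block_after_marker(text):
    """Return the numeric block after the marker line, or None (hypothetical helper)."""
    match = BLOCK_RE.search(text)
    return match.group(1) if match else None


# Stand-in for the file contents; with a real file, read it in one go:
#     with open("file.txt") as f:
#         text = f.read()
text = "start after here: test\n\n\n5.7,-9.0,6.2\n\n1.6,3.79,3.3\n"
print(repr(block_after_marker(text)))
```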
If you want to match the decimals (or just one or more digits) before the comma you might split on 1 or more newlines and use:
[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)
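A sketch of that last step, assuming the block captured in the example above: split on runs of newlines, then collect the numbers in each row with the pattern just shown. The variable names are illustrative.

```python
import re

NUMBER_RE = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?=,|$)")

block = "5.7,-9.0,6.2\n\n1.6,3.79,3.3"

# Split on one or more newlines so the blank line does not produce an empty row.
rows = [NUMBER_RE.findall(row) for row in re.split(r"[\r\n]+", block)]
print(rows)
```

The values stay strings here; converting them with float is a separate step if numbers are needed.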

How to use join and regex?

I'm trying to add \n after the quotation mark (") and space.
The closest that I could find is re.sub, however it removes certain characters.
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
q = re.sub(r'[\d\w]" ', '\n', line)
print(q)
Output:
Type: "SecurityInciden\nRowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F2\n
Looking for a solution that does not remove any characters.
Your attempted regex [\d\w]" (followed by a space) is almost fine but has a few shortcomings. You don't need \d alongside \w in a character class, because \w already includes the digits. And since \w on its own matches a letter, digit or underscore, the brackets are unnecessary too, so the regex simplifies to \w" followed by a space.
But if you match this regex and substitute \n, the match consists of the letter t, the " and the space, and all three are replaced by \n, which is why you are getting this output,
SecurityInciden\nRowID
You need to capture the part you want to keep in group 1 and reuse it in the replacement so that it is not lost, hence you should use \1\n as the replacement instead of just \n
Try this updated regex,
(\w" )
And replace it by \1\n
If you notice, there is an extra space at the end of the first line. If you don't want that space there, you can move it outside the capturing parentheses and use the regex r'(\w") ' (note the space after the closing parenthesis).
Here is a sample python code,
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
q = re.sub(r'(\w") ', r'\1\n', line)
print(q)
Output,
Type: "SecurityIncident"
RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"
Try this:
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
pattern = re.compile(r'(\w+): (".+?"\s?)', re.IGNORECASE)
q = re.sub(pattern, r'\g<1>: \g<2>\n', line)
print(repr(q))
It should give you the following result:
'Type: "SecurityIncident" \nRowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"\n'
In your regex you are removing the t from incident because you are matching it and not using it in the replacement.
Another option to get your result might be to split on a double quote followed by a space when it is preceded by a word character, using a positive lookbehind.
Then join the result back together using a newline.
The pattern is r'(?<=\w)" ' (note the trailing space).
For example:
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
print("\n".join(re.split(r'(?<=\w)" ', line)))
Result
Type: "SecurityIncident
RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"
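Note that in the result above the closing quote after SecurityIncident is gone, because the split consumes it. A sketch that keeps it, assuming it is acceptable to move the quote into a fixed-width lookbehind so that only the space is matched and replaced:

```python
import re

line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'

# The lookbehind asserts a word character and a quote before the space, so
# only the space itself is matched and replaced by the newline.
result = re.sub(r'(?<=\w") ', "\n", line)
print(result)
```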

How to capture the word with space around without capturing the space?

I've got a string like this: s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld ". The idea is to catch all the hashtags, but each hashtag needs a boundary around it; e.g. the one in "Hello this is helloworld#helloworld" won't be captured.
I want to generate the following result: ["#helloworld","#hiworld","#nihaoworld"]
I've got the following python code
import re
print(re.findall(r'(?:^|\s+)(#[a-z]{1,})(?:\s+|$)', s))
The result I got is ["#helloworld","#nihaoworld"], with the middle hashtag missing.
I don't think you really need a regular expression for this; if the string contains nothing but whitespace-separated hashtags, you can just use:
s.strip().split()
However, if you do want to use a regex, you could just use (?:^|\s)(#\w+):
>>> import re
>>> s = " #helloworld #hiworld #nihaoworld "
>>> re.findall(r'(?:^|\s)(#\w+)', s)
['#helloworld', '#hiworld', '#nihaoworld']
Explanation
Non-capturing group (?:^|\s)
1st Alternative ^
^ asserts position at start of the string
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (#\w+)
# matches the character # literally (case sensitive)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
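The original attempt loses the middle hashtag because its trailing \s+ consumes the whitespace that the next match would need as its leading boundary. Below is a sketch that checks both boundaries without consuming them, using lookarounds; this variant is an alternative, not the pattern explained above.

```python
import re

# (?<!\S) : not preceded by a non-space character (start of string or whitespace)
# (?!\S)  : not followed by a non-space character (end of string or whitespace)
HASHTAG_RE = re.compile(r"(?<!\S)#\w+(?!\S)")

s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld "
print(HASHTAG_RE.findall(s))
print(HASHTAG_RE.findall("Hello this is helloworld#helloworld"))
```

Since nothing around the hashtag is consumed, no capture group is needed and adjacent hashtags separated by a single space are all found.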
