Regex to ignore data between brackets - python

I replace the characters ", { and } with an empty string using the code below:
import re
s = "\":{},"
print(s)
print(re.sub(r'\"|{|}', "", s))
prints:
":{},
:,
which is expected.
I'm attempting to modify the regex to ignore everything between opening and closing square brackets. So for the string "\":{},[test,test2]" just :,[test,test2] should be returned.
How can I modify the regex so that it is not applied to the data between [ and ]?
I tried using:
s = "\":{},[test1, test2]"
print(s)
print(re.sub(r'[^a-zA-Z {}]+\"|{|}' , "",s))
(src: How to let regex ignore everything between brackets?)
None of the , characters are replaced.

Assuming your brackets are balanced/unescaped, you may use this regex with a negative lookahead to assert that the matched character is not inside [...]:
>>> import re
>>> s = "\":{},[test1,test2]"
>>> print (re.sub(r'[{}",](?![^[]*\])', '', s))
:[test1,test2]
RegEx Details:
[{}",]: Match one of the characters listed in the character class
(?![^[]*\]): Negative lookahead to assert that there is no ] ahead without a [ in between; in other words, the matched character is not inside [...]

If you want to remove the {, }, , and " not inside square brackets, you can use
re.sub(r'(\[[^][]*])|[{}",]', r'\1', s)
Note you can add more chars to the character set, [{}",]. If you need to add a hyphen, make sure it is the last char in the character set. Escape \, ] (unless it comes first, right after [) and ^ (if it comes first, right after [).
Details:
(\[[^][]*]) - Capturing group 1: a [...] substring
| - or
[{}",] - a {, }, , or " char.
See a Python demo using your sample input:
import re
s = "\":{},[test1, test2]"
print( re.sub(r'(\[[^][]*])|[{}",]', r'\1', s) )
## => :[test1, test2]
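Both answers give the same result on the sample input, but they behave differently once the brackets are unbalanced, which is the assumption the first answer states up front. A minimal sketch comparing the two patterns; the helper names and the second input ({a},b]) are made up for illustration:

```python
import re

# Pattern from the first answer: negative lookahead
LOOKAHEAD = re.compile(r'[{}",](?![^[]*\])')
# Pattern from the second answer: capture [...] and put it back
CAPTURE = re.compile(r'(\[[^][]*])|[{}",]')

def strip_with_lookahead(text):
    return LOOKAHEAD.sub('', text)

def strip_with_capture(text):
    # \1 is an empty string when group 1 did not take part in the match
    # (Python 3.5+)
    return CAPTURE.sub(r'\1', text)

balanced = '":{},[test1,test2]'
unbalanced = '{a},b]'  # hypothetical input with a stray ]

print(strip_with_lookahead(balanced))    # :[test1,test2]
print(strip_with_capture(balanced))      # :[test1,test2]
print(strip_with_lookahead(unbalanced))  # {a},b]
print(strip_with_capture(unbalanced))    # ab]
```

With the stray ], the lookahead treats every character before it as "inside brackets" and removes nothing, while the capture-group pattern only protects text that actually starts with [.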

Related

Extracting the text after the initial substrings between square brackets

I would like to extract the substring from the string, such as
Case 1:
text = "some_txt" # → some_txt
Case 2:
text = "[info1]some_txt" # → some_txt
Case 3:
text = "[info1][info2] some_text" # → some_text
Case 4:
text = "[info1][info2] some_text_with_[___]_abc" # → some_text_with_[___]_abc
What I did was
match = re.search(r"^\[.+\] (.*)", text)
if match:
    result = match.group(1)
It works okay except case 4, which gives abc only. I want to get some_text_with_[___]_abc instead.
Any help will be greatly appreciated.
With your current code, you can use
r"^(?:\[[^][]+](?:\s*\[[^][]+])*)?\s*(.*)"
If you are not actually interested in whether there is a match or not, you may use re.sub to remove these bracketed substrings from the start of the string using
re.sub(r'^\[[^][]+](?:\s*\[[^][]+])*\s*', '', text)
Regex details
^ - start of string
(?:\[[^][]+](?:\s*\[[^][]+])*)? - an optional occurrence of
  \[[^][]+] - a [, then any one or more chars other than [ and ] as many as possible and then a ]
  (?:\s*\[[^][]+])* - zero or more occurrences of zero or more whitespaces and then a [, then any one or more chars other than [ and ] as many as possible and then a ]
\s* - zero or more whitespaces
(.*) - Group 1: any zero or more chars other than line break chars, as many as possible.
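As a rough check, both patterns from this answer can be run against the four cases from the question. The names PREFIX_CAPTURE, PREFIX_STRIP and results are illustrative only:

```python
import re

# Capturing variant: the leading bracket groups are optional, group 1 is the rest
PREFIX_CAPTURE = re.compile(r"^(?:\[[^][]+](?:\s*\[[^][]+])*)?\s*(.*)")
# Removal variant: strip the leading bracket groups and trailing whitespace
PREFIX_STRIP = re.compile(r"^\[[^][]+](?:\s*\[[^][]+])*\s*")

cases = [
    "some_txt",
    "[info1]some_txt",
    "[info1][info2] some_text",
    "[info1][info2] some_text_with_[___]_abc",
]

results = []
for text in cases:
    # Every part of PREFIX_CAPTURE is optional, so search() always
    # finds a match on these single-line inputs
    by_search = PREFIX_CAPTURE.search(text).group(1)
    by_sub = PREFIX_STRIP.sub("", text)
    results.append((by_search, by_sub))
    print(f"{text!r} -> {by_search!r} / {by_sub!r}")
```

Both variants should agree on every case, including case 4, because [^][]+ cannot run past the first closing bracket of each prefix group.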

Python, RegEx, Replace a certain part of a match

I am trying to replace a certain part of a match that a regex found.
The relevant strings have the following format:
"<Random text>[Text1;Text2;....;TextN]<Random text>"
So basically there can be N Texts separated by a ";" inside the brackets.
My goal is to change the ";" into a "," (but only for the strings which are in this format) so that I can keep the ";" as a separator for a CSV file. So the result should be:
"<Random text>[Text1,Text2,...,TextN]<Random text>"
I can match the relevant strings with something like
re.compile(r'\[".*?((;).*?){1,4}"\]')
but if I try to use the sub method it replaces the whole string.
I have searched Stack Overflow and I am pretty sure that "capture groups" might be the solution, but I am not really getting there.
Can anyone help me?
I ONLY want to change the ";" in the ["Text1;...;TextN"]-parts of my text file.
Try this regex:
;(?=(?:(?!\[).)*])
Replace each match with a ,
Explanation:
; - matches a ;
(?=(?:(?!\[).)*]) - makes sure that the above ; is followed by a closing ] somewhere later in the string but before any opening bracket [
(?=....) - positive lookahead
(?:(?!\[).)* - 0+ occurrences of any character that is not [
] - matches a ]
If you want to match a ; before a closing ] and not matching [ in between you could use:
;(?=[^[]*])
; Match literally
(?= Positive lookahead, assert what is on the right is
[^[]* Negated character class, match 0+ times any char except [
] Match literally
) Close lookahead
Note that this will also match if there is no leading [
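A small sketch of that caveat, using only the standard re module. The sample strings are invented, and the callback-based replace_in_brackets helper is an alternative not proposed in the answers, shown on the assumption that brackets are not nested:

```python
import re

LOOKAHEAD = re.compile(r';(?=[^[]*])')
BRACKETED = re.compile(r'\[[^][]*]')

def replace_in_brackets(text):
    # Match each complete [...] group, then swap ; for , inside it only
    return BRACKETED.sub(lambda m: m.group(0).replace(';', ','), text)

with_brackets = 'a;b["T1;T2;T3"]c;d'
stray_close = 'x;y]z'  # hypothetical input with no leading [

print(LOOKAHEAD.sub(',', with_brackets))   # a;b["T1,T2,T3"]c;d
print(LOOKAHEAD.sub(',', stray_close))     # x,y]z
print(replace_in_brackets(with_brackets))  # a;b["T1,T2,T3"]c;d
print(replace_in_brackets(stray_close))    # x;y]z
```

The lookahead pattern rewrites the ; in front of the stray ], while the callback version leaves it alone because it only acts on text that starts with [ and ends with ].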
If you also want to make sure that there is a leading [, you could make use of the PyPI regex module and use \G and \K to match a single ;
(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;
import regex
pattern = r"(?:\[(?=[^[\]]*])|\G(?!^))[^;[\]]*\K;"
test_str = ("[\"Text1;Text2;....;TextN\"];asjkdjksd;ajksdjksad[\"Text1;Text2;....;TextN\"]\n\n"
".[\"Text1;Text2\"]...long text...[\"Text1;Text2;Text3\"]....long text...[\"Text1;...;TextN\"]...long text...\n\n"
"I ONLY want to change the \";\" in the [\"Text1;...;TextN\"]")
result = regex.sub(pattern, ",", test_str)
print (result)
Output
["Text1,Text2,....,TextN"];asjkdjksd;ajksdjksad["Text1,Text2,....,TextN"]
.["Text1,Text2"]...long text...["Text1,Text2,Text3"]....long text...["Text1,...,TextN"]...long text...
I ONLY want to change the ";" in the ["Text1,...,TextN"]
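You can try this code sample, which walks the string character by character instead of using a regex:
x = 'anbhb["Text1;Text2;...;TextN"]nbgbyhuyg["Text1;Text2;...;TextN"][]nhj,kji,'
chars = list(x)
i = 0
while i < len(chars) - 1:
    # A quoted group starts with ["
    if chars[i] == '[' and chars[i + 1] == '"':
        i += 2
        # Replace every ; up to the closing quote
        while i < len(chars) and chars[i] != '"':
            if chars[i] == ';':
                chars[i] = ','
            i += 1
    i += 1
x = ''.join(chars)
print(x)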

How to match and replace this pattern in Python RE?

s = "[abc]abx[abc]b"
s = re.sub(r"\[([^\]]*)\]a", "ABC", s)
'ABCbx[abc]b'
In the string s, I want to match 'abc' when it is enclosed in [] and followed by an 'a'. So in that string, the first [abc] will be replaced, and the second won't.
I wrote the pattern above; it matches:
anything starting with a '[', followed by any number of characters that are not ']', then a ']', then the character 'a'.
However, after the replacement, I want the string to be:
[ABC]abx[abc]b  # NOT ABCbx[abc]b
Namely, I don't want the whole match to be replaced, only the text within the brackets []. How can I achieve that?
match.group(1) will return the content in []. But how can I take advantage of this in re.sub?
Why not simply include [ and ] in the substitution?
s = re.sub(r"\[([^\]]*)\]a", "[ABC]a", s)
There is more than one way to do this; one of them is to exploit groups.
import re
s = "[abc]abx[abc]b"
out = re.sub(r'(\[)([^\]]*)(\]a)', r'\1ABC\3', s)
print(out)
Output:
[ABC]abx[abc]b
Note that there are 3 groups (enclosed in parentheses) in the first argument of re.sub. I refer to the 1st and 3rd (indexing starts at 1) so they remain unchanged, and put ABC in place of the 2nd group. The second argument of re.sub is a raw string, so I do not need to escape \.
This regex uses lookarounds for the prefix/suffix assertions, so that the match text itself is only "abc":
(?<=\[)[^]]*(?=\]a)
Example: https://regex101.com/r/NDlhZf/1
So that's:
(?<=\[) - positive look-behind, asserting that a literal [ is directly before the start of the match
[^]]* - any number of non-] characters (the actual match)
(?=\]a) - positive look-ahead, asserting that the text ]a directly follows the match text.
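A short sketch of how this lookaround pattern plugs into re.sub, using the string from the question. The variable names are illustrative:

```python
import re

s = "[abc]abx[abc]b"

# Only the text between [ and ]a is part of the match; the brackets
# and the trailing a are asserted, not consumed
inside_brackets_before_a = re.compile(r'(?<=\[)[^]]*(?=\]a)')

matched = inside_brackets_before_a.findall(s)
out = inside_brackets_before_a.sub('ABC', s)

print(matched)  # ['abc']
print(out)      # [ABC]abx[abc]b
```

Because the brackets are never consumed, the replacement string does not need to repeat them or use group references.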

A way to match a SSHA hash using a regular expression

I'm trying to match four hashes that look like this:
{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=
{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=
I've successfully matched the first two with this regular expression: \D{5,}[a-zA-Z0-9]\w+\(?= however I am unable to get a full match on the third or the fourth one. What is a better regular expression to match the given hashes?
Note that \D{5,} matches 5 or more non-digit chars, then [a-zA-Z0-9] matches an ASCII letter or digit and \w+ matches 1+ letters/digits/_. So the pattern fails if the string contains - or /, or if one of the first 5 chars is a digit.
I suggest the following pattern:
\{[^{}]*}[a-zA-Z0-9][\w/-]+=?
It matches:
\{[^{}]*} - a {, then 0+ chars other than { and } and then } (note you can make it more precise: \{\w+} to match {, 1 or more letters/digits/_, and then }, or even \{(?:SS?HA|MD5)} to match SHA, SSHA or MD5 enclosed with {...})
[a-zA-Z0-9] - an ASCII letter or digit
[\w/-]+ - 1 or more word chars (letters, digits or _), / or - chars
=? - an optional = symbol (1 or 0 occurrences; the greedy ? quantifier makes it match a = if one is present).
Python demo:
import re
s = """
TEXT {SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4= and some more
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c text here
{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c= maybe."""
rx = r"\{[^{}]*}[a-zA-Z0-9][\w/-]+=?"
print(re.findall(rx, s))
# => ['{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=', '{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=', '{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c', '{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c=']
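The tighter scheme prefix mentioned in the pattern breakdown, \{(?:SS?HA|MD5)}, can be sketched like this. The use of fullmatch, the name STRICT_HASH and the rejected {WORD} sample are assumptions made for the illustration:

```python
import re

# Same body as the suggested pattern, but the scheme name is
# restricted to SHA, SSHA or MD5
STRICT_HASH = re.compile(r"\{(?:SS?HA|MD5)}[a-zA-Z0-9][\w/-]+=?")

samples = [
    "{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=",
    "{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=",
    "{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
    "{WORD}stuff=",  # hypothetical non-hash that should be rejected
]

# fullmatch requires the whole string to be a hash, not just a part of it
flags = [bool(STRICT_HASH.fullmatch(text)) for text in samples]
print(flags)  # [True, True, True, True, False]
```

Use findall or search instead of fullmatch when the hashes are embedded in longer text, as in the demo above.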
I would suggest something along these lines:
\{[SHAMD5]{3,4}\}[^=]+=?
It will match a {, then 3 or 4 characters drawn from the letters and digits in the scheme names you listed (S, H, A, M, D, 5). You can change that to [A-Z0-9] to broaden it, but I like to keep it tighter to start. Then a }. Then all (at least 1) non = characters. Ending with an optional = character. Here is my Python demo:
import re
textlist = [
    "{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=",
    "{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=",
    "{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c",
    "test for break below",
    "{WORD}stuff=",
    "{MD55/DNVWwyafo-pIEaHNhv39sSN7c=",
    "MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
]
for text in textlist:
    if re.search(r"\{[SHAMD5]{3,4}\}[^=]+=?", text):
        print("match")
    else:
        print("no soup for you")
Note the end of the list has a few tests to make sure the regex doesn't just succeed on anything random.

Get inside of square brackets in Python

I have this string.
"ascascascasc[xx]asdasdasdasd[yy]qweqweqwe"
I want to get strings inside brackets. Like this;
"xx", "yy"
I have tried this but it did not work:
a = "ascascascasc[xx]asdasdasdasd[yy]qweqweqwe"
listinside = []
for i in range(a.count("[")):
    listinside.append(a[a.index("["):a.index("]")])
print (listinside)
Output:
['[xx', '[xx']
You don't need count; re.findall() can do it with a regex:
>>> s="ascascascasc[xx]asdasdasdasd[yy]qweqweqwe"
>>> import re
>>> re.findall(r'\[(.*?)\]',s)
['xx', 'yy']
\[ matches the character [ literally
(.*?) captures any characters, between zero and unlimited times, as few as possible (lazy)
\] matches the character ] literally
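For comparison, the loop in the question fails because a.index("[") and a.index("]") always return the first occurrence, and the slice starts at the bracket itself. A minimal non-regex sketch that fixes both points by passing a start offset to str.find; the function name between_brackets is made up here:

```python
def between_brackets(text):
    """Collect the text inside each [...] pair, scanning left to right."""
    found = []
    start = text.find("[")
    while start != -1:
        end = text.find("]", start + 1)
        if end == -1:
            break  # unmatched [, nothing more to collect
        found.append(text[start + 1:end])
        # Continue searching after the closing bracket
        start = text.find("[", end + 1)
    return found

a = "ascascascasc[xx]asdasdasdasd[yy]qweqweqwe"
print(between_brackets(a))  # ['xx', 'yy']
```

The regex version is shorter, but this form makes the "search from where the last match ended" step explicit.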
