Extracting the text after the initial substrings between square brackets - python

I would like to extract the substring from the string, such as
Case 1:
text = "some_txt" # → some_txt
Case 2:
text = "[info1]some_txt" # → some_txt
Case 3:
text = "[info1][info2] some_text" # → some_text
Case 4:
text = "[info1][info2] some_text_with_[___]_abc" # → some_text_with_[___]_abc
What I did was
match = re.search(r"^\[.+\] (.*)", text)
if match:
    result = match.group(1)
It works okay except case 4, which gives abc only. I want to get some_text_with_[___]_abc instead.
Any help will be greatly appreciated.

With your current code, you can use
r"^(?:\[[^][]+](?:\s*\[[^][]+])*)?\s*(.*)"
See the regex demo.
If you are not actually interested in whether there is a match or not, you may use re.sub to remove these bracketed substrings from the start of the string using
re.sub(r'^\[[^][]+](?:\s*\[[^][]+])*\s*', '', text)
See another regex demo.
Regex details
^ - start of string
(?:\[[^][]+](?:\s*\[[^][]+])*)? - an optional occurrence of
\[[^][]+] - a [, then any one or more chars other than [ and ] as many as possible and then a ]
(?:\s*\[[^][]+])* - zero or more occurrences of zero or more whitespaces and then a [, then any one or more chars other than [ and ] as many as possible and then a ]
\s* - zero or more whitespaces
(.*) - Group 1: any zero or more chars other than line break chars, as many as possible.
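As a quick check, the sketch below runs both suggestions against the four cases from the question. The helper names strip_prefix_search and strip_prefix_sub are made up for this illustration and are not part of the original answer.

```python
import re

# Optional run of leading [...] groups, optional whitespace, then the rest.
PREFIX_AND_REST = re.compile(r"^(?:\[[^][]+](?:\s*\[[^][]+])*)?\s*(.*)")
# The same leading run, used with sub() to delete it.
PREFIX_ONLY = re.compile(r"^\[[^][]+](?:\s*\[[^][]+])*\s*")


def strip_prefix_search(text):
    """Hypothetical helper: return Group 1 of the re.search approach."""
    match = PREFIX_AND_REST.search(text)
    return match.group(1) if match else text


def strip_prefix_sub(text):
    """Hypothetical helper: delete the leading bracketed run with re.sub."""
    return PREFIX_ONLY.sub("", text)


cases = [
    "some_txt",
    "[info1]some_txt",
    "[info1][info2] some_text",
    "[info1][info2] some_text_with_[___]_abc",
]
for case in cases:
    print(repr(strip_prefix_search(case)), repr(strip_prefix_sub(case)))
```

Both helpers should keep the [___] in case 4, because [^][]+ can only consume bracketed groups at the very start of the string.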

Related

How to use "?" in regular expression to change a qualifier to be non-greedy and find a string in the middle of the data? [duplicate]

I have a text like this:
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex:
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can I match only [Some Text][2]?
Note: Some Text can contain any character, including [ and ], and the numbers in square brackets can be any number, not only 1 and 2. The Some Text that I want to match can be at the beginning of the line, and there can be multiple Some Texts.
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note the lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
See this regex demo.
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches {...} with no { and } inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
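The difference between the lazy dot, the negated character class and the tempered greedy token can be checked with a short sketch. The question is about a JavaScript regex, so running the same patterns with Python's re module is an assumption made for illustration; these particular constructs behave the same way in both flavors as far as I know.

```python
import re

text = "[Some Text][1][Some Text][2][Some Text][3][Some Text][4]"

# Lazy dot: starts at the leftmost "[" and expands until "][2]" is reached.
lazy = re.search(r"\[.*?\]\[2\]", text).group()

# Negated character class: cannot cross a "[" or "]" boundary.
negated = re.search(r"\[[^\]\[]*\]\[2\]", text).group()

# Tempered greedy token: no consumed char may start another "abc".
tempered = re.search(r"abc(?:(?!abc).)*?def", "abc 0 abc 1 def").group()

print(lazy)      # [Some Text][1][Some Text][2]
print(negated)   # [Some Text][2]
print(tempered)  # abc 1 def
```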
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])

How to match and replace everything after a specific word until it reaches comma in a list of strings using python?

I have a DataFrame with list of strings as below
df
text
,info_concern_blue,replaced_mod,replaced_rad
,info_concern,info_concern_red,replaced_unit
,replaced_link
I want to replace every item that starts with info_concern (e.g. info_concern_blue or info_concern_red) with plain info_concern, matching up to the next comma.
I tried the following regex:
df['replaced_text'] = [re.sub(r'info_concern[^,]*.+?,', 'info_concern,', x)
                       for x in df['text']]
But this is giving me incorrect results.
Desired output:
replaced_text
,info_concern,replaced_mod,replaced_rad
,info_concern,info_concern,replaced_unit
,replaced_link
Please suggest/advise.
You can use
df['replaced_text'] = df['text'].str.replace(r'(info_concern)[^,]*', r'\1', regex=True)
See the regex demo.
If you want to make sure the match starts right after a comma or start of string, add the (?<![^,]) lookbehind at the start of the pattern:
df['replaced_text'] = df['text'].str.replace(r'(?<![^,])(info_concern)[^,]*', r'\1', regex=True)
See this regex demo. Details:
(?<![^,]) - right before, there should be either , or start of string
(info_concern) - Group 1: info_concern string
[^,]* - zero or more chars other than a comma.
The \1 replacement replaces the match with Group 1 value.
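To see what the lookbehind changes, here is a small sketch that applies both patterns with plain re.sub instead of pandas. The names PLAIN and ANCHORED and the tricky sample string are assumptions added for illustration; they do not come from the question.

```python
import re

PLAIN = r"(info_concern)[^,]*"
ANCHORED = r"(?<![^,])(info_concern)[^,]*"

rows = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link",
]
plain_rows = [re.sub(PLAIN, r"\1", row) for row in rows]
anchored_rows = [re.sub(ANCHORED, r"\1", row) for row in rows]

# Hypothetical row where info_concern also occurs in the middle of an item.
tricky = "x_info_concern_blue,info_concern_red"
plain_tricky = re.sub(PLAIN, r"\1", tricky)
anchored_tricky = re.sub(ANCHORED, r"\1", tricky)

print(plain_rows == anchored_rows)  # same result on the question's data
print(plain_tricky)                 # x_info_concern,info_concern
print(anchored_tricky)              # x_info_concern_blue,info_concern
```

On the sample data both patterns agree; they only differ when info_concern is not at the start of a comma-separated item.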
The issue is that in the pattern info_concern[^,]*.+?, the [^,]* part matches up to, but not including, the first comma.
Then the .+?, part matches at least one character (which can itself be a comma, since . matches it) and continues to the next comma.
So if there is a second comma, the pattern overmatches and removes too much.
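The overmatch is easy to reproduce with the pattern from the question. This is only a sketch using plain re.sub on the three sample rows; the constant name QUESTION_PATTERN is made up here.

```python
import re

QUESTION_PATTERN = r"info_concern[^,]*.+?,"

rows = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link",
]
results = [re.sub(QUESTION_PATTERN, "info_concern,", row) for row in rows]

for row, result in zip(rows, results):
    print(row, "->", result)
# ,info_concern_blue,replaced_mod,replaced_rad -> ,info_concern,replaced_rad
# ,info_concern,info_concern_red,replaced_unit -> ,info_concern,replaced_unit
# ,replaced_link -> ,replaced_link
```

In the first two rows the item after info_concern is swallowed, because .+? consumes the first comma and then runs on to the second one.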
You could also assert info_concern to the left and match any chars other than a comma, replacing them with an empty string.
If there has to be a comma to the right, you can assert it.
(?<=\binfo_concern)[^,]*(?=,)
The pattern matches:
(?<=\binfo_concern) Positive lookbehind, assert info_concern to the left
[^,]* Match 0+ times any char except ,
(?=,) Positive lookahead, assert , directly to the right
Regex demo
If the comma is not mandatory, you can omit the lookahead
(?<=\binfo_concern)[^,]*
For example
import pandas as pd
texts = [
    ",info_concern_blue,replaced_mod,replaced_rad",
    ",info_concern,info_concern_red,replaced_unit",
    ",replaced_link"
]
df = pd.DataFrame(texts, columns=["text"])
df['replaced_text'] = df['text'].str.replace(r'(?<=\binfo_concern)[^,]*(?=,)', '', regex=True)
print(df)
Output
                                           text                             replaced_text
0  ,info_concern_blue,replaced_mod,replaced_rad   ,info_concern,replaced_mod,replaced_rad
1  ,info_concern,info_concern_red,replaced_unit  ,info_concern,info_concern,replaced_unit
2                                ,replaced_link                            ,replaced_link
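Whether the trailing lookahead matters depends on whether an info_concern item can be the last one in the string. The sketch below compares the two lookbehind patterns with plain re.sub; the row that ends with info_concern_red is a hypothetical input, not taken from the question.

```python
import re

WITH_COMMA = r"(?<=\binfo_concern)[^,]*(?=,)"
WITHOUT_COMMA = r"(?<=\binfo_concern)[^,]*"

middle = ",info_concern_blue,replaced_mod"  # item followed by a comma
last = ",replaced_mod,info_concern_red"     # hypothetical: item at the very end

middle_with = re.sub(WITH_COMMA, "", middle)
middle_without = re.sub(WITHOUT_COMMA, "", middle)
last_with = re.sub(WITH_COMMA, "", last)
last_without = re.sub(WITHOUT_COMMA, "", last)

print(middle_with, middle_without)  # both: ,info_concern,replaced_mod
print(last_with)                    # unchanged: no comma follows the item
print(last_without)                 # ,replaced_mod,info_concern
```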

Regex to ignore data between brackets

I replace the characters ", { and } with an empty string using the code below.
This code:
s = "\":{},"
print(s)
print(re.sub(r'\"|{|}' , "",s))
prints:
":{},
:,
which is expected.
I'm attempting to modify the regex to ignore everything between open and closed brackets. So for the string "\":{},[test,test2]" just :,[test,test2] should be returned.
How can I modify the regex so that it is not applied to the data between [ and ]?
I tried using:
s = "\":{},[test1, test2]"
print(s)
print(re.sub(r'[^a-zA-Z {}]+\"|{|}' , "",s))
(src: How to let regex ignore everything between brackets?)
None of the , values are replaced.
Assuming your brackets are balanced/unescaped, you may use this regex with a negative lookahead to assert that matched character is not inside [...]:
>>> import re
>>> s = "\":{},[test1,test2]"
>>> print (re.sub(r'[{}",](?![^[]*\])', '', s))
:[test1,test2]
RegEx Demo
RegEx Details:
[{}",]: Match one of the characters listed inside the character class
(?![^[]*\]): Negative lookahead to assert that there is no ] ahead without a [ in between; in other words, the matched character is not inside [...]
If you want to remove the {, }, , and " not inside square brackets, you can use
re.sub(r'(\[[^][]*])|[{}",]', r'\1', s)
See the regex demo. Note you can add more chars to the character set, [{}",]. If you need to add a hyphen, make sure it is the last char in the character set. Escape \, ] (if not the first, right after [) and ^ (if it comes first, right after [).
Details:
(\[[^][]*]) - Capturing group 1: a [...] substring
| - or
[{}",] - a {, }, , or " char.
See a Python demo using your sample input:
import re
s = "\":{},[test1, test2]"
print( re.sub(r'(\[[^][]*])|[{}",]', r'\1', s) )
## => :[test1, test2]
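The two answers above agree on balanced input but can differ when the brackets are not balanced, which is why the first one states that assumption. A minimal sketch, with function names and the unbalanced sample string invented for illustration:

```python
import re


def strip_with_lookahead(s):
    """Remove { } " , unless a ] follows with no [ in between."""
    return re.sub(r'[{}",](?![^[]*\])', '', s)


def strip_with_capture(s):
    """Match whole [...] groups first and put them back via Group 1."""
    return re.sub(r'(\[[^][]*])|[{}",]', r'\1', s)


balanced = '":{},[test1, test2]'
unbalanced = 'a,b],c'  # hypothetical input with a stray ]

print(strip_with_lookahead(balanced), strip_with_capture(balanced))
print(strip_with_lookahead(unbalanced))  # a,b]c  (first comma is kept)
print(strip_with_capture(unbalanced))    # ab]c   (both commas removed)
```

With a stray ], the lookahead version treats everything before it as "inside brackets", while the capture version only protects complete [...] pairs.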

Regular expression for printing integers within brackets

First time ever using regular expressions and I can't get it working, although there are quite a few examples on Stack Overflow already.
How can I extract the integers that are inside brackets in a string?
Example:
dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x
would return
[43, 32, -99]
'+' and '-' are okay if they are at the beginning of the brackets, but not if they are in the middle or at the end. If the '+' sign is at the beginning, it should be dropped (+54 --> 54).
I've been trying:
re.findall('\[[-]?\d+\]',str)
but it's not working the way I want.
If you need to fail the match in [ +-34 ] (i.e. if you needn't extract a negative number if there is a + before it) you will need to use
\[\s*(?:\+|(-))?(\d+)\s*]
and when getting a match, concat the Group 1 and Group 2 values. See this regex demo.
Details
\[ - a [ char
\s* - 0+ whitespaces
(?:\+|(-))? - an optional + char or - char (a - is captured into Group 1)
(\d+) - Capturing group 2: 1+ digits
\s* - 0+ whitespaces
] - a ] char.
In Python,
import re
text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"
numbers_text = [f"{x}{y}" for x, y in re.findall(r'\[\s*(?:\+|(-))?(\d+)\s*]', text)]
numbers = list(map(int, numbers_text))
print(numbers)
# => [43, 32, -99]
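The reason for the alternation becomes visible when the input contains a sign pair such as [ +-34 ]. The sketch below compares it with a simpler variant, \[\s*\+?(-?\d+)\s*], which is an assumption introduced here for contrast; the extra [ +-34 ] item appended to the sample text is also hypothetical.

```python
import re

text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x [ +-34 ]"

SIMPLE = r"\[\s*\+?(-?\d+)\s*]"        # accepts "+" followed by "-"
STRICT = r"\[\s*(?:\+|(-))?(\d+)\s*]"  # at most one sign char

simple_numbers = [int(n) for n in re.findall(SIMPLE, text)]
strict_numbers = [int(sign + digits) for sign, digits in re.findall(STRICT, text)]

print(simple_numbers)  # [43, 32, -99, -34]
print(strict_numbers)  # [43, 32, -99]
```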
If you want to extract integers from a string the code that I use is this:
def stringToNumber(inputStr):
    myNumberList = []
    for s in inputStr.split():
        newString = ''.join(i for i in s if i.isdigit())
        if (len(newString) != 0):
            myNumberList.append(newString)
    return myNumberList
I hope it works for you.
If you've not done so I suggest you switch to the PyPI regex module. Using it here with regex.findall and the following regular expression allows you to extract just what you need.
r'\[ *\+?\K-?\d+(?= *\])'
At the regex tester pass your cursor across the regex for details about individual tokens.
The regex engine performs the following operations.
\[       : match '['
\ *      : match 0+ spaces
\+?      : optionally match '+'
\K       : forget everything matched so far and reset the start of the match to the current position
-?       : optionally match '-'
\d+      : match 1+ digits
(?= *\]) : use positive lookahead to assert that the last digit matched is followed by 0+ spaces then ']'
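The standard library re module does not support \K, so if installing the third-party regex module is not an option, a capturing group can play the same role. This is a sketch of that substitution, not the answer's own code: re.findall returns only the group, which has the same effect as resetting the match start.

```python
import re

text = "dijdi[d43] d5[55++][ 43] [+32]dm dij [ -99]x"

# The group (-?\d+) replaces \K: findall returns just the captured part.
numbers = [int(m) for m in re.findall(r"\[ *\+?(-?\d+)(?= *\])", text)]

print(numbers)  # [43, 32, -99]
```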

A way to match a SSHA hash using a regular expression

I'm trying to match four hashes that look like this:
{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=
{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=
I've successfully matched the first two with this regular expression: \D{5,}[a-zA-Z0-9]\w+\(?= however I am unable to get a full match on the third or the fourth one. What is a better regular expression to match the given hashes?
Note that \D{5,} matches 5 or more non-digit chars, then [a-zA-Z0-9] matches an ASCII letter or digit, and \w+ matches 1+ letters/digits/_. So, if the string contains - or /, or if there is a digit among the first 5 chars, you won't get a full match.
I suggest the following pattern:
\{[^{}]*}[a-zA-Z0-9][\w/-]+=?
See the regex demo.
It matches:
\{[^{}]*} - a {, then 0+ chars other than { and } and then } (note you may make it more precise: \{\w+} to match {, 1 or more letters/digits/_, and then }, or even \{(?:SS?HA|MD5)} to match SHA, SSHA or MD5 enclosed with {...})
[a-zA-Z0-9] - an ASCII letter or digit
[\w/-]+ - 1 or more word chars (letters, digits or _), / or - chars
=? - an optional = symbol, 1 or 0 occurrences (the greedy ? quantifier makes it match a = if one is found).
Python demo:
import re
s = """
TEXT {SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=
{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4= and some more
{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c text here
{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c= maybe."""
rx = r"\{[^{}]*}[a-zA-Z0-9][\w/-]+=?"
print(re.findall(rx, s))
# => ['{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=', '{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=', '{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c', '{MD5}5/DNVWwyafo-oIEzHnhv30rSN7c=']
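The stricter scheme prefix mentioned in the details, \{(?:SS?HA|MD5)}, can be compared with the generic one as follows. This is a sketch: the constant names are made up, and the {WORD}stuff= sample is a hypothetical non-hash string added to show the difference.

```python
import re

LOOSE = re.compile(r"\{[^{}]*}[a-zA-Z0-9][\w/-]+=?")
STRICT = re.compile(r"\{(?:SS?HA|MD5)}[a-zA-Z0-9][\w/-]+=?")

samples = [
    "{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M=",
    "{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4=",
    "{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c=",
    "{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c=",
    "{WORD}stuff=",  # hypothetical: braces but not a known scheme
]

# fullmatch() requires the whole string to match the pattern.
loose_hits = [s for s in samples if LOOSE.fullmatch(s)]
strict_hits = [s for s in samples if STRICT.fullmatch(s)]

print(len(loose_hits))   # 5: the generic prefix also accepts {WORD}
print(len(strict_hits))  # 4: only SHA, SSHA and MD5 prefixes
```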
I would suggest something along these lines:
\{[SHAMD5]{3,4}\}[^=]+=?
It will match a {, then 3 or 4 characters drawn from the letters and digit used in the scheme names you listed. You can change that to [A-Z0-9] to broaden it, but I like to keep it tighter to start. Then a }. Then one or more non-= characters, ending with an optional = character. Here is my Python demo:
import re
textlist = [
    "{SHA}qUqP5cyxm6YcTAhz05Hph5gvu9M="
    ,"{SSHA}QhikpbGFa5NAckbjcZ_K_WoJNh4="
    ,"{SSHA}5_DNVWsyofo-oIEzHnhv30rSN7c="
    ,"{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c="
    ,"{MD5}5/DNVWwyafo-pIEaHNhv39sSN7c"
    ,"test for break below"
    ,"{WORD}stuff="
    ,"{MD55/DNVWwyafo-pIEaHNhv39sSN7c="
    ,"MD5}5/DNVWwyafo-pIEaHNhv39sSN7c="
]
for text in textlist:
    if re.search(r"\{[SHAMD5]{3,4}\}[^=]+=?", text):
        print("match")
    else:
        print("no soup for you")
Note the end of the list has a few tests to make sure the regex doesn't just succeed on anything random.
