Python regex: match optional square brackets

I have the following strings:
1 "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
2 "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
3 '''GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND
ERIC JOHN WATSON CA CA51/03 26 May 2003'''
I am trying to find a regular expression that matches all of them. I don't know how to match the optional square brackets around the date at the end of the string, e.g. [16 May 2014].
casename = re.compile(r'(^[A-Z][A-Za-z\'\(\) ]+\b[v|V]\b[A-Za-z\'\(\) ]+(.*?)[ \[ ]\d+ \w+ \d\d\d\d[\] ])', re.S)
The date regex at the end only matches cases with dates in square brackets, not the ones without.
Thanks to everybody who answered. @Matt Clarkson, what I am trying to match is a judicial decision 'handle' in a much larger text. There is a large variation within those handles, but they all start at the beginning of a line, have 'v' for versus between the party names, and have a date at the end. The names of the parties are mostly, but not exclusively, in capitals. I am trying to get only one match per document and no false positives.

I got all of them to match using this (You'll need to add the case-insensitive flag):
(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)
Regex Demo
Explanation:
( Begin capture group
[a-z\'&\(\) ]+ Match one or more of the characters in this group
\b Match a word boundary
v Match the character 'v' literally
\b Match a word boundary
[a-z&\'\(\) ]+ Match one or more of the characters in this group
(?: Begin non-capturing group
.*? Match anything
) End non-capturing group
\[?\d+ \w+ \d{4}\]? Match a date, optionally surrounded by brackets
) End capture group
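As a quick check, here is a minimal sketch that runs the pattern above against the three strings from the question. The re.S flag is carried over from the question's own re.compile call (it lets .*? cross the line break in the third string); the names CASENAME and handles are made up for this example.

```python
import re

# Pattern from the answer above. re.I is the case-insensitive flag the answer
# asks for; re.S (taken from the question's code) lets .*? span the newline.
CASENAME = re.compile(
    r"(^[a-z][a-z\'&\(\) ]+\bv\b[a-z&\'\(\) ]+(?:.*?) \[?\d+ \w+ \d{4}\]?)",
    re.I | re.S,
)

handles = [
    "R J BRUCE & OTHERS V B J & W L A EDWARDS And Ors CA CA19/02 27 February 2003",
    "H v DIRECTOR OF PROCEEDINGS [2014] NZHC 1031 [16 May 2014]",
    "GREGORY LANCASTER AND JOHN HENRY HUNTER V CULLEN INVESTMENTS LIMITED AND\n"
    "ERIC JOHN WATSON CA CA51/03 26 May 2003",
]

# Collect the captured handle (or None) for each input string.
matched = []
for handle in handles:
    m = CASENAME.search(handle)
    matched.append(m.group(1) if m else None)

for text in matched:
    print(repr(text))
```

Each string should be captured in full, with or without brackets around the date.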

To make the square brackets optional, put a quantifier after them:
\[? matches zero or one opening bracket. [\[]* also works, but the * allows any number of brackets, not just one.
A few recommendations, if I may:
\d\d\d\d can be written more compactly as \d{4}.
[v|V]: inside [] every character is already an alternative, so the | is not needed (it would match a literal |); use [vV].
And here is an online demo.
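The points above can be tried out with a small sketch. The sample fragments are cut from the question's strings; the variable names are only for illustration.

```python
import re

# Date with optional surrounding brackets: \[? and \]? each match 0 or 1 bracket.
date = re.compile(r"\[?\d+ \w+ \d{4}\]?")

with_brackets = date.search("NZHC 1031 [16 May 2014]").group()
without_brackets = date.search("CA CA19/02 27 February 2003").group()

# Inside a character class, | is a literal character, not alternation.
pipe_in_class = bool(re.fullmatch(r"[v|V]", "|"))   # the class accepts '|'
pipe_in_fixed = bool(re.fullmatch(r"[vV]", "|"))    # the class rejects '|'

print(with_brackets)     # [16 May 2014]
print(without_brackets)  # 27 February 2003
print(pipe_in_class, pipe_in_fixed)  # True False
```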

Using your regex and input strings, it looks like you will match only the 2nd line (if you get rid of the '^' at the beginning of the regex). I've added inline comments to each section of the regular expression you provided to make it clearer.
Can you indicate what you are trying to capture from each line? Do you want the entire string? Only the word immediately preceding the lone letter 'v'? Do you want the date captured separately?
Depending on the portions that you wish to capture, each section can be broken apart into their respective match groups: regex101.com example. This is a little looser than yours (capturing the entire section between quotation marks instead of only the single word immediately preceding the lone 'v'), and broken apart to help readability (each "group" on its own line).
This example also assumes the newline is intentional, and supports the newline component (warning: it COULD suck up more than you intend, depending on whether the date at the end gets matched or not).

Related

Regex Name Retrieval

I'm attempting to write a simple Regex expression that retrieves names for me based on the presence of a character string at the end of a line.
I've been successful at isolating each of these patterns using pythex in my data set, but I have been unable to match them as a conditional group.
Can someone explain what I am doing wrong?
Data Example
Mark Samson: CA
Sam Smith: US
Dawn Watterton: CA
Neil Shughar: CA
Fennial Fontaine: US
I want to create a regex that uses the end of each line as the condition of the group match, i.e. I want a list of those who live in the US from this dataset. I have used each of these expressions in isolation and they seem to match what I am looking for. What I need is help in making the below a grouped search.
Does anyone have any suggestion?
([US]$)([A-Z][a-z]+)
Something like the following?
(\w+[ \w]*): US
You say "I have been unable to match them as a conditional group", but you are not using any conditional groups. ([US]$)([A-Z][a-z]+) is an example of a pattern that never matches any string as it matches U or S, then requires an end of string, and then matches an uppercase ASCII letter and one or more ASCII lowercase letters.
You want any string from start till a colon, whitespaces, and US substring at the end of string.
Hence, use
.+?(?=:\s*US$)
^(.+?):\s*US$
See the regex demo. Details:
.+? - one or more chars other than line break chars as few as possible
(?=:\s*US$) - a positive lookahead that matches a location immediately followed with :, zero or more whitespaces, US string and the end of string.
See a Python demo:
import re
texts = ["Mark Samson: CA", "Sam Smith: US", "Dawn Watterton: CA", "Neil Shughar: CA", "Fennial Fontaine: US"]
for text in texts:
    match = re.search(r".+?(?=:\s*US$)", text)
    if match:
        print(match.group())  # With r"^(.+?):\s*US$" regex, use match.group(1) here
Output:
Sam Smith
Fennial Fontaine
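If the data arrives as one multi-line string rather than a list, the capturing variant can be applied in a single call. This is a sketch that assumes one record per line; re.M makes ^ and $ anchor at each line, and the variable names are illustrative.

```python
import re

data = """Mark Samson: CA
Sam Smith: US
Dawn Watterton: CA
Neil Shughar: CA
Fennial Fontaine: US"""

# With re.M, ^ and $ match at the start and end of every line, so findall
# returns group 1 (the name) for each line that ends in ": US".
us_names = re.findall(r"^(.+?):\s*US$", data, flags=re.M)
print(us_names)  # ['Sam Smith', 'Fennial Fontaine']
```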

Python: find a string between 2 strings in text

I have a text like this
s = """
...
(1) Literature
1. a.
2. b.
3. c.
...
"""
I want to cut out the Literature section, but I have a problem detecting it.
I use this:
re.search("(1) Literature\n\n(.*).\n\n", s).group(1)
but search returns None.
The desired output is
(1) Literature
1. a.
2. b.
3. c.
What did I do wrong?
You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.
\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)
The pattern matches:
\(1\) Literature\n\n Match (1) Literature and 2 newlines
( Capture group 1
(?: Non capture group
\d+\..*(?:\n|$) Match 1+ digits and a dot, then the rest of the line, followed by either a newline or end of string
)+ Close non capture group and repeat it 1 or more times to match all the lines
) Close group 1
Regex demo
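A minimal sketch of this pattern in Python. It assumes that the sample text has a blank line between the heading and the list and another after the list, as the \n\n in the question's own regex suggests; the string s below is reconstructed on that assumption.

```python
import re

# Assumed layout: blank line after the heading and after the numbered list.
s = "...\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n...\n"

pattern = r"\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)"
m = re.search(pattern, s)

# Group 1 holds every consecutive line that starts with digits and a dot.
items = m.group(1)
print(items)
```

Calling items.splitlines() would then give the individual entries as a list.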
Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespaces.
\(1\) Literature((?:\n(?!\(\d+\)).*)*)
Regex demo
Parentheses have a special meaning in regex. They are used to group matches.
(1) - Capture 1 as the first capturing group.
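Because the pattern treats (1) as a group that matches just 1, it never matches the literal parentheses in the string, so the match fails. Also, .* stops at the end of the line, since . does not match a newline by default.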
Check Demo
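Both effects can be seen in a short sketch. The string s is a reconstruction of the question's text, assuming a blank line after the heading and after the list.

```python
import re

s = "...\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n...\n"

# Unescaped: (1) is a capture group matching "1", so the pattern looks for
# "1 Literature", which never occurs in s.
unescaped = re.search(r"(1) Literature", s)

# Escaped: the parentheses are matched literally.
escaped = re.search(r"\(1\) Literature", s)

# Without re.S, .* cannot cross a newline, so it stops after the first item.
first_line = re.search(r"\(1\) Literature\n\n(.*)", s).group(1)

print(unescaped)        # None
print(escaped.group())  # (1) Literature
print(first_line)       # 1. a.
```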
Based on your regex, I assumed you wanted to capture the line with the word Literature, 5 lines below it. Here is a regex to do so.
\(1\) Literature(.*\n){5}
Regex Demo
Note the escape characters used on the parentheses around 1.
EDIT
Based on zr0gravity7's comment, I came up with this regex to capture the middle section on the string.
\(1\)\sLiterature\n+((.*\n){3})
This regex will capture the below string in capturing group 1.
1. a.
2. b.
3. c.
Regex Demo
You may use this regex with a capture group:
r'\(1\)\s+Literature\s+((?:.+\n)+)'
RegEx Demo
Explanation:
\(1\): Match (1) text
\s+: Match 1+ whitespaces
Literature: Match the literal text Literature
\s+: Match 1+ whitespaces (here, the line breaks after the heading)
(: Start capture group #1
(?:.+\n)+: Match a line with 1+ character followed by newline. Repeat this 1 or more times to allow it to match multiple such lines
): End capture group #1
Regex for capturing the generic question with that structure:
\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)
It will capture the title "Literature", then the choices in another group (for a total of 2 groups).
A repeated capture group keeps only its last repetition, so to get each of your "1. a." items separately you would have to take the second group from above and match it again with this pattern:
(\d+\.\s+.+) applied globally (for example with re.findall) to get all items.
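The two-step approach could look like the sketch below. The string s is a reconstruction of the question's text (blank lines assumed), and the second step uses the simplified item pattern (\d+\.\s+.+) with re.findall.

```python
import re

s = "...\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n...\n"

# Step 1: capture the section title (group 1) and the block of items (group 2).
section = re.search(r"\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)", s)
title = section.group(1)
block = section.group(2)

# Step 2: match the block again, globally, to get one result per item.
entries = re.findall(r"(\d+\.\s+.+)", block)

print(title)    # Literature
print(entries)  # ['1. a.', '2. b.', '3. c.']
```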

Regular expressions to match numbers (both regular and romans)

I'm trying to write a regex to match both regular numbers (1, 2, 42...) and roman ones (X, VII...).
But the one I've currently written:
\b((?=[MDCLXVI])M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\b|\b\d+\b
is matching more than expected.
It has 9 matches, while I expect only 4:
XII
VII
2
12
How can I fix it?
You don't really need any lookahead in your regex.
Your regex can be simplified and refactored into this:
/
\b
(?:
[MDCLXVI]M{0,3}C[MD]
|
D?C{0,3}X[CL]
|
L?X{0,3}I[XV]
|
[XV]I{0,3}
|
I{1,3}
|
\d+
)
\b
/gix
Updated RegEx Demo
Note that I have used x (extended mode) so that the regex ignores all whitespace, which allows proper indentation between the alternations and makes the regex more readable. I don't know all permutations of Roman numerals, so I suggest you recheck each alternation.
The reason is the possibility of a zero-width match with just the word boundary patterns (i.e. \b(?=[MDCLXVI])\b matches before any word starting with a Roman numeral letter).
You need to make the word boundaries more precise: make the leading one match only before a word char, and the trailing one match only after a word char:
(?<!\w)(?:(?=[MDCLXVI])M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})|\d+)(?!\w)
See the regex demo.
Here, (?<!\w) acts as a word boundary that fails the match if, immediately to the left of the current location, there is a word char, and (?!\w) acts as a word boundary that fails the match if, immediately to the right of the current location, there is a word char.
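A sketch comparing the two patterns in Python. The question does not show its input text, so the sample sentence here is invented for illustration; it was chosen so that the expected matches are the four from the question.

```python
import re

# Hypothetical sample text (the question's actual input is not shown).
text = "Chapter XII and section VII cover items 2 and 12."

# Pattern from the question: \b...\b can succeed with an empty Roman part.
loose = (r"\b((?=[MDCLXVI])M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})"
         r"(I[XV]|V?I{0,3}))\b|\b\d+\b")

# Pattern from this answer: lookarounds instead of \b, no capturing groups.
strict = (r"(?<!\w)(?:(?=[MDCLXVI])M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})"
          r"(?:I[XV]|V?I{0,3})|\d+)(?!\w)")

loose_matches = [m.group() for m in re.finditer(loose, text)]
strict_matches = re.findall(strict, text)

print(loose_matches)   # includes an empty match before "Chapter"
print(strict_matches)  # ['XII', 'VII', '2', '12']
```

The empty string in loose_matches is the zero-width match described above: "Chapter" starts with C, so the lookahead passes and both \b assertions succeed at the same position.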

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0m with g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
\033[33m\033[1m
Group1: 33
Group2: 1
Match 2
\033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.
You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)
See the Python demo online
The (?:\\033\[\d+m){2,} pattern will match two or more sequences of \033[ + one or more digits + m chunks of texts and then, the match will be passed to the lambda expression, where the output will be: 1) \033[, 2) all the numbers after [ extracted with re.findall(r"\[(\d+)", m.group()) and deduplicated with the set, and then 3) m.
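One caveat: a set does not guarantee iteration order, so the codes may come out as 1;33 rather than 33;1. The sketch below is a variant, not the code above, that deduplicates with dict.fromkeys to keep the original order. The function name merge_codes is made up for this example, and the input is assumed to contain the literal characters \033 (a raw string), as in the question.

```python
import re

def merge_codes(text):
    """Fuse runs of two or more \\033[Nm chunks into one \\033[N;M...m chunk."""
    def fuse(m):
        codes = re.findall(r"\[(\d+)m", m.group())
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return r"\033[" + ";".join(dict.fromkeys(codes)) + "m"
    return re.sub(r"(?:\\033\[\d+m){2,}", fuse, text)

raw = (r"\033[33m\033[1mThis is a warning\033[0m: "
       r"Virus manager \033[4mfailed\033[0m\033[0m")
merged = merge_codes(raw)
print(merged)
```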
The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
r"^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)"
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
Demo
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), the backslash and zero are required, and the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m
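A short sketch of what this pattern returns for the question's input, assuming the text contains the literal characters \033 (hence the raw string). It covers only pairs of adjacent codes, as the answer says.

```python
import re

raw = (r"\033[33m\033[1mThis is a warning\033[0m: "
       r"Virus manager \033[4mfailed\033[0m\033[0m")

# Each match is exactly two adjacent codes; the lone ones are skipped.
pairs = [m.groups() for m in re.finditer(r"\\033\[(\d+)m\\033\[(\d+)m", raw)]
print(pairs)  # [('33', '1'), ('0', '0')]
```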

Regular expression in python: removing square brackets and parts of the phrase inside of the brackets

I have a Wikipedia dump and am struggling to find an appropriate regex pattern to remove the double square brackets in the expression. Here is an example of the expressions:
line = 'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the [[herbicide]]s and [[defoliant]]s used by the [[United States armed forces|U.S. military]] as part of its [[herbicidal warfare]] program, [[Operation Ranch Hand]], during the [[Vietnam War]] from 1961 to 1971.'
I am looking to remove all of the square brackets with the following conditions:
if there is no vertical separator within square bracket, remove the brackets.
Example : [[herbicide]]s becomes herbicides.
if there is a vertical separator within the bracket, remove the bracket and only use the phrase after the separator.
Example : [[United States armed forces|U.S. military]] becomes U.S. military.
I tried using re.match and re.search but was not able to arrive at the desired output.
Thank you for your help!
What you need is re.sub. Note that both square brackets and pipes are meta-characters so they need to be escaped.
re.sub(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]', r'\1', line)
The \1 in the replacement string refers to what was matched inside the parentheses, that do not start with ?: (i.e. in any case the text you want to have).
There are two caveats. This allows for only a single pipe between the opening and closing brackets. If there are more than one you would need to specify whether you want everything after the first or everything after the last one. The other caveat is that single ] between opening and closing brackets are not allowed. If that is a problem, there would still be a regex solution but it would be considerably more complicated.
For a full explanation of the pattern:
\[\[ # match two literal [
(?: # start optional non-capturing subpattern for pre-| text
[^\]|] # this looks a bit confusing but it is a negated character class
# allowing any character except for ] and |
* # zero or more of those
\| # a literal |
)? # end of subpattern; make it optional
( # start of capturing group 1 - the text you want to keep
[^\]|]* # the same character class as above
) # end of capturing group
\]\] # match two literal ]
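Wrapped in a small helper, the substitution could be used as in this sketch. The function name strip_wikilinks and the shortened sample sentence are made up for illustration.

```python
import re

# Same pattern as above: optional "target|" prefix, then the text to keep.
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")

def strip_wikilinks(text):
    """Replace [[target]] and [[target|label]] with target or label."""
    return WIKILINK.sub(r"\1", text)

sample = ("one of the [[herbicide]]s used by the "
          "[[United States armed forces|U.S. military]]")
cleaned = strip_wikilinks(sample)
print(cleaned)  # one of the herbicides used by the U.S. military
```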
>>> import re
>>> re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)]]', r'\1', line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'
Explanation:
\[\[ # match two opening square brackets
(?: # start optional non-capturing group
[^|\]]* # match any number of characters that are not '|' or ']'
\| # match a '|'
)? # end optional non-capturing group
( # start capture group 1
[^\]]* # match any number of characters that are not ']'
) # end capture group 1
]] # match two closing square brackets
By replacing matches of the above regex with the contents of capture group 1, you will get the contents of the square brackets, but only what is after the separator if it is present.
You can use re.sub to just find everything between [[ and ]], and I think it's slightly easier to pass in a lambda function to do the replacement (to take everything from the last '|' onwards)
>>> import re
>>> re.sub(r'\[\[(.*?)\]\]', lambda L: L.group(1).rsplit('|', 1)[-1], line)
'is the combination of the code names for Herbicide Orange (HO) and Agent LNX, one of the herbicides and defoliants used by the U.S. military as part of its herbicidal warfare program, Operation Ranch Hand, during the Vietnam War from 1961 to 1971.'
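The three approaches agree on the sample line but differ when a link contains more than one pipe, which is the caveat raised in the first answer. A sketch with a made-up input, [[a|b|c]], shows the difference; the variable names are illustrative.

```python
import re

tricky = "x [[a|b|c]] y"  # hypothetical link with two pipes

# First answer: at most one pipe is allowed, so nothing matches.
single_pipe = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", tricky)

# Second answer: keeps everything after the first pipe.
after_first = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)]]", r"\1", tricky)

# Lambda answer: keeps everything after the last pipe.
after_last = re.sub(r"\[\[(.*?)\]\]",
                    lambda m: m.group(1).rsplit("|", 1)[-1], tricky)

print(single_pipe)  # x [[a|b|c]] y
print(after_first)  # x b|c y
print(after_last)   # x c y
```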
