How to exclude two words from regex? - python

I have this regex:
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*([\w\s][^cui]+)
That should match
] AN 1 words 2 words 3 words
or
] AV 1 words 2 words 3 words
The words after 3 should exclude "da cui", so "da\scui", but it doesn't work. Try it here: https://regex101.com/r/kI7Tan/1
What am I doing wrong?
Sample string:
campo] AN 1 campo 2 prato con penna B sps a 1 3 da cui campo con penna C as a 1 cfr Nota filologica
Expected output: it won't match it because of the "da cui". So basically I want to match all words without the string "da cui".

The final capture group of the regex ( ([\w\s][^cui]+) ) matches ...
Exactly 1 word character due to the first character class.
This class does not match a whitespace due to the preceding \s* in the regex.
Any number of characters other than c, u, i.
If you want to exclude matches contingent on the word(s) da cui, use a negative lookahead.
\]\s*(AN|AV)\s*1\s*([\w\s]+)\s*2\s*([\w\s]+)\s*3\s*(?!.*da cui)(.*)
See the demo (regex101).
Update
Capture group reintroduced to the regex.

You may use either of the two:
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*((?:(?!cui).)*)
\]\s*(AN|AV)\s*1\s*([\w\s]+?)\s*2\s*([\w\s]+?)\s*3\s*(.*?)(?=cui|$)
See the regex demo
The (?:(?!cui).)* is a tempered greedy token that matches any char, 0 or more occurrences, as many as possible, that does not start a cui char sequence. The (.*?)(?=cui|$) pattern captures 0+ chars other than line break chars, as few as possible, up to the cui char sequence or end of string.

My interpretation of the question, as it concerns the string that follows one or more spaces after 3 (to the end of the line), is that if the string da cui is present in that string an empty string is to be saved to capture group 4, else that string is to be saved to capture group 4.
You could use the following regular expression.
\]\s*(AN|AV)\s+1\s+([\w\s]+)\s+2\s+([\w\s]+)\s+3\s+((?=.*\bda cui\b)|(?!=.*\bda cui\b).*)
Demo
This replaces 3\s*([\w\s][^cui]+) in the OP's regex with 3\s+((?=.*\bda cui\b)|(?!=.*\bda cui\b).*).
Python's regex engine performs the following steps after matching 3.
\s+ match 1+ spaces
( begin capture group 4
(?=.*\bda cui\b) match 0+ chars, then 'da cui' in a positive lookahead
| or
(?!=.*\bda cui\b) match 0* chars, then 'da cui' in a negative lookahead
.* match rest of line
) end capture group 4
If the positive lookahead succeeds an empty string is saved to the capture group.

Related

Python: find a string between 2 strings in text

I have a text like this
s = """
...
(1) Literature
1. a.
2. b.
3. c.
...
"""
I want to cut Literature section but I have some problem with detection.
I use here
re.search("(1) Literature\n\n(.*).\n\n", s).group(1)
but search return None.
Desire output is
(1) Literature
1. a.
2. b.
3. c.
What did I do wrong?
You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.
\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)
The pattern matches:
\(1\) Literature\n\n Match (1) Literature and 2 newlines
( Capture group 1
(?: Non capture group
\d+\..*(?:\n|$) Match 1+ digits and a dot followed by either a newline or end of string
)+ Close non capture group and repeat it 1 or more times to match all the lines
) Close group 1
Regex demo
Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespaces.
\(1\) Literature((?:\n(?!\(\d+\)).*)*)
Regex demo
Parentheses have a special meaning in regex. They are used to group matches.
(1) - Capture 1 as the first capturing group.
Since the string has parentheses in it, the match is not successful. And .* capturing end with line end.
Check Demo
Based on your regex, I assumed you wanted to capture the line with the word Literature, 5 lines below it. Here is a regex to do so.
\(1\) Literature(.*\n){5}
Regex Demo
Note the scape characters used on parentheses around 1.
EDIT
Based on zr0gravity7's comment, I came up with this regex to capture the middle section on the string.
\(1\)\sLiterature\n+((.*\n){3})
This regex will capture the below string in capturing group 1.
1. a.
2. b.
3. c.
Regex Demo
You may use this regex with a capture group:
r'\(1\)\s+Literature\s+((?:.+\n)+)'
RegEx Demo
Explanation:
\(1\): Match (1) text
\s+: Match 1+ whitespaces
Literature:
\s+:
(: Start capture group #1
(?:.+\n)+: Match a line with 1+ character followed by newline. Repeat this 1 or more times to allow it to match multiple such lines
): End capture group #1
Regex for capturing the generic question with that structure:
\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)
It will capture the title "Literature", then the choices in another group (for a total of 2 groups).
It is not possible to capture repeating groups, so in order to get each of your "1. a." in a separate group you would have to match the second group from above again, with this pattern:
((\d+\.\s+.+)\n)+) then globally match to get all groups.

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the python re library. I am using regex101 to test and it appears the regex I have above is doing quite a bit of work regards to backtracking so I imagine those knowledgable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-indifferent flag set.
Demo
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.+ # match 1+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.+ # match 1+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string in the character " " and checking word by word. Quickier, no sweat.
Edit
can be done with a lookahead as Cary said.

Match capture group fixed number of times

I have a bunch of 5-letter strings. For each string, I would like to match only if the string contains 3 instances of the same letter, i.e.:
Case 1: 'aabbc' -> no match
Case 2: 'bbbcc' -> match 'bbb'
Case 3: 'ddcdc' -> match 'ddd'
My best regex attempt is:
(.){1}(?!\1)*\1{1}(?!\1)*\1{1}
This works for case 1 (where there is no match) and case 2 (where the 3 instances are adjacent), but not for case 3 (where the 3 instances are separated by at least one other letter).
Is there a regex that will work for case 3? Ideally I would like to also extract the locations of the 3 matching instances from the string.
The below pattern catches what you need and should capture the edge cases. The first capturing group can be modified to just be the subset of characters you need to search for if there is a limited list of expected values. Putting the \1s in capturing groups means that you should be able to extract the index of the capturing groups from the match via .start() (getting the starting index of the capturing group), meeting your bonus goal.
>>> pattern = r"(.).*(\1).*(\1)"
>>> x = re.search(pattern, "ababb")
>>> x.groups()
('b', 'b', 'b')
>>> x.start(1)
1
>>> x.start(2)
3
>>> x.start(3)
4
I think the pattern ([a-z]).*?\1.*?\1 does what you want, although there are likely to be edge cases that would complicate it.
The pattern looks for a lowercase letter three times, with 0 or more characters between them.
You could then extract just the capturing groups to get your match locations.
At the moment, the pattern only looks for any lowercase letter repeated three times, but you could change the initial capturing group - ([a-z]) - if you wanted to capture something else.
Demo
You can use the following regex to determine if one character appears at least three times.
^.*(.).*\1.*\1
Demo
This does not check that the characters are letters but it does work with any characters. To restrict to letters change each . to [a-z] or [a-zA-Z], as appropriate.
To see if one character appears exactly 3 times, change the regex to:
^(?!.*(.)(?:.*\1){3,}).*(.).*\2.*\2
Demo
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
(.) # match a char in cap grp 2
(?:.*\1) # match 0+ chars followed by content of cap grp 1
# in a non-cap grp
{3,} # execute non-cap grp 3+ times
) # end negative lookahead
.* # match 0+ chars
(.) # match char in cap grp 2
.* # match 0+ chars
\2 # match content of cap grp 2
.* # match 0+ chars
\2 # match content of cap grp 2

regex search remove word

I want remove first 4 words from paragraph
Original : Mywebsite 21 12 34 have 10000 traffic
What i want result : have 10000 traffic
i have 1000 of line same as original paragraph ( Mywebsite 21 12 34 have 10000 traffic)
i have regex search code which is work like this :
Below code is remove first word from sentence :
^\w+\s+(.*) = replace with $1
Following code will remove all numbers from line :
[0-9 ]+ = replace with space
I want combine above code, and make one regex search code work as i explain above, but not to affect any other words same line .
If your lines are all in the exact same format, i.e. if you always need to remove the first 4 words you can do something like this which is way simpler to understand than a RegEx:
# Iterate through all your lines
for line in lines:
# Split the line string on spaces to create an array of words.
words = line.split(' ')
# Exclude the 4 first words and re-join the string with the remaining words.
line = ' '.join(words[4:])
The pattern that you tried ^\w+\s+(.*) matches 1+ word chars, 1+ whitespace chars and then any char except a newline until the end of the string so that will match the whole string.
To remove the first word and the following 3 times 2 digits, you might use:
^\s*\w+(?: \d{2}){3}\s*
^ Start of string
\s* Match 0+ whitespace chars
\w+ Match 1+ word chars
(?: \d{2}){3} Repeat 3 times matching a space and 2 digits
\s* Match 0+ whitespace chars
Regex demo | Python demo
Note that \s also matches a newline. If you only want to match spaces or tabs you could use [ \t] instead.
You may use
re.sub(r'^(\w+\s)[\d\s]+', r'\1', text)
See the regex demo a
The pattern will match
^ - start of string
(\w+\s) - Capturing group 1: one or more word chars and a whitespace
[\d\s]+ - 1+ whitespace or digit chars.
Python demo:
import re
rx = re.compile(r"^(\w+\s)[\d\s]+")
s = "Mywebsite 21 12 34 have 10000 traffic"
print( rx.sub(r"\1", s) ) # => Mywebsite have 10000 traffic

How to extract data from string

My code is
import regex
word = '\x02|1280|SELECT|35;36|="214554"'.encode('ascii')
pattern = r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}+|;*)\|="(\w+)"'.encode('ascii')
print(regex.match(pattern, word).group(4))
and I'm interested in group 4 -> (\d{1,2}+|;*) that can have following pattern
|one digit number|
|two-digit number|
|one/two-digit number; one/two-digit number; ... ; one/two-digit number|
I have tried different combination, but as I'm new to regex none of them returns data from group.
How about changing the pattern for group 4 to: (\d{1,2}(?:;\d{1,2})*)?
\d{1,2} represents one or two digits
(?:;\d{1,2})* represents zero or more non-capturing groups that include a semi colon ; followed by one or two digits numbers
Important to mark the group as non-capturing by adding a (?: at the start
Regex101 Demo
Hope this helps!
The \d{1,2}+|;* pattern matches 1 or 2 digits possessively or 0+ semi-colon. So, it is not what you need.
You need to write the pattern like this:
r'^(\x02)\|(\d{1,4})\|(SELECT|UPDATE|INSERT)\|(\d{1,2}(?:;\d{1,2})*)\|="(\w+)"'
See the Python demo.
The Group 4 pattern will look like (\d{1,2}(?:;\d{1,2})*):
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing group that matches sequences of....
; - a semi-colon
\d{1,2} - 1 or 2 digits
)* - .... zero or more occurrences.

Categories