I am trying to write a Regex validator (Python 3.8) to accept strings like these:
foo
foo,bar
foo, bar
foo , bar
foo , bar
foo, bar,foobar
This is what I have so far (but it matches only the first two cases):
^[a-zA-Z][0-9a-zA-Z]+(,[a-zA-Z][0-9a-zA-Z]+)*$|^[a-zA-Z][0-9a-zA-Z]+
However, when I add the whitespace match \w, it stops matching altogether:
^[a-zA-Z][0-9a-zA-Z]+(\w+,\w+[a-zA-Z][0-9a-zA-Z]+)*$|^[a-zA-Z][0-9a-zA-Z]+
What pattern should I use, and why does my second pattern above not match?
\w matches word characters ([0-9a-zA-Z_], plus other Unicode word characters for str patterns in Python 3); it does not match whitespace. Whitespace is matched by \s.
What you need is this regex:
^[a-zA-Z][0-9a-zA-Z]*(?:\s*,\s*[a-zA-Z][0-9a-zA-Z]*)*$
RegEx Details:
^: Start
[a-zA-Z][0-9a-zA-Z]*: Match a text starting with a letter followed by 0 or more alphanumeric characters
(?:: Start non-capture group
\s*,\s*: Match a comma optionally surrounded with 0 or more whitespaces on both sides
[a-zA-Z][0-9a-zA-Z]*: Match a text starting with a letter followed by 0 or more alphanumeric characters
)*: End non-capture group. Repeat this group 0 or more times
$: End
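As a quick check, the pattern can be run against the sample inputs from the question. This is a minimal sketch; the helper name is_valid_list is made up for illustration and is not part of the original answer.

```python
import re

# Pattern from the answer above: identifiers separated by commas,
# with optional whitespace on either side of each comma.
PATTERN = re.compile(r"^[a-zA-Z][0-9a-zA-Z]*(?:\s*,\s*[a-zA-Z][0-9a-zA-Z]*)*$")


def is_valid_list(text):
    """Return True if text is a comma-separated list of identifiers."""
    return PATTERN.match(text) is not None


for sample in ["foo", "foo,bar", "foo, bar", "foo , bar", "foo, bar,foobar"]:
    print(repr(sample), is_valid_list(sample))
```

Inputs such as "1foo", "foo," or "foo bar" are rejected, because every item must start with a letter and items must be separated by a comma.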
I have a text like this;
[Some Text][1][Some Text][2][Some Text][3][Some Text][4]
I want to match [Some Text][2] with this regex;
/\[.*?\]\[2\]/
But it returns [Some Text][1][Some Text][2]
How can I match only [Some Text][2]?
Note: Some Text can contain any character, including [ and ], and the numbers in square brackets can be any number, not only 1 and 2. The Some Text that I want to match can be at the beginning of the line, and there can be multiple Some Texts.
The \[.*?\]\[2\] pattern works like this:
\[ - finds the leftmost [ (as the regex engine processes the string input from left to right)
.*? - matches any 0+ chars other than line break chars, as few as possible, but as many as needed for a successful match, as there are subsequent patterns, see below
\]\[2\] - ][2] substring.
So, the .*? gets expanded upon each failure until it finds the leftmost ][2]. Note the lazy quantifiers do not guarantee the "shortest" matches.
Solution
Instead of a .*? (or .*) use negated character classes that match any char but the boundary char.
\[[^\]\[]*\]\[2\]
Here, .*? is replaced with [^\]\[]* - 0 or more chars other than ] and [.
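The difference between the two patterns can be seen with a short Python sketch. The variable names are illustrative only; the question itself used JavaScript, but the patterns behave the same way in Python's re module.

```python
import re

text = "[Some Text][1][Some Text][2][Some Text][3][Some Text][4]"

# Lazy dot: starts at the leftmost "[" and expands until "][2]" is found.
lazy = re.search(r"\[.*?\]\[2\]", text)

# Negated character class: cannot cross a "[" or "]" inside the first pair.
negated = re.search(r"\[[^\]\[]*\]\[2\]", text)

print(lazy.group())
print(negated.group())
```

The first search returns the text starting at the very first bracket, while the second returns only [Some Text][2].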
Other examples:
Strings between angle brackets: <[^<>]*> matches <...> with no < and > inside
Strings between parentheses: \([^()]*\) matches (...) with no ( and ) inside
Strings between double quotation marks: "[^"]*" matches "..." with no " inside
Strings between curly braces: \{[^{}]*} matches {...} with no { and } inside
In other situations, when the starting pattern is a multichar string or complex pattern, use a tempered greedy token, (?:(?!start).)*?. To match abc 1 def in abc 0 abc 1 def, use abc(?:(?!abc).)*?def.
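A small Python sketch of the tempered greedy token, using the abc ... def example from the paragraph above (variable names are illustrative):

```python
import re

text = "abc 0 abc 1 def"

# Plain lazy match: begins at the first "abc" and runs through to "def".
plain_lazy = re.search(r"abc.*?def", text)

# Tempered token: each consumed character must not begin another "abc",
# so the match has to start at the last "abc" before "def".
tempered = re.search(r"abc(?:(?!abc).)*?def", text)

print(plain_lazy.group())
print(tempered.group())
```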
You could try the below regex,
(?!^)(\[[A-Z].*?\]\[\d+\])
The following works for a simple comma-delimited string that has no periods, but it breaks if the string contains real numbers with periods.
pattern = re.compile(r"^(\w+)(,\s*\w+)*$")
How can I modify or change the above to ignore periods? But still validate the given string is comma delimited?
A sample test string is "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0".
\w matches "word" characters: letters, digits and _. It doesn't match a dot. If you want to match dots as well, use [\w.] instead of \w:
pattern = re.compile(r"^([\w.]+)(,\s*[\w.]+)*$")
You might also want to add -, if you expect negative numbers. To put - in a character class, you either have to backslash escape it or make sure it's either the first or last character in the class:
[-.\w]
[\w.-]
[\w\-.]
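As a rough check, the character class with both the dot and the hyphen can be tried on the sample string from the question. This sketch uses a non-capturing group for the repeated part, which is a small variation on the pattern above rather than the answer's exact code.

```python
import re

# [\w.-] accepts word characters, dots and hyphens in each field.
pattern = re.compile(r"^[\w.-]+(?:,\s*[\w.-]+)*$")

sample = "23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0"

print(bool(pattern.match(sample)))
print(bool(pattern.match("23,HIGH,-1.5,LOW")))
print(bool(pattern.match("23,,HIGH")))
```

An empty field, as in "23,,HIGH", or a trailing comma is rejected because every comma must be followed by at least one allowed character.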
If a value containing a dot can only be a number, and matching arbitrary dots is not desired, you can use an alternation to match either word characters or a number.
^(?:[+-]?\d*\.?\d+|\w+)(?:,(?:[+-]?\d*\.?\d+|\w+))*$
Explanation
^ Start of string
(?: Non capture group
[+-]?\d*\.?\d+ Match an optional + or -, then optional digits, optional dot and 1+ digits
| Or
\w+ Match 1+ word characters
) Close non capture group
(?: Non capture group
, Match the comma
(?:[+-]?\d*\.?\d+|\w+) The same pattern as in the first part
)* Close non capture group and optionally repeat to also match a single occurrence
$ End of string
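A short sketch of how this stricter pattern behaves in Python. The test strings other than the sample from the question are made up for illustration.

```python
import re

# A field is either an optionally signed number or a run of word characters.
FIELD = r"(?:[+-]?\d*\.?\d+|\w+)"
strict = re.compile(rf"^{FIELD}(?:,{FIELD})*$")

print(bool(strict.match("23,HIGH,1.0,LOW,1.0,HIGH,1.0,LOW,1.0")))
print(bool(strict.match("-2.5,LOW,.5")))
print(bool(strict.match("1.0.0,HIGH")))
print(bool(strict.match("a.b,HIGH")))
```

Unlike [\w.]+, this version rejects fields such as 1.0.0 or a.b, where the dot is not part of a single number.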
I have a text like this
s = """
...
(1) Literature
1. a.
2. b.
3. c.
...
"""
I want to cut out the Literature section, but I have a problem detecting it.
Here is what I use:
re.search("(1) Literature\n\n(.*).\n\n", s).group(1)
but search returns None.
The desired output is
(1) Literature
1. a.
2. b.
3. c.
What did I do wrong?
You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.
\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)
The pattern matches:
\(1\) Literature\n\n Match (1) Literature and 2 newlines
( Capture group 1
(?: Non capture group
\d+\..*(?:\n|$) Match 1+ digits and a dot, then the rest of the line, followed by either a newline or the end of the string
)+ Close non capture group and repeat it 1 or more times to match all the lines
) Close group 1
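A minimal sketch of this pattern in Python. The sample string s is an assumption: it places a blank line after the heading and after the list, which is what the \n\n in the pattern expects.

```python
import re

# Assumed sample text: blank lines surround the numbered list.
s = "intro\n\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Other\n"

match = re.search(r"\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)", s)

if match:
    # group(1) holds only the numbered lines; group(0) also has the heading.
    print(match.group(1))
```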
Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespaces.
\(1\) Literature((?:\n(?!\(\d+\)).*)*)
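The lookahead variant can be sketched the same way. The sample string is again an assumption about the real input; str.strip() does the trimming mentioned above.

```python
import re

# Assumed sample text with a following "(2) ..." section.
s = "intro\n\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Other\n"

match = re.search(r"\(1\) Literature((?:\n(?!\(\d+\)).*)*)", s)

if match:
    # The group includes the surrounding blank lines, so trim them.
    section = match.group(1).strip()
    print(section)
```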
Parentheses have a special meaning in regex. They are used to group matches.
(1) - Capture 1 as the first capturing group.
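Because the parentheses in the pattern are not escaped, (1) matches only the character 1 rather than the literal text (1), so the match fails. In addition, .* stops at the end of the line, because . does not match a newline by default.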
Based on your regex, I assumed you wanted to capture the line with the word Literature, 5 lines below it. Here is a regex to do so.
\(1\) Literature(.*\n){5}
Note the escape characters used on the parentheses around 1.
EDIT
Based on zr0gravity7's comment, I came up with this regex to capture the middle section on the string.
\(1\)\sLiterature\n+((.*\n){3})
This regex will capture the below string in capturing group 1.
1. a.
2. b.
3. c.
You may use this regex with a capture group:
r'\(1\)\s+Literature\s+((?:.+\n)+)'
Explanation:
\(1\): Match (1) text
\s+: Match 1+ whitespaces
Literature: Match the literal text Literature
\s+: Match 1+ whitespaces (including newlines)
(: Start capture group #1
(?:.+\n)+: Match a line with 1+ character followed by newline. Repeat this 1 or more times to allow it to match multiple such lines
): End capture group #1
Regex for capturing the generic question with that structure:
\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)
It will capture the title "Literature", then the choices in another group (for a total of 2 groups).
A repeated capture group keeps only its last iteration, so in order to get each of your "1. a." items separately you would have to match the second group from above again, with this pattern:
(\d+\.\s+.+) and then match globally (for example with re.findall in Python) to get all items.
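The two-step approach could look roughly like this in Python. The sample text, including the second "(2) History" section, is invented for illustration, and the per-item pattern is used without the outer repetition so that re.findall returns every item.

```python
import re

# Assumed sample text with two sections of the same structure.
s = "(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) History\n\n1. x.\n2. y.\n"

section_re = re.compile(r"\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)")
item_re = re.compile(r"\d+\.\s+.+")

sections = {}
for m in section_re.finditer(s):
    title, block = m.group(1), m.group(2)
    # Second pass: pull each numbered line out of the captured block.
    sections[title] = item_re.findall(block)

print(sections)
```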
I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the Python re library. I am using regex101 to test, and it appears the regex above does quite a bit of backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-insensitive flag (re.IGNORECASE) set.
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.* # match 0+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
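A minimal sketch of applying this pattern in Python, one line at a time, to the sample text from the question. Testing each line separately is a choice made here for simplicity; re.MULTILINE on the whole text would be an alternative.

```python
import re

PATTERN = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,
)

lines = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

# Keep only lines with 5+ distinct words that end in punctuation.
matching = [line for line in lines if PATTERN.match(line)]

for line in matching:
    print(line)
```

Only the first and fifth lines pass, as requested in the question.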
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the space character " " and checking it word by word. Quicker, no sweat.
Edit
Item 2 can be done with a lookahead, as Cary said.
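For comparison, the split-based check suggested in this answer could be sketched as follows. The function name, the min_words parameter and the choice to compare words case-insensitively are assumptions made for illustration.

```python
import string


def is_long_distinct_sentence(sentence, min_words=5):
    """Check: ends with . ? or !, has min_words+ words, all distinct."""
    stripped = sentence.strip()
    if not stripped or stripped[-1] not in ".?!":
        return False
    # Split on whitespace, drop surrounding punctuation, ignore case.
    words = [w.strip(string.punctuation).lower() for w in stripped.split()]
    words = [w for w in words if w]
    return len(words) >= min_words and len(set(words)) == len(words)


print(is_long_distinct_sentence("This is a sentence I would like to parse."))
print(is_long_distinct_sentence("Not not not distinct distinct words words."))
```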
I am trying to come up with a regex that finds a keyword at the end of a line and captures the word at the beginning of the next line (if present).
I have tried the regex below, but it does not return the desired result:
re.compile(fr"\s(?!^)(keyword1|keyword2|keyword3)\s*\$\n\r\((\w+\W+|W+\w+))", re.MULTILINE | re.IGNORECASE)
My input for example is
sentence = """ This is my keyword
/n value"""
Output in above case should be keyword value
Thanks in advance
You could match the keyword (or use an alternation to match more keywords) and account for trailing tabs and spaces both after the keyword and after the newline.
Using 2 capturing groups as in the pattern you tried:
(?<!\S)(keyword)[\t ]*\r?\n[\t ]*(\w+)(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert what is directly on the left is not a non whitespace char
(keyword) Capture in group 1 matching the keyword
[\t ]* Match 0+ tabs or spaces
\r?\n Match newline
[\t ]* Match 0+ tabs or spaces
(\w+) Capture group 2 match 1+ word chars
(?!\S) Negative lookahead, assert what is directly on the right is not a non whitespace char
For example:
import re

regex = r"(?<!\S)(keyword)[\t ]*\r?\n[\t ]*(\w+)(?!\S)"
test_str = (" This is my keyword\n"
            " value")
matches = re.search(regex, test_str)
if matches:
    # Join the keyword (group 1) with the word on the next line (group 2)
    print('{} {}'.format(matches.group(1), matches.group(2)))
Output
keyword value
How about \b(keyword)\n(\w+)\b?
\b(keyword)\n(\w+)\b
\b get a word boundary
(keyword) capture keyword (replace with whatever you want)
\n match a newline
(\w+) capture some word characters, one or more
\b get a word boundary
Because keyword and \w+ are in capture groups, you can reference them as you wish later in your code.
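A short sketch of using the two capture groups in Python. The test string is an assumption with no spaces around the newline, since this simpler pattern does not allow for them.

```python
import re

text = "This is my keyword\nvalue"

m = re.search(r"\b(keyword)\n(\w+)\b", text)

if m:
    # groups() returns the keyword and the word on the next line.
    result = " ".join(m.groups())
    print(result)
```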
My guess is that, depending on the number of new lines that you might have, an expression similar to:
\b(keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)
might be somewhat close, and the value is in \2. You can also make the first group non-capturing:
\b(?:keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)
\1 is the value.
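A rough Python sketch of the non-capturing version, assuming the character class is meant to be [\r\n] so that both Unix and Windows line endings are covered. The sample strings are invented.

```python
import re

pattern = re.compile(
    r"\b(?:keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)",
    re.IGNORECASE,
)

unix_text = "This is my keyword1\nvalue"
windows_text = "This is my Keyword2\r\nother"

unix_match = pattern.search(unix_text)
windows_match = pattern.search(windows_text)

print(unix_match.group(1))
print(windows_match.group(1))
```

This pattern does not allow spaces between the keyword and the line break, so input with trailing whitespace would need [\t ]* added around the newline part.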