Regex Search end of line and beginning of next line - python

I am trying to come up with a regex that matches a keyword at the end of a line together with the word at the beginning of the next line (if present).
I have tried the regex below, but it does not return the desired result:
re.compile(fr"\s(?!^)(keyword1|keyword2|keyword3)\s*\$\n\r\((\w+\W+|W+\w+))", re.MULTILINE | re.IGNORECASE)
My input for example is
sentence = """ This is my keyword
/n value"""
Output in above case should be keyword value
Thanks in advance

You could match the keyword (or use an alternation to match several keywords), allow trailing tabs and spaces after the keyword, match the newline, and then allow leading tabs and spaces on the next line.
Using 2 capturing groups as in the pattern you tried:
(?<!\S)(keyword)[\t ]*\r?\n[\t ]*(\w+)(?!\S)
Explanation
(?<!\S) Negative lookbehind, assert what is directly on the left is not a non whitespace char
(keyword) Capture in group 1 matching the keyword
[\t ]* Match 0+ tabs or spaces
\r?\n Match newline
[\t ]* Match 0+ tabs or spaces
(\w+) Capture group 2 match 1+ word chars
(?!\S) Negative lookahead, assert what is directly on the right is not a non whitespace char
For example:
import re
regex = r"(?<!\S)(keyword)[\t ]*\r?\n[\t ]*(\w+)(?!\S)"
test_str = (" This is my keyword\n"
" value")
matches = re.search(regex, test_str)
if matches:
    print('{} {}'.format(matches.group(1), matches.group(2)))
Output
keyword value
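The same pattern extends to several keywords through an alternation. The following is a minimal sketch, assuming the placeholder names keyword1 to keyword3 from the question; the helper name find_wrapped_keywords and the sample text are hypothetical and not part of the original answer.

```python
import re

# Hypothetical keyword list taken from the question's placeholders.
KEYWORDS = ["keyword1", "keyword2", "keyword3"]

# re.escape guards against keywords that contain regex metacharacters.
PATTERN = re.compile(
    r"(?<!\S)(" + "|".join(map(re.escape, KEYWORDS)) + r")[\t ]*\r?\n[\t ]*(\w+)(?!\S)",
    re.IGNORECASE,
)

def find_wrapped_keywords(text):
    """Return (keyword, next_word) pairs where a keyword ends a line."""
    return PATTERN.findall(text)

# Made-up sample with trailing spaces, leading spaces and a \r\n line ending.
sample = "This is my keyword1  \n   value\nand Keyword3\r\nother text"
print(find_wrapped_keywords(sample))
# [('keyword1', 'value'), ('Keyword3', 'other')]
```

With two capturing groups, re.findall returns one tuple per match, so each pair can be joined with a space to get output such as "keyword1 value".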

How about \b(keyword)\n(\w+)\b?
\b(keyword)\n(\w+)\b
\b get a word boundary
(keyword) capture keyword (replace with whatever you want)
\n match a newline
(\w+) capture some word characters, one or more
\b get a word boundary
Because keyword and \w+ are in capture groups, you can reference them as you wish later in your code.

My guess is that, depending on the number of newlines that you might have, an expression similar to:
\b(keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)
might be somewhat close, with the value in \2. You can also make the first group non-capturing:
\b(?:keyword1|keyword2|keyword3)\b[\r\n]{1,2}(\S+)
and then \1 is the value.
If you wish to explore, simplify, or modify the expression, it is explained in the top right panel of regex101.com, where you can also see how it matches against some sample inputs.

Related

Regex pattern matching comma delimited values with spaces allowed around comma

I am trying to write a Regex validator (Python 3.8) to accept strings like these:
foo
foo,bar
foo, bar
foo , bar
foo , bar
foo, bar,foobar
This is what I have so far (but it matches only the first two cases):
^[a-zA-Z][0-9a-zA-Z]+(,[a-zA-Z][0-9a-zA-Z]+)*$|^[a-zA-Z][0-9a-zA-Z]+
However, when I add the whitespace match \w, it stops matching altogether:
^[a-zA-Z][0-9a-zA-Z]+(\w+,\w+[a-zA-Z][0-9a-zA-Z]+)*$|^[a-zA-Z][0-9a-zA-Z]+
What is the pattern to use (with explanation as to why my second pattern above is not matching).
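\w matches word characters ([0-9a-zA-Z_] in ASCII mode) and does not include whitespace, which is why your second pattern stops matching. The whitespace class is \s.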
What you need is this regex:
^[a-zA-Z][0-9a-zA-Z]*(?:\s*,\s*[a-zA-Z][0-9a-zA-Z]*)*$
RegEx Details:
^: Start
[a-zA-Z][0-9a-zA-Z]*: Match a text starting with a letter followed by 0 or more alphanumeric characters
(?:: Start non-capture group
\s*,\s*: Match a comma optionally surrounded with 0 or more whitespaces on both sides
[a-zA-Z][0-9a-zA-Z]*: Match a text starting with a letter followed by 0 or more alphanumeric characters
)*: End non-capture group. Repeat this group 0 or more times
$: End
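As a rough illustration of the pattern above, here is a minimal validator sketch. The function name is_valid_list is hypothetical, and re.fullmatch is used in place of the ^ and $ anchors (an assumption made for the sketch, so that a trailing newline is also rejected).

```python
import re

# Same pattern as above, without ^ and $ because fullmatch anchors both ends.
ITEM_LIST = re.compile(r"[a-zA-Z][0-9a-zA-Z]*(?:\s*,\s*[a-zA-Z][0-9a-zA-Z]*)*")

def is_valid_list(value):
    """Return True if value is a comma-separated list of identifiers."""
    return ITEM_LIST.fullmatch(value) is not None

candidates = ["foo", "foo,bar", "foo, bar", "foo , bar", "foo,  bar,foobar",
              "foo,", "1foo", "foo bar"]
for candidate in candidates:
    print(repr(candidate), is_valid_list(candidate))
```

The first five candidates are accepted; the last three are rejected because of a trailing comma, a leading digit, and a missing comma, respectively.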

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the python re library. I am using regex101 to test and it appears the regex I have above is doing quite a bit of work regards to backtracking so I imagine those knowledgable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-insensitive flag (re.IGNORECASE) set.
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.* # match 0+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
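To check the pattern against the sample text from the question, a short sketch such as the following can be used. It tests each line separately with re.search; the variable names are illustrative only.

```python
import re

SENTENCE = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,  # so that "Not" and "not" count as the same word
)

lines = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

# Collect the 1-based numbers of the lines that satisfy all three conditions.
matching = [n for n, line in enumerate(lines, start=1) if SENTENCE.search(line)]
print(matching)  # [1, 5]
```

To scan a multi-line string in one pass instead, the re.MULTILINE flag would be needed so that ^ and $ match at every line boundary.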
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the space character " " and checking word by word. Quicker, no sweat.
Edit
It can be done with a lookahead, as Cary said.
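For comparison, the split-and-check approach could look roughly like this. It is a sketch under some assumptions that the question does not state: words are compared case-insensitively, surrounding punctuation is stripped from each word, and the function name is_valid_sentence is hypothetical.

```python
import string

def is_valid_sentence(line, min_words=5):
    """Check: ends with . ? or !, has at least min_words words, all distinct."""
    stripped = line.strip()
    if not stripped or stripped[-1] not in ".?!":
        return False
    # Split on spaces, then drop punctuation around each word and lowercase it.
    words = [w.strip(string.punctuation).lower() for w in stripped.split(" ")]
    words = [w for w in words if w]
    return len(words) >= min_words and len(set(words)) == len(words)

lines = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]
valid = [n for n, line in enumerate(lines, start=1) if is_valid_sentence(line)]
print(valid)  # [1, 5]
```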

Regex - How do i find this specific slice of string inside a bigger whole string

following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now going to ask something more since the rule has been changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out in https://regexr.com/, I found the result instead :
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why don't just return the first matched occurrence ?".
Let's consider that if the value between the first "bar section" is empty, then it'll return the value of the next bar section.
Example :
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. Let it be just return nothing instead (nothing match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
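Building on the split idea, a field can also be looked up by name, which makes the empty-value case explicit. This is a minimal sketch assuming every record has exactly the three parts type|name|value; the helper name parse_fields and the shortened sample string are illustrative, not taken from the original answer.

```python
def parse_fields(raw):
    """Map each field name to its value; records look like 'type|name|value'."""
    fields = {}
    for record in raw.split("|#|"):
        parts = record.split("|")
        if len(parts) == 3:  # assumption: skip anything that is not type|name|value
            _type, name, value = parts
            fields[name] = value
    return fields

# Shortened sample built from records that appear in the question.
raw = ("text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|"
       "text|p1_2_1120170AS074192161A0Z20|Huawei|#|"
       "text|p1_8_1120170AS074192161A0Z20||#|"
       "text|site_id|20MJK110")

fields = parse_fields(raw)
print(fields["p1_1_1120170AS074192161A0Z20"])        # C M E - Rectifier
print(repr(fields["p1_8_1120170AS074192161A0Z20"]))  # ''
```

An empty value stays an empty string here instead of picking up the value of the next record.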
import re
doc = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall(r'[^|]+(?=\|#\|)', doc)
In the re expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (the text must follow, but it is not included in the result)
About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts what is on the left is p1_1_ followed by any char except a newline using a quantifier in the lookbehind.
As the pattern is not anchored, you will get every match that satisfies this logic, because the p1_1_ assertion holds for each of them: p1_1_ precedes all the |#| parts.
Note that using a quantifier in the lookbehind requires the PyPI regex module; the built-in re module only supports fixed-width lookbehind.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match | two times
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string

need regex expression to avoid " \n " character

I want to apply a regex to the string below in Python, where I only want to capture Model Number : 123. I tried the regex below, but it did not give me that result.
string = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'(?s)Model Number:.*?\n',string)
The output is as follows: Model Number : 123\n. How can I avoid the \n at the end of the output?
Remove the DOTALL (?s) inline modifier to avoid matching a newline char with ., add \s* after Number and use .* instead of .*?\n:
r'Model Number\s*:.*'
Here, Model Number will match a literal substring, \s* will match 0+ whitespaces, : will match a colon and .* will match 0 or more chars other than line break chars.
Python demo:
import re
s = """Model Number : 123
Serial Number : 456"""
model_number = re.findall(r'Model Number\s*:.*',s)
print(model_number) # => ['Model Number : 123']
If you need to extract just the number use
r'Model Number\s*:\s*(\d+)'
Here, (\d+) will capture 1 or more digits and re.findall will only return these digits. Or, use it with re.search and once the match data object is obtained, grab it with match.group(1).
NOTE: If the string appears at the start of the string, use re.match. Or add ^ at the start of the pattern and use re.M flag (or add (?m) at the start of the pattern).
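The re.search variant mentioned above might be written as follows. This is a small sketch that assumes the line starts with "Model Number", so it anchors the pattern with ^ and re.MULTILINE; the None fallback is an illustrative choice.

```python
import re

s = """Model Number : 123
Serial Number : 456"""

# ^ with re.MULTILINE anchors the match at the start of any line.
match = re.search(r"^Model Number\s*:\s*(\d+)", s, re.MULTILINE)

# Only read group 1 after confirming that there was a match.
model_number = match.group(1) if match else None
print(model_number)  # 123
```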
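You can also use the strip() method on the matched string (re.findall returns a list, so take an element first):
model_number[0].strip()
This removes leading and trailing whitespace, including the trailing newline.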

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
\(?<=-)\w+\g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex \(?!.*-)(?<=-)\w+\g
Result:
YIFY (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern searches for a hyphen, followed by 0+ whitespaces, and then captures 1+ digits, letters or underscores. re.findall only returns the texts captured with capturing groups, so you only get those Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).
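The repeated-group pattern can be wrapped in a small helper that takes n as a parameter and checks for a match before reading Group 1. This is a sketch; the function name nth_after_hyphen is hypothetical, and n is assumed to be a positive integer.

```python
import re

def nth_after_hyphen(text, n):
    """Return the n-th (1-based) word that follows a hyphen, or None."""
    # Build ^(?:.*?-\s*(\w+)){n}; Group 1 keeps the value of the last repetition.
    match = re.search(r"^(?:.*?-\s*(\w+)){%d}" % n, text)
    return match.group(1) if match else None

s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4"
print(nth_after_hyphen(s, 1))  # AUSVERSION
print(nth_after_hyphen(s, 3))  # TESTAGAIN
print(nth_after_hyphen(s, 5))  # None
```

Because a repeated capturing group only retains its last iteration, Group 1 holds the n-th value rather than the first.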
