Regex should fail if pattern is followed by another pattern - python

I need to detect #username mentions within a message, but NOT if it is in the form of #username[user_id]. I have a regex that can match the #username part, but am struggling to negate the match if it is followed by \[\d\].
import re
username_regex = re.compile(r'#([\w.#-]+[\w])')
usernames = username_regex.findall("Hello #kevin") # correctly finds kevin
usernames = username_regex.findall("Hello #kevin.") # correctly finds kevin
usernames = username_regex.findall("Hello #kevin[1].") # shouldn't find kevin but does
The regex allows usernames that contain #, . and -, but they must end with a \w character ([a-zA-Z0-9_]). How can I extend the regex so that it fails if the username is followed by the user ID in the [1] form?
I tried #([\w.#-]+[\w])(?!\[\d+\]) but then it matches kevi 🤔
I'm using Python 3.10.

You can "emulate" possessive matching with
#(?=([\w.#-]*\w))\1(?!\[\d+\])
See the regex demo.
Details:
# - a # char
(?=([\w.#-]*\w)) - a positive lookahead that matches and captures into Group 1 zero or more word, ., # and - chars, as many as possible, and then a word char immediately to the right of the current position (the text is not consumed, the regex engine index stays at the same location)
\1 - the text matched and captured in Group 1 (this consumes the text captured with the lookahead pattern, mind that backreferences are atomic by nature)
(?!\[\d+\]) - a negative lookahead that fails the match if there is [ + one or more digits + ] immediately to the right of the current location.
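A minimal sketch of how this pattern behaves on the three sample messages from the question. The helper name find_usernames is an assumption made for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer: the lookahead captures the whole username, and the
# backreference \1 then consumes exactly that text, so the engine cannot
# shorten the username to satisfy the negative lookahead.
username_regex = re.compile(r'#(?=([\w.#-]*\w))\1(?!\[\d+\])')

def find_usernames(message):
    """Hypothetical helper: return usernames that are not followed by [digits]."""
    # With a single capturing group, findall returns the Group 1 values.
    return username_regex.findall(message)

print(find_usernames("Hello #kevin"))      # ['kevin']
print(find_usernames("Hello #kevin."))     # ['kevin']
print(find_usernames("Hello #kevin[1]."))  # []
```

On Python 3.11 or later, an atomic group (?>...) should give a similar effect without the lookahead and backreference, but that is not available in the Python 3.10 used in the question.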

Related

regex match all after a string with positive lookbehind and input it behind every selection

copyright: hololive hololive_english
character: mori_calliope takanashi_kiara takanashi_kiara_(phoenix)
artist: xu_chin-wen
species:
meta: web
I want to select every word after a key such as character: so that I can put character: before every selection, like this:
character:mori_calliope character:takanashi_kiara character:takanashi_kiara_(phoenix)
The closest thing I got is
(?<=(\w*):\s*\S*\s.*)(?<=\s)(?=\S)
which works properly, but it breaks when there is a single entry (e.g. character: something) or when the value is empty.
I would be really thankful if someone could help.
You should install the PyPI regex module and use
regex.sub(r'(?<=(\w+):.*)(?<=\s)(?=\S)', r'\1:', text)
# or
# regex.sub(r'(?<=(\w+:).*)(?<=\s)(?=\S)', r'\1', text)
See the regex demo.
Details:
(?<=(\w+):.*) - a positive lookbehind that matches a location that is immediately preceded with any word (captured into Group 1) followed by a : char and then any zero or more chars other than line break chars as many as possible
(?<=\s) - a positive lookbehind that matches a location that is immediately preceded with a whitespace char
(?=\S) - a positive lookahead that matches a location that is immediately followed with a non-whitespace char.
See the Python demo:
import regex
text = "copyright: hololive hololive_english\ncharacter: mori_calliope takanashi_kiara takanashi_kiara_(phoenix)\nartist: xu_chin-wen\nspecies:\nmeta: web"
print( regex.sub(r'(?<=(\w+):.*)(?<=\s)(?=\S)', r'\1:', text) )
Output:
copyright: copyright:hololive copyright:hololive_english
character: character:mori_calliope character:takanashi_kiara character:takanashi_kiara_(phoenix)
artist: artist:xu_chin-wen
species:
meta: meta:web
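If installing the third-party regex module is not an option, a similar result can probably be obtained with the standard re module by handling each line in a replacement callback instead of a variable-width lookbehind. This is only a sketch, assuming every line has the form key: value value ...; the function name prefix_values is made up for illustration and is not part of the answer.

```python
import re

def prefix_values(text):
    """Hypothetical helper: prefix each value on a 'key: v1 v2' line with 'key:'."""
    def repl(match):
        key = match.group(1)
        values = match.group(2).split()
        if not values:
            # Empty entries such as 'species:' are left without any values.
            return f"{key}:"
        return f"{key}: " + " ".join(f"{key}:{value}" for value in values)

    # MULTILINE makes ^ and $ match at the start and end of every line.
    return re.sub(r'^(\w+):(.*)$', repl, text, flags=re.MULTILINE)

text = ("copyright: hololive hololive_english\n"
        "character: mori_calliope takanashi_kiara takanashi_kiara_(phoenix)\n"
        "artist: xu_chin-wen\nspecies:\nmeta: web")
print(prefix_values(text))
```

For the sample text this should print the same five lines as the regex.sub call above.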

Pulling out valid twitter names using re module in Python

1. Background info
I have a string which contains valid and invalid Twitter usernames, such as:
#moondra2017.org,#moondra,Python#moondra,#moondra_python
In the above string, #moondra and #moondra_python are valid usernames. The rest are not.
1.1 Goal
By using \b and/or \B as a part of regex pattern, I need to extract the valid usernames.
P.S. I must use \b and/or \B as part of the regex; that is part of the goal.
2. My Failed Attempt
import re
# (in)valid twitter user names
un1 = '#moondra2017.org' # invalid
un2 = '#moondra' # << valid, we want this
un3 = 'Python#moondra' # invalid
un4 = '#moondra_python' # << valid, we want this
string23 = f'{un1},{un2},{un3},{un4}'
pattern = re.compile(r'(?:\B#\w+\b(?:[,])|\B#\w+\b)') # ??
print('10:', re.findall(pattern, string23)) # line 10
2.1 Observed: The above code prints:
10: ['#moondra2017', '#moondra,', '#moondra_python'] # incorrect
2.2 Expected:
10: ['#moondra', '#moondra_python'] # correct
I will answer assuming that the mentions are always in the format as shown above, comma-separated.
Then, to match the end of a mention, you need to use a comma boundary, (?![^,]) or a less efficient but online tester friendly (?=,|$).
pattern = re.compile(r'\B#\w+\b(?![^,])')
pattern = re.compile(r'\B#\w+\b(?=,|$)')
See the regex demo and the Python demo
Details
\B - a non-word boundary, there must be start of string or a non-word char immediately to the left of the current location
# - a # char
\w+ - 1+ word chars (letters, digits or _)
\b - a word boundary (the next char should be a non-word char or end of string)
(?![^,]) - the next char cannot be a char different from , (so it should be , or end of string).
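As a quick check, both patterns can be run against the string from the question. This is a minimal sketch; the variable names comma_boundary and comma_or_end are chosen here for illustration only.

```python
import re

# The comma-separated mentions from the question
string23 = '#moondra2017.org,#moondra,Python#moondra,#moondra_python'

# Variant 1: the next char must not be anything other than a comma
comma_boundary = re.compile(r'\B#\w+\b(?![^,])')
# Variant 2: the next char must be a comma, or the string must end here
comma_or_end = re.compile(r'\B#\w+\b(?=,|$)')

print(comma_boundary.findall(string23))  # ['#moondra', '#moondra_python']
print(comma_or_end.findall(string23))    # ['#moondra', '#moondra_python']
```

'#moondra2017.org' is rejected because the char after the mention is a dot rather than a comma, and 'Python#moondra' is rejected by \B because a word char precedes the #.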

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the Python re library. I am testing on regex101, and it appears the regex above does quite a bit of backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-insensitive flag set.
Demo
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word breaks
) # end negative lookahead
(?: # begin non-cap grp
.* # match 0+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
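A minimal sketch of applying this regex to the sample text from the question, one line at a time. Checking each line separately is an assumption made here to avoid the multiline flag; the variable names are illustrative.

```python
import re

# Pattern from the answer, compiled with the case-insensitive flag
pattern = re.compile(
    r'^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$',
    re.IGNORECASE,
)

sample = """This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in."""

# Keep only the lines that satisfy all three conditions
matching = [line for line in sample.splitlines() if pattern.match(line)]
for line in matching:
    print(line)
# This is a sentence I would like to parse.
# Another sentence that I would be interested in.
```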
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the space character and checking word by word. Quicker, no sweat.
Edit: it can be done with a lookahead, as Cary said.
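The split-based check suggested in this answer could look roughly like the sketch below. The function name is_valid_sentence, the min_words parameter, and the choice to compare words case-insensitively after stripping punctuation are all assumptions, not something the answer specifies.

```python
import string

def is_valid_sentence(line, min_words=5):
    """Hypothetical helper: at least min_words distinct words, ending in . ? or !"""
    line = line.strip()
    if not line or line[-1] not in ".?!":
        return False
    # Split on whitespace, then drop surrounding punctuation and case
    words = [w.strip(string.punctuation).lower() for w in line.split()]
    words = [w for w in words if w]
    # Every word must be unique, and there must be enough of them
    return len(words) >= min_words and len(set(words)) == len(words)

print(is_valid_sentence("This is a sentence I would like to parse."))   # True
print(is_valid_sentence("This is too short."))                          # False
print(is_valid_sentence("Not not not distinct distinct words words."))  # False
```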

Regex - How do i find this specific slice of string inside a bigger whole string

Following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now asking a follow-up, since the rule has changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out on https://regexr.com/, I got this result instead:
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why not just return the first matched occurrence?"
Consider the case where the value in the first "bar section" is empty: the regex will then return the value of the next bar section.
Example:
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. It should return nothing instead (no match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
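Building on the split idea, the value of the p1_1_ field alone could be looked up as in the sketch below, returning None when that field is empty. The function name first_field_value, the key_prefix parameter, and the assumed text|key|value record layout are inferred from the sample data rather than stated anywhere in the question.

```python
def first_field_value(text, key_prefix="p1_1_"):
    """Hypothetical helper: value of the first record whose key starts with key_prefix."""
    for record in text.split("|#|"):
        parts = record.split("|")
        # Assumed record layout: ['text', key, value]
        if len(parts) >= 3 and parts[1].startswith(key_prefix):
            # An empty value yields None rather than the next record's value
            return parts[2] or None
    return None

filled = ('text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|'
          'text|p1_2_1120170AS074192161A0Z20|Huawei')
empty = ('text|p1_1_1120170AS074192161A0Z20||#|'
         'text|p1_2_1120170AS074192161A0Z20|Huawei')

print(first_field_value(filled))  # C M E - Rectifier
print(first_field_value(empty))   # None
```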
import re
doc = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
# Raw string, so that \| and \# reach the regex engine unchanged
re.findall(r'[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (it requires the text to follow, but does not include it in the result)
About the pattern you tried
The [^|]+ part of the pattern matches 1+ chars other than |.
Then (?=\|\#\|.*|$) asserts, using a positive lookahead, that what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts that what is on the left is p1_1_ followed by 0+ chars other than a newline, using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic, because the p1_1_ assertion is true for each of them: p1_1_ precedes all the |#| parts.
Note that using the quantifier in the lookbehind will require the pypi regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
Python demo
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
Python demo
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match | 2 times
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string
Regex demo
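A minimal sketch of using the capturing-group pattern with re.search. The doc string is a shortened, single-line version of the sample data, assumed here for illustration.

```python
import re

# Anchored pattern from the answer; the value ends up in capture group 1
pattern = re.compile(r'^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)')

# Shortened sample data (assumption: the real string continues in the same format)
doc = ('text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|'
       'text|p1_2_1120170AS074192161A0Z20|Huawei|#|'
       'text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW')

match = pattern.search(doc)
value = match.group(1) if match else None
print(value)  # C M E - Rectifier
```

Because the pattern is anchored with ^, re.search can return at most one match per string here.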

extract word and before word and insert between ”_” in regex

I need some help writing a regex in Python. I need to find a word together with the word before it and insert "_" between them.
Input:
s2 = 'Some other medical terms and stuff diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
# my regex pattern
re.sub(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}diagnosis", r"\1_", s2)
Desired Output:
s2 = 'Some other medical terms and stuff_diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
You have no capturing group defined in your regex, but are using \1 placeholder (replacement backreference) to refer to it.
You want to replace 1+ special chars other than - and ' before the word diagnosis, thus you may use
re.sub(r"[^\w'-]+(?=diagnosis)", "_", s2)
See this regex demo.
Details
[^\w'-]+ - 1+ chars other than word chars, ' and -
(?=diagnosis) - a positive lookahead that does not consume the text (does not add to the match value and thus re.sub does not remove this piece of text) but just requires diagnosis text to appear immediately to the right of the current location.
Or
re.sub(r"[^\w'-]+(diagnosis)", r"_\1", s2)
See this regex demo. Here, [^\w'-]+ also matches those special chars, but (diagnosis) is a capturing group whose text can be referred to using the \1 placeholder from the replacement pattern.
NOTE: If you want to make sure diagnosis is matched as a whole word, use \b around it, \bdiagnosis\b (mind the r raw string literal prefix!).
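A minimal sketch that runs the variants above on the input from the question. The variable names with_lookahead, with_group and whole_word are chosen here for illustration.

```python
import re

s2 = ('Some other medical terms and stuff diagnosis of R45.2 was entered '
      'for this patient. Where did Doctor Who go? Then xxx feea fdsfd')

# Lookahead variant: only the separator chars are matched and replaced
with_lookahead = re.sub(r"[^\w'-]+(?=diagnosis)", "_", s2)
# Capturing-group variant: diagnosis is consumed and put back via \1
with_group = re.sub(r"[^\w'-]+(diagnosis)", r"_\1", s2)
# Whole-word variant using \b inside a raw string
whole_word = re.sub(r"[^\w'-]+(?=\bdiagnosis\b)", "_", s2)

print(with_lookahead)
# Some other medical terms and stuff_diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd
```

For this input all three calls give the same string.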
