Regex: exception to negative character class - python

Using Python with Matthew Barnett's regex module.
I have this string:
The well known *H*rry P*tter*.
I'm using this regex to process the asterisks to obtain <em>H*rry P*tter</em>:
REG = re.compile(r"""
(?<!\p{L}|\p{N}|\\)
\*
([^\*]*?) # I need this part to deal with nested patterns; I really can't omit it
\*
(?!\p{L}|\p{N})
""", re.VERBOSE)
PROBLEM
The problem is that this regex doesn't match this kind of string unless I first protect intraword asterisks (I convert them to decimal entities), which is awfully expensive in documents with lots of asterisks.
QUESTION
Is it possible to tell the negative class to block at internal asterisks only if they are not surrounded by word characters?
I tried these patterns in vain:
([^(?:[^\p{L}|\p{N}]\*[^\p{L}|\p{N}])]*?)
([^(?<!\p{L}\p{N})\*(?!\p{L}\p{N})]*?)

I suggest a single regex replacement for the cases like you mentioned above:
re.sub(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B', r'<em>\1</em>', s)
See the regex demo
Details:
\B\*\b - a * that is preceded with a non-word boundary and followed with a word boundary
([^*]*(?:\b\*\b[^*]*)*) - Group 1 capturing:
    [^*]* - 0+ chars other than *
    (?:\b\*\b[^*]*)* - zero or more sequences of:
        \b\*\b - a * enclosed with word boundaries
        [^*]* - 0+ chars other than *
\b\*\B - a * that is preceded with a word boundary and followed with a non-word boundary
More information on word boundaries and non-word boundaries:
Word boundaries at regular-expressions.info
Difference between \b and \B in regex
What are non-word boundary in regex (\B), compared to word-boundary?
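As a quick check, the substitution above can be run with Python's built-in re module, since this pattern uses no \p{..} classes. This is only a minimal sketch: the names EM_RE, emphasize, sample and result are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above: the outer asterisks sit between a non-word
# and a word position, while inner asterisks are allowed only when they are
# enclosed by word characters on both sides.
EM_RE = re.compile(r'\B\*\b([^*]*(?:\b\*\b[^*]*)*)\b\*\B')

def emphasize(text):
    """Wrap *...* spans in <em> tags, keeping intraword asterisks intact."""
    return EM_RE.sub(r'<em>\1</em>', text)

sample = "The well known *H*rry P*tter*."
result = emphasize(sample)
print(result)  # The well known <em>H*rry P*tter</em>.
```

The same call should also work with the regex module, since it aims to be a drop-in replacement for re.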

Regex python ignore word followed by given character

I have the regex (?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9-_]+)(?!\w).
Given the string #first#nope #second#Hello #my-friend, email# whats.up#example.com #friend, what can I do to exclude the strings #first and #second, since they are not whole words on their own?
In other words, exclude them since they are followed by #.
You can use
(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])
(?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w])
See the regex demo. Details:
(?<![a-zA-Z0-9_.-]) - a negative lookbehind that matches a location that is not immediately preceded with an ASCII digit, a letter, _, . or -
# - a # char
(?=([A-Za-z]+[A-Za-z0-9_-]*)) - a positive lookahead with a capturing group inside that captures one or more ASCII letters and then zero or more ASCII letters, digits, - or _ chars
\1 - the Group 1 value (backreferences are atomic, no backtracking is allowed through them)
(?![#\w]) - a negative lookahead that fails the match if there is a word char (letter, digit or _) or a # char immediately to the right of the current location.
Note that I put hyphens at the end of the character classes; this is best practice.
The (?a)(?<![\w.-])#(?=([A-Za-z][\w-]*))\1(?![#\w]) alternative uses shorthand character classes and the (?a) inline modifier (the equivalent of re.ASCII / re.A), which makes \w match only ASCII chars (as in the original version). Remove (?a) if you plan to match any Unicode digits/letters.
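To see the first pattern at work on the string from the question, here is a small sketch using the standard re module. The names HASHTAG_RE, text and tags are illustrative assumptions.

```python
import re

# First pattern from the answer: the lookahead captures the tag name and
# the backreference \1 consumes it, so the engine cannot backtrack to a
# shorter name once the final lookahead fails.
HASHTAG_RE = re.compile(
    r'(?<![a-zA-Z0-9_.-])#(?=([A-Za-z]+[A-Za-z0-9_-]*))\1(?![#\w])'
)

text = "#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend"

# With a single capturing group, findall returns the Group 1 values.
tags = HASHTAG_RE.findall(text)
print(tags)  # ['my-friend', 'friend']
```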
Another option is to assert a whitespace boundary to the left, and assert no word char or # sign to the right.
(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])
The pattern matches:
(?<!\S) Negative lookbehind, assert not a non whitespace char to the left
# Match literally
([A-Za-z]+[\w-]+) Capture group1, match 1+ chars A-Za-z and then 1+ word chars or -
(?![#\w]) Negative lookahead, assert not # or word char to the right
Regex demo
Or match a non word boundary \B before the # instead of a lookbehind.
\B#([A-Za-z]+[\w-]+)(?![#\w])
Regex demo
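The two alternatives above give the same result on the question's string, but they are not interchangeable in general. The sketch below compares them; the "(#tag)" input is a made-up example, and the variable names are assumptions.

```python
import re

text = "#first#nope #second#Hello #my-friend, email# whats.up#example.com #friend"

WS_PATTERN = r'(?<!\S)#([A-Za-z]+[\w-]+)(?![#\w])'  # whitespace boundary on the left
NB_PATTERN = r'\B#([A-Za-z]+[\w-]+)(?![#\w])'       # non-word boundary on the left

ws_tags = re.findall(WS_PATTERN, text)
nb_tags = re.findall(NB_PATTERN, text)
print(ws_tags)  # ['my-friend', 'friend']
print(nb_tags)  # ['my-friend', 'friend']

# They differ when a non-word, non-whitespace char precedes the #:
# "(" is not whitespace, so (?<!\S) rejects it, but \B accepts it.
ws_paren = re.findall(WS_PATTERN, "(#tag)")
nb_paren = re.findall(NB_PATTERN, "(#tag)")
print(ws_paren)  # []
print(nb_paren)  # ['tag']
```

Note that both variants require at least two characters after the #, because [A-Za-z]+ and [\w-]+ each need one.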

word boundary \b doesn't work on string with dot in Python regex [duplicate]

For example: George R.R. Martin
I want to match only George and Martin.
I have tried: \w+\b. But it doesn't work!
The \w+\b. matches 1+ word chars that are followed with a word boundary, and then any char that is a non-word char (as \b restricts the following . subpattern). Note that this way is not negating anything and you miss an important thing: a literal dot in the regex pattern must be escaped.
You may use a negative lookahead (?!\.):
var s = "George R.R. Martin";
console.log(s.match(/\b\w+\b(?!\.)/g));
See the regex demo
Details:
\b - leading word boundary
\w+ - 1+ word chars
\b - trailing word boundary
(?!\.) - there must be no . after the last word char matched.
See more about how negative lookahead works here.
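Since the question is about Python while the snippet above is JavaScript, here is what an equivalent call might look like with Python's re module. The variable names are assumptions; the pattern is unchanged.

```python
import re

s = "George R.R. Martin"

# \b\w+\b matches a whole word; (?!\.) rejects it when a dot follows,
# which filters out the initials "R." and "R.".
names = re.findall(r'\b\w+\b(?!\.)', s)
print(names)  # ['George', 'Martin']
```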

Python regex: removing all special characters and numbers NOT attached to words

I am trying to remove all special characters and numbers in python, except numbers that are directly attached to words.
I have succeeded in doing this for all special characters and numbers, whether attached to words or not. How can I do it in such a way that numbers attached to words are not matched?
Here's what I did:
import regex as re
string = "win32 backdoor guid:64664646 DNS-lookup h0lla"
re.findall(r'[^\p{P}\p{S}\s\d]+', string.lower())
I get as output
win backdoor guid DNS lookup h lla
But I want to get:
win32 backdoor guid DNS lookup h0lla
demo: https://regex101.com/r/x4HrGo/1
To match alphanumeric strings or only letter words you may use the following pattern with re:
import re
# ...
re.findall(r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+', text.lower())
See the regex demo.
Details
(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]* - either 1+ letters followed with a digit, or 1+ digits followed with a letter, and then 0+ letters/digits
| - or
[^\W\d_]+ - any 1+ Unicode letters
NOTE It is equivalent to \d*[^\W\d_][^\W_]* pattern posted by PJProudhon, that matches any 1+ alphanumeric character chunks with at least 1 letter in them.
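Applied to the sample string from the question, the pattern can be tried like this. It is a minimal sketch: the names text, pattern and tokens are assumptions, and the input is lowercased as in the question's own code.

```python
import re

text = "win32 backdoor guid:64664646 DNS-lookup h0lla"

# Either a letter/digit mix containing at least one letter, or letters only.
pattern = r'(?:[^\W\d_]+\d|\d+[^\W\d_])[^\W_]*|[^\W\d_]+'

tokens = re.findall(pattern, text.lower())
print(tokens)  # ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']
```

The standalone number 64664646 is skipped because neither alternative can match a run of digits with no letter next to it.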
You could give a try to \b\d*[^\W\d_][^\W_]*\b
Decomposition:
\b # word boundary
\d* # zero or more digits
[^\W\d_] # one alphabetic character
[^\W_]* # zero or more alphanumeric characters
\b # word boundary
For beginners:
[^\W] is a typical double-negated construct: it matches any character that is not a non-word character, i.e. any alphanumeric character or _ (\W is the negation of \w, which matches any alphanumeric character plus _; a common ASCII equivalent of \w is [a-zA-Z0-9_]).
It proves useful here for composing narrower classes:
Any alphanumeric character = [^\W_] matches any character which is not non-[alphanumeric or _] and is not _.
Any alphabetic character = [^\W\d_] matches any character which is not non-[alphanumeric or _] and is not digit (\d) and is not _.
Some further reading here.
Edit:
When _ is also considered a word delimiter, just skip the word boundaries, which toggle on that character, and use \d*[^\W\d_][^\W_]*.
Default greediness of star operator will ensure all relevant characters are actually matched.
Demo.
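The difference between the two variants shows up as soon as an underscore is involved. The sketch below runs both; the "foo_bar" input and the variable names are made-up assumptions for illustration.

```python
import re

BOUNDED = r'\b\d*[^\W\d_][^\W_]*\b'  # with word boundaries
UNBOUNDED = r'\d*[^\W\d_][^\W_]*'    # without word boundaries

text = "win32 backdoor guid:64664646 dns-lookup h0lla"
bounded_tokens = re.findall(BOUNDED, text)
unbounded_tokens = re.findall(UNBOUNDED, text)
print(bounded_tokens)    # ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']
print(unbounded_tokens)  # ['win32', 'backdoor', 'guid', 'dns', 'lookup', 'h0lla']

# _ is a word character, so there is no \b between "foo" and "_":
# the bounded pattern finds nothing, the unbounded one splits on _.
snake = "foo_bar"
bounded_snake = re.findall(BOUNDED, snake)
unbounded_snake = re.findall(UNBOUNDED, snake)
print(bounded_snake)     # []
print(unbounded_snake)   # ['foo', 'bar']
```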
Try this RegEx instead:
([A-Za-z]+(\d)*[A-Za-z]*)
You can expand it from here, for example by flipping the * and + on the first and last sets to capture strings like "win32" and "01ex" equally.

Invalid pattern in look-behind

Why does this regex work in Python but not in Ruby:
/(?<!([0-1\b][0-9]|[2][0-3]))/
Would be great to hear an explanation and also how to get around it in Ruby
EDIT w/ the whole line of code:
re.sub(r'(?<!([0-1\b][0-9]|[2][0-3])):(?!([0-5][0-9])((?i)(am)|(pm)|(a\.m)|(p\.m)|(a\.m\.)|(p\.m\.))?\b)' , ':\n' , s)
Basically, I'm trying to add '\n' when there is a colon and it is not a time.
Ruby's regex engine doesn't allow capturing groups in negative lookbehinds.
If you need grouping, you can use a non-capturing group (?:):
[8] pry(main)> /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
SyntaxError: (eval):2: invalid pattern in look-behind: /(?<!(:?[0-1\b][0-9]|[2][0-3]))/
[8] pry(main)> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
=> /(?<!(?:[0-1\b][0-9]|[2][0-3]))/
Docs:
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed-width.
But top-level alternatives can be of various lengths.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
In negative look-behind, capturing group isn't allowed,
but non-capturing group (?:) is allowed.
Learned from this answer.
According to the Onigmo regex documentation, capturing groups are not supported in negative lookbehinds. Although it is common among regex engines, not all of them count it as an error, hence the difference you see between the re and Onigmo regex libraries.
Now, as for your regex, it works correctly neither in Ruby nor in Python: the \b inside a character class in a Python and Ruby regex matches a BACKSPACE (\x08) char, not a word boundary. Moreover, when you use a word boundary after an optional non-word char, if that char appears in the string, a word char must appear immediately to the right of it. The word boundary must be moved to right after m, before \.?.
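Both points can be checked quickly in Python. This is a small sketch with made-up inputs; the variable names are assumptions.

```python
import re

# 1. Inside a character class, \b means BACKSPACE (\x08), not a word boundary.
backspace_match = re.fullmatch(r'[\b]', '\x08')
space_digit = re.search(r'[0-1\b][0-9]', ' 9')
print(backspace_match is not None)  # True
print(space_digit)                  # None

# 2. A word boundary placed after an optional dot cannot hold between
#    "." and " ", so the engine backtracks and drops the final dot.
dot_then_boundary = re.search(r'p\.m\.?\b', '5 p.m. today')
boundary_then_dot = re.search(r'p\.?m\b\.?', '5 p.m. today')
print(dot_then_boundary.group())  # p.m
print(boundary_then_dot.group())  # p.m.
```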
Another flaw with the current approach is that lookbehinds are not the best way to exclude certain contexts like here. E.g. you can't account for a variable amount of whitespace between the time digits and am / pm. It is better to match the contexts you do not want to touch and match and capture those you want to modify. So, we need two main alternatives here, one matching am/pm in time strings and another matching them in all other contexts.
Your pattern also has too many alternatives that can be merged using character classes and ? quantifiers.
Regex demo
\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - the first alternative, am/pm inside a time string:
\b - word boundary
((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?) - capturing group 1:
(?:[01]?[0-9]|2[0-3]) - an optional 0 or 1 and then any digit or 2 and then a digit from 0 to 3
:[0-5][0-9] - : and then a number from 00 to 59
\s* - 0+ whitespaces
[pa]\.?m\b\.? - a or p, an optional dot, m, a word boundary, an optional dot
| - or
\b[ap]\.?m\b\.? - word boundary, a or p, an optional dot, m, a word boundary, an optional dot
Python fixed solution:
import re
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = r'\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?'
result = re.sub(rx, lambda x: x.group(1) if x.group(1) else "\n", text, flags=re.I)
Ruby solution:
text = 'am pm P.M. 10:56pm 10:43 a.m.'
rx = /\b((?:[01]?[0-9]|2[0-3]):[0-5][0-9]\s*[pa]\.?m\b\.?)|\b[ap]\.?m\b\.?/i
result = text.gsub(rx) { $1 || "\n" }
Output:
"\n \n \n 10:56pm 10:43 a.m."
#mrzasa has certainly found the problem.
But...
Taking a guess at your intent, which is to replace a non-time colon with ':\n',
it could be done like this. It does a little whitespace trimming as well.
(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])([^\S\r\n]*:)[^\S\r\n]*(?![0-5][0-9](?:[ap]\.?m\b\.?)?)
PCRE - https://regex101.com/r/7TxbAJ/1 Replace $1\n
Python - https://regex101.com/r/w0oqdZ/1 Replace \1\n
Readable version
(?i)
(?<!
    \b [01] [0-9]
)
(?<!
    \b [2] [0-3]
)
(                             # (1 start)
    [^\S\r\n]*
    :
)                             # (1 end)
[^\S\r\n]*
(?!
    [0-5] [0-9]
    (?: [ap] \.? m \b \.? )?
)
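For reference, the one-line pattern above could be used from Python roughly as follows. This is a sketch on a made-up input; COLON_RE, text and result are assumed names.

```python
import re

# One-line pattern from above, split across string literals for readability.
# The (?i) flag stays at the very start of the expression.
COLON_RE = re.compile(
    r'(?i)(?<!\b[01][0-9])(?<!\b[2][0-3])'
    r'([^\S\r\n]*:)[^\S\r\n]*'
    r'(?![0-5][0-9](?:[ap]\.?m\b\.?)?)'
)

text = "Note: call at 10:30 pm. Agenda: items"

# Group 1 holds the colon (plus any blanks before it); the blanks after it
# are consumed and replaced by the newline.
result = COLON_RE.sub(r'\1\n', text)
print(repr(result))  # 'Note:\ncall at 10:30 pm. Agenda:\nitems'
```

The colon in 10:30 is left alone because the first lookbehind sees the two-digit hour in front of it.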

extract string using regular expression

fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'(Ubuntu)\b(\d+[.]\d+)\b')
fix_release = p.search(fix_release)
logger.info(fix_release) #fix_release is None
I want to extract the string 'Ubuntu 16.04'
But, result is None.... How can I extract the correct sentence?
You confused the word boundary \b with whitespace. \b matches the boundary between a word character and a non-word character and consumes no characters, so it cannot stand in for the space. You can simply use r'Ubuntu \d+\.\d+' for your case:
import re

fix_release='Ubuntu 16.04 LTS'
p = re.compile(r'Ubuntu \d+\.\d+')
p.search(fix_release).group(0)
# 'Ubuntu 16.04'
Try this Regex:
Ubuntu\s*\d+(?:\.\d+)?
Click for Demo
Explanation:
Ubuntu - matches Ubuntu literally
\s* - matches 0+ occurrences of a white-space, as many as possible
\d+ - matches 1+ digits, as many as possible
(?:\.\d+)? - matches a . followed by 1+ digits, as many as possible. A ? at the end makes this part optional.
Note: In your regex, you are using \b for the spaces. \b returns zero-length matches between a word character and a non-word character. You can use \s instead.
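The difference between \b and \s can be seen side by side in the sketch below. The variant with \s+ and two capture groups is an assumption on my part: it keeps the groups from the question's pattern and is not one of the patterns posted above.

```python
import re

release = 'Ubuntu 16.04 LTS'

# \b is zero-width, so nothing consumes the space between the two tokens.
no_match = re.search(r'(Ubuntu)\b(\d+[.]\d+)\b', release)

# Matching the whitespace explicitly lets both groups from the question match.
m = re.search(r'(Ubuntu)\s+(\d+\.\d+)\b', release)

print(no_match)    # None
print(m.group(0))  # Ubuntu 16.04
print(m.groups())  # ('Ubuntu', '16.04')
```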
